[Objective] This paper proposes a method for calculating the similarity of scientific and technical documents that combines text and formula information, aiming to improve on the performance of traditional methods. [Methods] First, we mapped the feature elements of each formula to a position vector, which we used to calculate the similarity between individual formulas. Second, we computed the formula coverage and formula similarity between documents. Finally, we calculated the similarity of scientific and technical documents by combining the text and formula information. [Results] We compared the classification results of the new method with those of traditional methods and found that the macro-averaged F-score increased by 6.7% with the new method. [Limitations] The test sets do not include the formula information of documents and need to be expanded. [Conclusions] The new method calculates document similarity more accurately.
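The three steps named in the Methods can be illustrated with a minimal Python sketch. The abstract does not give the actual formulas, so everything specific here is an assumption made for illustration: cosine similarity as the vector comparison, the `threshold` used to decide whether a formula counts as covered, the product of coverage and mean best-match similarity as the document-level formula score, and the linear weight `alpha` used to combine the text and formula scores. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def formula_similarity(formulas_a, formulas_b, threshold=0.8):
    """Assumed document-level formula score: coverage * mean best-match similarity.

    Each formula is represented by a position vector (a list of numbers).
    `threshold` is a hypothetical cut-off for counting a formula as covered.
    """
    if not formulas_a or not formulas_b:
        return 0.0
    # Best match in document B for every formula of document A.
    best = [max(cosine(f, g) for g in formulas_b) for f in formulas_a]
    coverage = sum(1 for s in best if s >= threshold) / len(best)
    mean_best = sum(best) / len(best)
    return coverage * mean_best


def document_similarity(text_a, text_b, formulas_a, formulas_b, alpha=0.5):
    """Assumed linear combination of text similarity and formula similarity."""
    text_sim = cosine(text_a, text_b)
    form_sim = formula_similarity(formulas_a, formulas_b)
    return alpha * text_sim + (1 - alpha) * form_sim


if __name__ == "__main__":
    # Toy term-weight vectors and formula position vectors (invented values).
    doc_a_text, doc_b_text = [1.0, 2.0, 0.0], [1.0, 1.0, 1.0]
    doc_a_formulas = [[1, 0, 2], [0, 1, 1]]
    doc_b_formulas = [[1, 0, 2], [3, 0, 0]]
    print(document_similarity(doc_a_text, doc_b_text, doc_a_formulas, doc_b_formulas))
```

With `alpha = 1` the score reduces to plain text similarity, which gives a simple baseline to compare against when tuning the weight.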
Xu Jianmin, Xu Caiyun. Computing Similarity of Sci-Tech Documents Based on Texts and Formulas[J]. Data Analysis and Knowledge Discovery, 2018, 2(10): 103-109.