Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (10): 103-109    DOI: 10.11925/infotech.2096-3467.2018.0211
Computing Similarity of Sci-Tech Documents Based on Texts and Formulas
Xu Jianmin, Xu Caiyun
School of Cyber Security and Computer, Hebei University, Baoding 071002, China
Abstract  

[Objective] This paper proposes a method for calculating the similarity of scientific and technical documents that combines text and formula information, aiming to improve on the performance of traditional methods. [Methods] First, we mapped the feature elements of each formula to a position vector, which we used to calculate the similarity between individual formulas. Second, we computed the formula coverage and formula similarity between documents. Finally, we calculated the similarity of scientific and technical documents by combining the text and formula information. [Results] We compared the classification results of the new method with those of the traditional methods and found that the macro-average F-score of the new method increased by 6.7%. [Limitations] The existing test sets do not include the formula information of documents and need to be expanded. [Conclusions] The new method could calculate document similarity more accurately.
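
The abstract names three steps (position vectors for single formulas, formula coverage and similarity between documents, and a combination with text similarity) but does not give the formulas. The following Python sketch shows one way such a pipeline could be wired together. Every definition in it is an assumption made for illustration, not the authors' method: the (symbol, position) encoding in `position_vector`, the coverage `threshold`, the product of coverage and mean best-match similarity, and the linear weight `alpha` are all hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def position_vector(tokens):
    """Hypothetical encoding: one feature per (symbol, position) pair."""
    return {(tok, i): 1.0 for i, tok in enumerate(tokens)}


def formula_similarity(f1, f2):
    """Similarity of two formulas given as lists of symbols."""
    return cosine(position_vector(f1), position_vector(f2))


def document_formula_similarity(formulas_a, formulas_b, threshold=0.5):
    """Assumed definition: coverage times mean best-match similarity."""
    if not formulas_a or not formulas_b:
        return 0.0
    # Best match in document B for every formula of document A.
    best = [max(formula_similarity(f, g) for g in formulas_b) for f in formulas_a]
    covered = [s for s in best if s >= threshold]
    if not covered:
        return 0.0
    coverage = len(covered) / len(formulas_a)
    return coverage * sum(covered) / len(covered)


def combined_similarity(text_a, text_b, formulas_a, formulas_b, alpha=0.5):
    """Assumed linear combination of text and formula similarity."""
    text_sim = cosine(Counter(text_a.split()), Counter(text_b.split()))
    formula_sim = document_formula_similarity(formulas_a, formulas_b)
    return alpha * text_sim + (1 - alpha) * formula_sim
```

With these assumptions, two documents with identical text and identical formulas score close to 1, while documents with identical text but no matching formula score `alpha`. This illustrates how formula information can separate documents that a text-only model treats as the same.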

Key words: Formula Similarity; Document Similarity; Coverage Degree; Scientific and Technical Documents
Received: 26 February 2018      Published: 12 November 2018
ZTFLH:  G202 TP391  

Cite this article:

Xu Jianmin,Xu Caiyun. Computing Similarity of Sci-Tech Documents Based on Texts and Formulas. Data Analysis and Knowledge Discovery, 2018, 2(10): 103-109.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.0211     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2018/V2/I10/103

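Dataset composition | Bayesian retrieval (60 documents) | Personalized recommendation (60 documents) | Face recognition (60 documents) | User influence (60 documents) | Text classification (60 documents)
Baseline document | 1 | 1 | 1 | 1 | 1
Naturally similar | 17 | 11 | 11 | 11 | 11
Back-to-back modification | 24 | 48 | 48 | 48 | 48
Chinese-English translation | 18 | 0 | 0 | 0 | 0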
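Topic | Vector space P | Vector space R | Vector space F | Text and formula P | Text and formula R | Text and formula F
Bayesian retrieval | 1 | 0.25 | 0.4 | 1 | 0.71 | 0.83
Personalized recommendation | 0.88 | 0.96 | 0.92 | 0.96 | 0.92 | 0.94
Face recognition | 0.89 | 1 | 0.94 | 0.83 | 1 | 0.91
User influence | 0.85 | 0.92 | 0.88 | 0.83 | 1 | 0.91
Text classification | 0.6 | 0.88 | 0.71 | 0.86 | 0.79 | 0.83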
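
The per-topic precision and recall values above can be turned into a macro-average F-score. The Python sketch below assumes that the macro-average F is the harmonic mean of the macro-averaged precision and the macro-averaged recall; the page does not state which definition the authors used. The topic names are English renderings of the table rows, and the variable names are illustrative.

```python
# (precision, recall) per topic, copied from the results table above.
vector_space = {
    "Bayesian retrieval": (1.00, 0.25),
    "Personalized recommendation": (0.88, 0.96),
    "Face recognition": (0.89, 1.00),
    "User influence": (0.85, 0.92),
    "Text classification": (0.60, 0.88),
}
text_and_formula = {
    "Bayesian retrieval": (1.00, 0.71),
    "Personalized recommendation": (0.96, 0.92),
    "Face recognition": (0.83, 1.00),
    "User influence": (0.83, 1.00),
    "Text classification": (0.86, 0.79),
}


def macro_f(scores):
    """Harmonic mean of macro-precision and macro-recall (assumed definition)."""
    macro_p = sum(p for p, _ in scores.values()) / len(scores)
    macro_r = sum(r for _, r in scores.values()) / len(scores)
    return 2 * macro_p * macro_r / (macro_p + macro_r)


baseline = macro_f(vector_space)
proposed = macro_f(text_and_formula)
print(f"vector space:     {baseline:.3f}")
print(f"text and formula: {proposed:.3f}")
print(f"difference:       {proposed - baseline:.3f}")
```

Under this definition the scores are about 0.822 and 0.890, a gap of roughly 0.067. This agrees with the 6.7% improvement reported in the abstract if that figure is read as a difference in percentage points. Averaging the per-topic F values instead gives 0.77 and 0.884, which does not reproduce the reported figure.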