Data Analysis and Knowledge Discovery, 2018, Vol. 2, Issue 10: 103-109. DOI: 10.11925/infotech.2096-3467.2018.0211
Research Paper
Computing Similarity of Sci-Tech Documents Based on Texts and Formulas
Jianmin Xu, Caiyun Xu
School of Cyber Security and Computer, Hebei University, Baoding 071002, China
Abstract

[Objective] Computing the similarity of scientific and technical documents from text alone has known shortcomings. This paper proposes a method that combines text and formula information. [Methods] The feature elements of each formula are mapped to a position vector, from which the similarity between individual formulas is computed. Formula coverage and formula similarity between documents are then computed. Finally, text and formula information are combined to obtain the document similarity. [Results] Compared with the traditional vector space method on a classification task, the proposed method improves the macro-averaged F-score by up to 6.7%. [Limitations] No public test collection includes the formula information of documents, and the self-built dataset is small. [Conclusions] Incorporating formula information not only improves the accuracy of document similarity computation but also makes it possible to compute the similarity of documents written in different languages.

Keywords: Formula Similarity; Document Similarity; Coverage Degree; Scientific and Technical Documents
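The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the paper's exact formulation: formulas are represented as token lists, single-formula similarity is derived from the positions of shared tokens, coverage is the share of formulas in one document that have a sufficiently similar counterpart in the other (the `threshold` value is hypothetical), and the final score is a linear combination weighted by a parameter named `alpha` here.

```python
import math
from collections import Counter
from typing import List, Sequence


def formula_similarity(f1: Sequence[str], f2: Sequence[str]) -> float:
    """Similarity of two tokenized formulas in [0, 1] (assumed, illustrative form).

    Shared tokens are mapped to their first position in each formula; the score
    rewards a large shared vocabulary and penalizes position displacement.
    """
    shared = [t for t in dict.fromkeys(f1) if t in f2]
    if not shared:
        return 0.0
    pos1 = [f1.index(t) for t in shared]  # position vector in formula 1
    pos2 = [f2.index(t) for t in shared]  # position vector in formula 2
    longest = max(len(f1), len(f2))
    # Mean displacement of shared tokens, normalized by the longer formula
    displacement = sum(abs(a - b) for a, b in zip(pos1, pos2)) / (len(shared) * longest)
    overlap = len(shared) / len(set(f1) | set(f2))
    return overlap * (1.0 - displacement)


def document_formula_similarity(
    doc_a: List[Sequence[str]], doc_b: List[Sequence[str]], threshold: float = 0.8
) -> float:
    """Coverage-weighted mean of best-match similarities (assumed combination)."""
    if not doc_a or not doc_b:
        return 0.0
    best = [max(formula_similarity(fa, fb) for fb in doc_b) for fa in doc_a]
    covered = [s for s in best if s >= threshold]
    if not covered:
        return 0.0
    coverage = len(covered) / len(doc_a)  # share of doc_a formulas matched in doc_b
    return coverage * (sum(covered) / len(covered))


def cosine_similarity(tokens_a: Sequence[str], tokens_b: Sequence[str]) -> float:
    """Term-frequency cosine similarity, standing in for the vector space baseline."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def combined_similarity(text_sim: float, formula_sim: float, alpha: float = 0.5) -> float:
    """Linear combination of text and formula similarity; alpha weights the text part."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * text_sim + (1.0 - alpha) * formula_sim
```

Under these assumptions, two documents that share no vocabulary (for example, a Chinese paper and its English translation) but contain the same formulas still receive a non-zero combined score, which is the intuition behind the cross-language claim in the abstract.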
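Received: 2018-02-26
Funding: This work is an outcome of the Natural Science Foundation of Hebei Province project "Research on Topic Identification and Tracking Methods Based on Bayesian Networks" (No. 2015201142) and the National Social Science Fund of China late-stage funding project "Extension of the Bayesian Network Retrieval Model Based on Term Relationships" (No. 17FTQ002).
Cite this article:
Jianmin Xu, Caiyun Xu. Computing Similarity of Sci-Tech Documents Based on Texts and Formulas[J]. Data Analysis and Knowledge Discovery, 2018, 2(10): 103-109. DOI: 10.11925/infotech.2096-3467.2018.0211.
Link to this article: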
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.0211
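Figure 1  Mapping process for the feature elements of formula $f_{ik}$
Figure 2  Mapping process for the feature elements of formula $f_{jl}$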
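| Dataset composition | Bayesian retrieval (60 documents) | Personalized recommendation (60 documents) | Face recognition (60 documents) | User influence (60 documents) | Text classification (60 documents) |
|---|---|---|---|---|---|
| Base document | 1 | 1 | 1 | 1 | 1 |
| Naturally similar | 17 | 11 | 11 | 11 | 11 |
| Back-to-back (independently) modified | 24 | 48 | 48 | 48 | 48 |
| Chinese-English translation | 18 | 0 | 0 | 0 | 0 |

Table 1  Dataset statistics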
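Figure 3  Effect of different values of K on the classification results
Figure 4  Effect of different values of $\alpha$ on the overall classification performance on the test set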
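| Topic | Vector space: P | Vector space: R | Vector space: F | Text and formula: P | Text and formula: R | Text and formula: F |
|---|---|---|---|---|---|---|
| Bayesian retrieval | 1 | 0.25 | 0.4 | 1 | 0.71 | 0.83 |
| Personalized recommendation | 0.88 | 0.96 | 0.92 | 0.96 | 0.92 | 0.94 |
| Face recognition | 0.89 | 1 | 0.94 | 0.83 | 1 | 0.91 |
| User influence | 0.85 | 0.92 | 0.88 | 0.83 | 1 | 0.91 |
| Text classification | 0.6 | 0.88 | 0.71 | 0.86 | 0.79 | 0.83 |

Table 2  Precision (P), recall (R), and F-score (F) of the two methods by topic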
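The macro-averaged F-score mentioned in the abstract can be recomputed from the per-topic precision and recall values in the table above. The sketch below assumes that macro-F is the harmonic mean of macro-averaged precision and macro-averaged recall; this is one common definition, and averaging the per-topic F values instead gives a different number. The dictionary and function names are illustrative.

```python
from statistics import mean

# (precision, recall) per topic, copied from the table above
VSM = {
    "Bayesian retrieval": (1.0, 0.25),
    "Personalized recommendation": (0.88, 0.96),
    "Face recognition": (0.89, 1.0),
    "User influence": (0.85, 0.92),
    "Text classification": (0.6, 0.88),
}
TEXT_FORMULA = {
    "Bayesian retrieval": (1.0, 0.71),
    "Personalized recommendation": (0.96, 0.92),
    "Face recognition": (0.83, 1.0),
    "User influence": (0.83, 1.0),
    "Text classification": (0.86, 0.79),
}


def macro_f(scores):
    """Harmonic mean of macro-averaged precision and recall (assumed definition)."""
    macro_p = mean(p for p, _ in scores.values())
    macro_r = mean(r for _, r in scores.values())
    return 2 * macro_p * macro_r / (macro_p + macro_r)


gain = macro_f(TEXT_FORMULA) - macro_f(VSM)
print(f"VSM: {macro_f(VSM):.3f}  text+formula: {macro_f(TEXT_FORMULA):.3f}  gain: {gain:.3f}")
```

Under this assumption the gain comes out at roughly 0.067, which is in line with the 6.7% improvement reported in the abstract.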
Figure 5  Comparison of the classification performance of the two methods by topic