Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (10): 103-109    DOI: 10.11925/infotech.2096-3467.2018.0211
Computing Similarity of Sci-Tech Documents Based on Texts and Formulas
Jianmin Xu, Caiyun Xu
School of Cyber Security and Computer, Hebei University, Baoding 071002, China
[Objective] This paper proposes a method for calculating the similarity of scientific and technical documents that combines text and formula information, aiming to improve on the performance of traditional methods. [Methods] First, we mapped the feature elements of a single formula to a position vector, which we used to calculate the similarity between individual formulas. Second, we computed the formula coverage and formula similarity between documents. Finally, we calculated the similarity of scientific and technical documents by combining text and formula information. [Results] We compared the classification results of the new method with those of traditional ones and found that the macro-average F-score of the new method increased by 6.7%. [Limitations] The test sets do not include formula information for the documents and need to be expanded. [Conclusions] The new method can calculate document similarity more accurately.
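
The three steps in [Methods] can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the position-vector encoding, the coverage definition, or the combination rule, so everything below is an assumption made for illustration. Formulas are taken to be already encoded as numeric vectors, single-formula similarity is assumed to be cosine similarity, coverage is assumed to be the share of formulas in one document whose best match in the other reaches a hypothetical `threshold`, and the final score is assumed to be a weighted sum controlled by a hypothetical weight `alpha`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def formula_coverage_and_similarity(formulas_a, formulas_b, threshold=0.8):
    """Return (coverage, similarity) of document A's formulas against document B's.

    Assumed definitions (not taken from the paper):
      coverage   = fraction of A's formulas whose best match in B scores >= threshold
      similarity = mean best-match score over A's formulas
    """
    if not formulas_a or not formulas_b:
        return 0.0, 0.0
    best = [max(cosine(fa, fb) for fb in formulas_b) for fa in formulas_a]
    coverage = sum(1 for s in best if s >= threshold) / len(best)
    similarity = sum(best) / len(best)
    return coverage, similarity


def document_similarity(text_a, text_b, formulas_a, formulas_b, alpha=0.7):
    """Combine text and formula evidence with an assumed weighted sum.

    text_a / text_b are term-weight vectors over a shared vocabulary;
    alpha is a hypothetical weight on the text component.
    """
    text_sim = cosine(text_a, text_b)
    coverage, formula_sim = formula_coverage_and_similarity(formulas_a, formulas_b)
    return alpha * text_sim + (1 - alpha) * coverage * formula_sim
```

Under these assumptions, two documents with no formulas fall back to a score driven by text similarity alone, while documents that share formulas gain up to `1 - alpha` of additional score. Any real use would need to tune `alpha` and `threshold` on labelled data.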

Key words: Formula Similarity; Document Similarity; Coverage Degree; Scientific and Technical Documents
Received: 26 February 2018      Published: 12 November 2018

Cite this article:

Jianmin Xu, Caiyun Xu. Computing Similarity of Sci-Tech Documents Based on Texts and Formulas. Data Analysis and Knowledge Discovery, 2018, 2(10): 103-109.

