An Improved Cosine Text Similarity Computing Method Based on Semantic Chunk Feature
Bai Rujiang1, Leng Fuhai2, Liao Junhua1
1Institute of Scientific and Technical Information, Shandong University of Technology, Zibo 255049, China
2Institute of Policy and Management, Chinese Academy of Sciences, Beijing 100190, China
[Objective] This paper aims to improve the performance of the cosine text similarity method by using semantic chunk features of the text. [Methods] First, we retrieved project data on carbon nanotube research and pre-processed it with stemming and part-of-speech (POS) tagging. Second, we identified the semantic chunks of the texts with a conditional random field (CRF) model. Third, we calculated text similarity based on the semantic chunk features. Finally, we compared our results with those obtained from the unlabeled data. [Results] The proposed method improved the performance of cosine similarity calculation by up to 26%. [Limitations] The performance of the proposed method depends on the quality of the semantic chunk annotation. [Conclusions] The proposed method effectively identifies similar texts and reduces the dimensionality of the vector space model, which improves computing efficiency. The method is robust and could be transferred to other fields.
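The similarity step described in the abstract can be illustrated with a minimal sketch. It assumes each text has already been segmented into labeled semantic chunks (for example, by a CRF tagger) and uses raw chunk counts as vector weights. The chunk labels (`MATERIAL`, `METHOD`, `PROPERTY`), the example documents, and the function names are hypothetical, and the weighting scheme used in the paper itself may differ.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    # Only dimensions present in both vectors contribute to the dot product.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def chunk_vector(chunks):
    """Count vector whose dimensions are (label, phrase) pairs.

    `chunks` is a list of (label, phrase) tuples; the labels are
    hypothetical stand-ins for the output of a semantic chunk tagger.
    """
    return Counter((label, phrase.lower()) for label, phrase in chunks)


# Hypothetical, hand-labeled chunks for two project descriptions.
doc_a = [
    ("MATERIAL", "carbon nanotube"),
    ("METHOD", "chemical vapor deposition"),
    ("PROPERTY", "thermal conductivity"),
]
doc_b = [
    ("MATERIAL", "carbon nanotube"),
    ("METHOD", "arc discharge"),
    ("PROPERTY", "thermal conductivity"),
]

similarity = cosine(chunk_vector(doc_a), chunk_vector(doc_b))
print(f"chunk-based cosine similarity: {similarity:.4f}")
```

Because every multi-word chunk becomes a single dimension, the chunk vector has fewer dimensions than a word-level vector of the same text, which is consistent with the dimensionality reduction the abstract reports.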
白如江, 冷伏海, 廖君华. 一种基于语义组块特征的改进Cosine文本相似度计算方法*[J]. 数据分析与知识发现, 2017, 1(6): 56-64.
Bai Rujiang, Leng Fuhai, Liao Junhua. An Improved Cosine Text Similarity Computing Method Based on Semantic Chunk Feature. Data Analysis and Knowledge Discovery, 2017, 1(6): 56-64.