An Improved Cosine Text Similarity Computing Method Based on Semantic Chunk Feature
Bai Rujiang1, Leng Fuhai2, Liao Junhua1
1Institute of Scientific and Technical Information, Shandong University of Technology, Zibo 255049, China
2Institute of Policy and Management, Chinese Academy of Sciences, Beijing 100190, China

Abstract [Objective] This paper aims to improve the performance of the cosine text similarity method by using semantic chunk features of text. [Methods] First, we retrieved project data on carbon nanotube research and pre-processed it with stemming and part-of-speech (POS) tagging. Second, we identified the semantic chunks in the text with a conditional random field (CRF) model. Third, we calculated text similarity based on the semantic chunk features. Finally, we compared our results with those obtained from the unlabeled data. [Results] The proposed method improved the performance of cosine similarity calculation by up to 26%. [Limitations] The computing performance depends on the quality of the semantic chunk annotation. [Conclusions] The proposed method can effectively identify similar texts and reduces the dimensionality of the vector space model, which improves computing efficiency. The method is robust and could be transferred to other fields.
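
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the semantic chunk labels have already been produced (for example by a CRF tagger), and the label names `METHOD`, `MATERIAL`, and `O` (outside any chunk), the helper names, and the two toy documents are all hypothetical. The sketch compares plain bag-of-words cosine similarity with cosine similarity restricted to chunk-labeled terms, which also shrinks the vector space.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def plain_vector(tagged_tokens):
    """Baseline: count every token and ignore the chunk labels."""
    return Counter(tok for tok, _ in tagged_tokens)


def chunk_vector(tagged_tokens):
    """Keep only tokens inside a semantic chunk (label != "O").

    Each feature is a (label, token) pair, so the same word under
    different chunk labels counts as a different dimension.
    """
    return Counter((label, tok) for tok, label in tagged_tokens if label != "O")


# Hypothetical, already stemmed and chunk-labeled toy documents.
doc_a = [("we", "O"), ("grow", "METHOD"), ("carbon", "MATERIAL"),
         ("nanotube", "MATERIAL"), ("film", "O")]
doc_b = [("we", "O"), ("test", "METHOD"), ("carbon", "MATERIAL"),
         ("nanotube", "MATERIAL"), ("fiber", "O")]

plain_sim = cosine(plain_vector(doc_a), plain_vector(doc_b))
chunk_sim = cosine(chunk_vector(doc_a), chunk_vector(doc_b))

# Vocabulary size (number of dimensions) under each representation.
plain_dims = len(set(plain_vector(doc_a)) | set(plain_vector(doc_b)))
chunk_dims = len(set(chunk_vector(doc_a)) | set(chunk_vector(doc_b)))

print(plain_sim, chunk_sim, plain_dims, chunk_dims)
```

On this toy input the chunk-based vectors use fewer dimensions than the plain vectors, which is the dimension-reduction effect the abstract describes; the numbers themselves say nothing about the 26% improvement reported for the real data.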
Received: 27 April 2017
Published: 25 August 2017