1. Chengdu Documentation and Information Center, Chinese Academy of Sciences, Chengdu 610041, China; 2. University of Chinese Academy of Sciences, Beijing 100049, China
[Objective] This paper analyzes popular text similarity measures and discusses their latest developments. [Coverage] We retrieved 69 key articles from the CNKI and Web of Science databases by searching “TI: ‘text similarity’ or ‘semantic similarity’ or ‘lexical similarity’” in Chinese and English, respectively. [Methods] We systematically reviewed the text similarity measures, focusing on their basic concepts, characteristics and future directions. [Results] The measures fall into four types: string-based, corpus-based, knowledge-based and others. Neural-network-based measures, knowledge-based measures and interdisciplinary measures could be future research directions. [Limitations] We did not discuss the applications of these measures. [Conclusions] This paper provides a comprehensive review of research on text similarity measures.
Chen Erjing, Jiang Enbo. Review of Studies on Text Similarity Measures[J]. Data Analysis and Knowledge Discovery, 2017, 1(6): 1-11.