Review of Studies on Text Similarity Measures
Chen Erjing1,2, Jiang Enbo1
1Chengdu Documentation and Information Center, Chinese Academy of Sciences, Chengdu 610041, China
2University of Chinese Academy of Sciences, Beijing 100049, China
Abstract [Objective] This paper analyzes popular text similarity measures and discusses their latest developments. [Coverage] We retrieved 69 key articles from the CNKI and Web of Science databases by searching titles (TI) for “text similarity”, “semantic similarity”, or “lexical similarity” in Chinese and English, respectively. [Methods] We systematically reviewed the text similarity measures, focusing on their basic concepts, characteristics, and future directions. [Results] Text similarity measures fall into four types: string-based, corpus-based, knowledge-based, and others. Neural-network-based measures, knowledge-based measures, and interdisciplinary measures could be future research directions. [Limitations] We did not discuss the applications of these measures. [Conclusions] This paper provides a comprehensive review of research on text similarity measures.
Received: 09 May 2017
Published: 25 August 2017