Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (6): 1-11    DOI: 10.11925/infotech.2096-3467.2017.06.01
Original Article
Review of Studies on Text Similarity Measures
Erjing Chen1,2, Enbo Jiang1
1Chengdu Documentation and Information Center, Chinese Academy of Sciences, Chengdu 610041, China
2University of Chinese Academy of Sciences, Beijing 100049, China
Abstract  

[Objective] This paper analyzes widely used text similarity measures and discusses their recent developments. [Coverage] We retrieved 69 key articles from the CNKI and Web of Science databases by searching “TI: ‘text similarity’ or ‘semantic similarity’ or ‘lexical similarity’” in Chinese and English, respectively. [Methods] We systematically reviewed the text similarity measures, focusing on their basic concepts, characteristics, and future directions. [Results] We identified four types of text similarity measures: string-based, corpus-based, knowledge-based, and others. Neural-network-based measures, knowledge-based measures, and interdisciplinary measures could be future research directions. [Limitations] We did not discuss the applications of these measures. [Conclusions] This paper is a comprehensive review of research on text similarity measures.
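
As a rough illustration of the first two families named above, the sketch below computes a string-based score (a Dice coefficient over character bigrams) and a bag-of-words score (cosine similarity of term-count vectors). The abstract does not specify any formula or implementation; the function names, the bigram granularity, and the whitespace tokenization are assumptions made for this example only.

```python
from collections import Counter
from math import sqrt


def char_bigrams(text: str) -> Counter:
    """Count overlapping character bigrams in a lower-cased string."""
    s = text.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice_similarity(a: str, b: str) -> float:
    """String-based sketch: Dice coefficient over character-bigram multisets."""
    x, y = char_bigrams(a), char_bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 0.0  # both strings too short to form a bigram
    overlap = sum((x & y).values())  # multiset intersection
    return 2.0 * overlap / total


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words sketch: cosine of raw term counts (whitespace tokens)."""
    x, y = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(x[t] * y[t] for t in x.keys() & y.keys())
    norm = sqrt(sum(v * v for v in x.values())) * sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    print(dice_similarity("night", "nacht"))
    print(cosine_similarity("text similarity measure", "semantic similarity measure"))
```

The string-based score only sees surface overlap, while the bag-of-words score ignores word order; neither captures meaning in the way corpus-trained or knowledge-based measures aim to.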

Keywords: Text Similarity; Semantic Similarity; Ontology; Bag of Words Model; Neural Network
Received: 09 May 2017      Published: 25 August 2017

Cite this article:

Erjing Chen, Enbo Jiang. Review of Studies on Text Similarity Measures. Data Analysis and Knowledge Discovery, 2017, 1(6): 1-11.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.06.01     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2017/V1/I6/1
