New Technology of Library and Information Service  2012, Vol. 28 Issue (7): 82-89    DOI: 10.11925/infotech.1003-3513.2012.07.13
Knowledge Organization and Knowledge Management
Survey of the State of the Art in Word Similarity
Liu Ping, Chen Ye
Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China
Abstract: This paper reviews and compares domestic and international research on word similarity measures in two categories: methods that use background knowledge and methods that do not. Statistical methods without background knowledge cannot reveal the actual semantic relations between word pairs, and semantic dictionaries (thesauri) cover only a limited vocabulary. Wikipedia, a large corpus that also functions as a semantic knowledge base, has therefore become an important new source of lexical semantic information. Three recent Wikipedia-based methods, WikiWalk, the conceptual graph method, and temporal semantic analysis, are described in detail. Future work in this field is expected to combine Wikipedia with other background information so that the different sources of lexical semantic information complement one another. Applying complex network analysis to mine the relatedness between words in lexical networks is another direction for future research.
Key words: Word similarity; Semantic relatedness; Similarity measures
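The abstract's claim about statistical methods without background knowledge can be made concrete with a small sketch. The code below is not a method taken from the paper: it is an illustrative, assumed example of one common statistical approach (cosine similarity over co-occurrence counts), and the function names `cooccurrence_vectors` and `cosine`, the window size, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (hypothetical data, for illustration only).
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the car needs fuel",
]
vectors = cooccurrence_vectors(corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_car = cosine(vectors["cat"], vectors["car"])
```

In this toy setting "cat" scores higher against "dog" than against "car", but "cat" and "car" still receive a nonzero score only because both occur next to "the". The score reflects shared context, not a semantic relation, which is the limitation the abstract attributes to purely statistical methods.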
Received: 2012-05-27
CLC number: TP391
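Funding: This work is one of the research outcomes of the National Natural Science Foundation of China project "Research on the Formation Mechanism and Evolution of Knowledge Networks" (No. 71173249) and the Ministry of Education Humanities and Social Sciences project "Research on the Construction of Knowledge Maps of University Experts" (No. 10YJC870022).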

Cite this article:
Liu Ping, Chen Ye. Survey of the State of the Art in Word Similarity[J]. New Technology of Library and Information Service, 2012, 28(7): 82-89. DOI: 10.11925/infotech.1003-3513.2012.07.13.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2012.07.13