Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (5): 48-58    DOI: 10.11925/infotech.2096-3467.2018.0007
Original Article
Computing Text Similarity Based on Concept Vector Space
Lin Li¹, Hui Li²
¹ School of Foreign Studies, Anhui University, Hefei 230601, China
² Department of Electronics Engineering and Information Science, University of Science and Technology of China, Hefei 230027, China
Abstract  

[Objective] This paper proposes a method to compute the semantic similarity of texts based on a concept vector space model. [Methods] First, we analyzed the text with a dependency parser and extracted key words. Then, we used a word embedding method to build a vector space for each document. Third, we measured similarities between the two vector spaces and the actual texts. Finally, we evaluated the new similarity measures on a data set of short texts and proposed an algorithm for long document classification. [Results] The proposed method effectively measured the semantic similarity of short texts and long documents. The accuracy of document classification was over 92% for the long documents. [Limitations] The performance of our method relies on the quality of the dependency parser and the word embedding vectors. [Conclusions] Combining linguistic theory with word embedding techniques can effectively measure the semantic similarity among texts. This new method also reduces computational complexity and could be used in document classification, text clustering, and automatic question answering systems.
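
As a rough illustration of the idea in the [Methods] steps, the sketch below compares two sets of extracted key words through their word vectors. It is not the paper's algorithm: the abstract does not state the similarity formula, so the averaged best-match cosine used here is an assumption, the names `TOY_EMBEDDINGS`, `directed_similarity`, and `keyword_set_similarity` are hypothetical, and the three-dimensional vectors are invented values standing in for trained embeddings. Key word extraction with a dependency parser is assumed to have already happened.

```python
import math

# Toy 3-dimensional embeddings. The values are invented for illustration;
# a real system would load vectors trained with a word embedding model.
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "dog": [0.7, 0.3, 0.0],
    "stock": [0.0, 0.1, 0.9],
    "market": [0.1, 0.0, 0.8],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def directed_similarity(source, target, embeddings):
    """Mean, over source key words, of the best cosine match in target."""
    src = [embeddings[w] for w in source if w in embeddings]
    tgt = [embeddings[w] for w in target if w in embeddings]
    if not src or not tgt:
        return 0.0
    return sum(max(cosine(s, t) for t in tgt) for s in src) / len(src)


def keyword_set_similarity(a, b, embeddings):
    """Symmetric score: average of the two directed similarities."""
    return 0.5 * (
        directed_similarity(a, b, embeddings)
        + directed_similarity(b, a, embeddings)
    )


if __name__ == "__main__":
    pets_a = ["cat", "kitten"]
    pets_b = ["dog", "kitten"]
    finance = ["stock", "market"]
    print(keyword_set_similarity(pets_a, pets_b, TOY_EMBEDDINGS))
    print(keyword_set_similarity(pets_a, finance, TOY_EMBEDDINGS))
```

With these toy vectors the two pet-related key word sets score higher than the pet and finance sets. Averaging both directions keeps the score symmetric, so the order of the two texts does not matter.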

Key words: Text Similarity; Word Embedding; Dependency Syntax Parser; Document Classification
Received: 03 January 2018      Published: 20 June 2018

Cite this article:

Lin Li, Hui Li. Computing Text Similarity Based on Concept Vector Space. Data Analysis and Knowledge Discovery, 2018, 2(5): 48-58.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.0007     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2018/V2/I5/48
