Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (5): 48-58    DOI: 10.11925/infotech.2096-3467.2018.0007
Computing Text Similarity Based on Concept Vector Space
Lin Li1, Hui Li2
1School of Foreign Studies, Anhui University, Hefei 230601, China
2Department of Electronics Engineering and Information Science, University of Science and Technology of China, Hefei 230027, China
[Objective] This paper proposes a method to compute the semantic similarity of texts based on a concept vector space model. [Methods] First, we analyzed each text with a dependency parser and extracted its keywords. Second, we used a word embedding method to build a vector space for each document. Third, we measured the similarity between two vector spaces and, through it, between the texts they represent. Finally, we evaluated the new similarity measures on a data set of short texts and proposed an algorithm for long document classification. [Results] The proposed method effectively measured the semantic similarity of both short texts and long documents. The accuracy of document classification was over 92% for long documents. [Limitations] The performance of our method relies on the quality of the dependency parser and of the word embedding vectors. [Conclusions] Combining linguistic theory with word embedding techniques can effectively measure the semantic similarity among texts. The new method also reduces computational complexity and could be used in document classification, text clustering, and automatic question answering systems.
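The pipeline summarized above (keywords, one embedding vector per keyword, then a similarity between the two sets of vectors) can be illustrated with a minimal sketch. This is not the measure defined in the paper, which the abstract does not specify. It assumes the keywords have already been extracted, uses a hypothetical toy embedding table (`TOY_EMBEDDINGS`, with made-up values), and substitutes a simple symmetric best-match average of cosine similarities. The function names `cosine`, `directed_similarity`, and `text_similarity` are illustrative only.

```python
import math

# Hypothetical 3-dimensional "embeddings"; the values are invented for
# illustration and do not come from any trained model or from the paper.
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.0],
    "dog": [0.7, 0.3, 0.1],
    "car": [0.0, 0.1, 0.9],
    "truck": [0.1, 0.0, 0.8],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def directed_similarity(source, target, embeddings):
    """Average, over source keywords, of the best cosine match in target."""
    src = [embeddings[w] for w in source if w in embeddings]
    tgt = [embeddings[w] for w in target if w in embeddings]
    if not src or not tgt:
        return 0.0
    return sum(max(cosine(u, v) for v in tgt) for u in src) / len(src)


def text_similarity(keywords_a, keywords_b, embeddings=TOY_EMBEDDINGS):
    """Symmetric similarity: mean of the two directed best-match averages."""
    forward = directed_similarity(keywords_a, keywords_b, embeddings)
    backward = directed_similarity(keywords_b, keywords_a, embeddings)
    return 0.5 * (forward + backward)


if __name__ == "__main__":
    # Keyword lists stand in for the output of the keyword-extraction step.
    print(text_similarity(["cat", "kitten"], ["dog"]))
    print(text_similarity(["cat", "kitten"], ["car", "truck"]))
```

With these toy vectors, the animal keyword sets score closer to each other than to the vehicle set. In practice the table would be replaced by vectors from a trained embedding model, and the keyword lists by the output of a dependency-parser-based extraction step.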

Key words: Text Similarity; Word Embedding; Dependency Syntax Parser; Document Classification
Received: 03 January 2018      Published: 20 June 2018

Lin Li, Hui Li. Computing Text Similarity Based on Concept Vector Space. Data Analysis and Knowledge Discovery, 2018, 2(5): 48-58.

