[Objective] This paper proposes a method to compute the semantic similarity of texts based on a concept vector space model. [Methods] First, we analyzed the texts with a dependency parser and extracted their key words. Then, we used word embeddings to build a vector space for each document. Third, we measured the similarity between two texts by comparing their vector spaces. Finally, we evaluated the new similarity measures on a data set of short texts and proposed an algorithm for long-document classification. [Results] The proposed method effectively measured the semantic similarity of both short texts and long documents, and its classification accuracy on long documents exceeded 92%. [Limitations] The performance of our method relies on the quality of the dependency parser and the word embedding vectors. [Conclusions] Combining linguistic theory with word embedding techniques can effectively measure the semantic similarity among texts. The new method also reduces computational complexity and could be applied to document classification, text clustering, and automatic question answering systems.
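The [Methods] pipeline (extract key words, embed them, compare document vectors) can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the toy embedding table and the `centroid`/`cosine` helpers are hypothetical stand-ins for the trained word2vec vectors and the dependency-parser key-word extraction described above.

```python
import math

# Toy embedding table. In the actual method these vectors would come from a
# trained word2vec model, and the key words from a dependency parser.
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.9, 0.4],
}

def centroid(keywords):
    """Average the embeddings of the extracted key words into one document vector."""
    vecs = [EMBEDDINGS[w] for w in keywords if w in EMBEDDINGS]
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def cosine(a, b):
    """Cosine similarity between two document vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two "documents" about animals score higher than an animal/vehicle pair.
print(cosine(centroid(["cat"]), centroid(["dog"])))
print(cosine(centroid(["cat"]), centroid(["car"])))
```

Averaging key-word vectors into a single document centroid is one simple way to realize a concept vector space; it keeps the comparison to a single cosine per document pair, which is where the reduction in computational complexity comes from.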