Computing Text Semantic Similarity with Syntactic Network of Co-occurrence Distance
Jiao Yan1, Jing Ma1, Kang Fang2
1 College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
2 Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China
[Objective] This paper aims to overcome the limitations of existing text similarity methods by combining multiple text features: semantics, syntax, and word frequency. [Methods] First, we constructed a complex network of the text that combines co-occurrence distance with dependency syntax. Then, we used information entropy to determine the weights of the network's dynamic characteristics. Finally, we utilized word embeddings, syntactic structure, and inverted file information to avoid losing word structure and semantics. [Results] Compared with the syntactic network + TF-IDF algorithm, the F1 value of the proposed algorithm increased by up to 12.1%. The result was 5.8% higher than that of the co-occurrence network + semantic method. The average F1 values were 5.8% and 1.6% higher than those of the existing methods. [Limitations] The selection of indicators for feature extraction needs further improvement so that the importance of nodes is captured more comprehensively. [Conclusions] Compared with traditional methods, the proposed model can reduce the loss of text information and effectively improve the accuracy of text similarity calculation.
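The first two steps in [Methods] can be illustrated with a minimal sketch. It is not the authors' implementation, and every specific choice in it is an assumption made for illustration: the window size, the 1/distance edge weighting, and the use of node degree and node strength as the indicators to be weighted. Dependency-syntax edges and word embeddings are omitted, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def cooccurrence_network(tokens, window=3):
    """Build an undirected weighted graph from a token list.

    Assumption: two words are linked when they occur within `window`
    positions of each other, and each co-occurrence adds 1 / distance
    to the edge weight, so closer pairs are linked more strongly.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for i, u in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d >= len(tokens):
                break
            v = tokens[i + d]
            if u == v:
                continue  # skip self-loops
            graph[u][v] += 1.0 / d
            graph[v][u] += 1.0 / d
    return graph


def node_features(graph):
    """Two illustrative node indicators: degree and strength (sum of edge weights)."""
    return {w: [float(len(nbrs)), sum(nbrs.values())] for w, nbrs in graph.items()}


def entropy_weights(features):
    """Entropy weight method over the indicator columns.

    An indicator whose values are spread unevenly across nodes has lower
    entropy and therefore receives a larger weight.
    """
    rows = list(features.values())
    n, m = len(rows), len(rows[0])
    if n < 2:
        return [1.0 / m] * m  # entropy is undefined for a single node
    divergences = []
    for j in range(m):
        total = sum(r[j] for r in rows)
        probs = [r[j] / total for r in rows if r[j] > 0]
        entropy = -sum(p * math.log(p) for p in probs) / math.log(n)
        divergences.append(max(0.0, 1.0 - entropy))  # clamp rounding noise
    s = sum(divergences)
    return [d / s for d in divergences] if s > 0 else [1.0 / m] * m


tokens = ["a", "b", "c", "a", "b"]
graph = cooccurrence_network(tokens, window=2)
weights = entropy_weights(node_features(graph))
```

In this toy input every word has the same degree, so the degree indicator carries no discriminating information and almost all of the weight goes to node strength.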