Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (12): 93-100    DOI: 10.11925/infotech.2096-3467.2019.0737
Computing Text Semantic Similarity with Syntactic Network of Co-occurrence Distance
Jiao Yan1, Jing Ma1, Kang Fang2
1 College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
2 Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China
Abstract  

[Objective] This paper aims to overcome the limitations of existing text similarity methods by combining multiple text features: semantics, syntax and word frequency. [Methods] First, we constructed a text complex network that combines co-occurrence distance with dependency syntax. Then, we used information entropy to determine the weights of the dynamics characteristics. Finally, we utilized word embedding, syntactic structure and inverted file information to avoid the loss of word structure and semantics. [Results] Compared with the syntactic network + TF-IDF algorithm, the F1 value of the proposed algorithm increased by up to 12.1%. It was also up to 5.8% higher than that of the co-occurrence network + semantic method. On average, the F1 values were 5.8% and 1.6% higher than those of the two existing methods, respectively. [Limitations] The selection of indicators for feature extraction needs further improvement so that the importance of nodes is captured more comprehensively. [Conclusions] Compared with traditional methods, the proposed model reduces the loss of text information and effectively improves the accuracy of text similarity calculation.
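The entropy-based weighting step mentioned in the abstract can be illustrated with the standard entropy weight method. This is only a minimal sketch under assumptions: the abstract does not give the exact formula or the list of network features, so the function name `entropy_weights` and the toy feature matrix are hypothetical.

```python
import math

def entropy_weights(matrix):
    """Entropy weight method (assumed variant, not taken from the paper).

    matrix: list of rows, one row per node, one non-negative value per feature.
    Returns one weight per feature; the weights sum to 1.
    """
    n = len(matrix)
    m = len(matrix[0])
    divergences = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        if total == 0 or n < 2:
            # A feature with no mass carries no information.
            divergences.append(0.0)
            continue
        probs = [value / total for value in column]
        # Shannon entropy, normalised to [0, 1] by log(n).
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        # Features that vary more across nodes have lower entropy, higher weight.
        divergences.append(max(0.0, 1.0 - entropy))
    total_divergence = sum(divergences)
    if total_divergence == 0:
        return [1.0 / m] * m
    return [d / total_divergence for d in divergences]

# Hypothetical toy data: three nodes, two features.
# The first feature is constant, so it should receive (almost) no weight.
weights = entropy_weights([[1.0, 5.0], [1.0, 1.0], [1.0, 0.0]])
print(weights)
```

A feature that takes the same value on every node has maximal entropy and therefore contributes nothing to the ranking, which is the intuition behind using entropy to weight features.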

Key words: Dependency Grammar; Text Complex Network; Semantic Similarity; Co-occurrence Distance; Feature Extraction
Received: 24 June 2019      Published: 25 December 2019
ZTFLH:  TP391  
Corresponding Authors: Jing Ma     E-mail: majing5525@126.com

Cite this article:

Jiao Yan, Jing Ma, Kang Fang. Computing Text Semantic Similarity with Syntactic Network of Co-occurrence Distance. Data Analysis and Knowledge Discovery, 2019, 3(12): 93-100.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.0737     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I12/93

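source | target | weight
network (网络) | cost (费用) | 10
enthusiast (爱好者) | face (面对) | 10
literary world (文坛) | leader (领袖) | 10
network (网络) | era (时代) | 9.285714
poet (诗人) | vocabulary (词汇) | 9.285714
art (艺术) | humanity (人类) | 8.571429
moment (时刻) | hobby (爱好) | 8.571429
art (艺术) | society (社会) | 8.571429
art (艺术) | description (描写) | 8.571429
science and technology (科技) | challenge (挑战) | 8.571429
slogan (口号) | regard (看待) | 8.571429
art (艺术) | activity (活动) | 7.857143
art (艺术) | subject (主体) | 7.857143
activity (活动) | type (类型) | 7.857143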
Run | Precision (%) | Recall (%) | F1 (%)
1 | 89.2 | 88.3 | 88.3
2 | 89.8 | 87.5 | 87.4
3 | 89.7 | 89.2 | 89.2
4 | 91.2 | 90.8 | 90.8
5 | 90.9 | 90.4 | 90.4
6 | 86.8 | 86.3 | 86.3
7 | 80.5 | 80.4 | 80.4
8 | 91.6 | 91.3 | 91.3
9 | 93.7 | 93.8 | 93.8
10 | 86.3 | 85.8 | 85.8
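Category | Proposed algorithm | Syntactic network + TF-IDF | Co-occurrence network + semantics
Art | 86.7 | 83.1 | 86.5
History | 74.8 | 62.7 | 73.9
Computer | 93.5 | 95.3 | 93.1
Environment | 88.8 | 84.8 | 89.7
Agriculture | 92.8 | 83.9 | 90.6
Economy | 89.1 | 81.3 | 83.3
Politics | 88.3 | 80.9 | 85.4
Sports | 92.9 | 88.6 | 91.7
Average | 88.4 | 82.6 | 86.8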
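The figures quoted in the abstract can be checked against the per-category table above. The sketch below assumes the table values are F1 scores in percent, which the abstract's numbers suggest but the table itself does not state; category names are given in English, and `SCORES`, `column_means` and `largest_gain` are hypothetical names.

```python
from statistics import mean

# Rows copied from the table above.
# Columns: proposed algorithm, syntactic network + TF-IDF,
# co-occurrence network + semantics.
SCORES = {
    "Art": (86.7, 83.1, 86.5),
    "History": (74.8, 62.7, 73.9),
    "Computer": (93.5, 95.3, 93.1),
    "Environment": (88.8, 84.8, 89.7),
    "Agriculture": (92.8, 83.9, 90.6),
    "Economy": (89.1, 81.3, 83.3),
    "Politics": (88.3, 80.9, 85.4),
    "Sports": (92.9, 88.6, 91.7),
}

def column_means(scores):
    """Average each method's score over all categories, to one decimal."""
    return [round(mean(column), 1) for column in zip(*scores.values())]

def largest_gain(scores, baseline_index):
    """Category where the proposed algorithm beats a baseline by the most."""
    gains = {
        category: round(row[0] - row[baseline_index], 1)
        for category, row in scores.items()
    }
    best = max(gains, key=gains.get)
    return best, gains[best]

means = column_means(SCORES)
print(means)                    # per-method averages
print(largest_gain(SCORES, 1))  # versus syntactic network + TF-IDF
print(largest_gain(SCORES, 2))  # versus co-occurrence network + semantics
```

Under that assumption, the averages match the table's last row, and the largest per-category gains (History and Economy) correspond to the 12.1% and 5.8% figures in the abstract. The table also shows that the proposed algorithm is not best in every category: it trails on Computer and Environment.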