Data Analysis and Knowledge Discovery, 2019, Vol. 3, Issue 12: 93-100    DOI: 10.11925/infotech.2096-3467.2019.0737
Research Paper
Computing Text Semantic Similarity with Syntactic Network of Co-occurrence Distance *
Jiao Yan1, Jing Ma1, Kang Fang2
1 College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
2 Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China
Abstract

[Objective] This paper combines semantic, syntactic, and word-frequency features of text to overcome the limitations of existing text similarity methods. [Methods] We construct a text complex network that fuses co-occurrence distance with dependency syntax, and use information entropy to determine the weights of the network's dynamics feature indicators. Word embeddings, syntactic structure, and inverted file information are then used to avoid losing word structure and semantics. [Results] In comparative experiments, the per-category F1 value of the proposed algorithm exceeds that of the syntactic network + TF-IDF method by up to 12.1 percentage points and that of the co-occurrence network + semantics method by up to 5.8 percentage points. Its average F1 value across categories is 5.8 and 1.6 percentage points higher than these two methods, respectively. [Limitations] The selection of indicators in feature extraction needs further improvement so that the importance of nodes can be distinguished more comprehensively. [Conclusions] Compared with traditional methods, the proposed algorithm reduces the loss of text information, achieves text dimensionality reduction, and effectively improves the accuracy of text similarity calculation.

Keywords: Dependency Grammar; Text Complex Network; Semantic Similarity; Co-occurrence Distance; Feature Extraction
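The abstract states that information entropy is used to weight the network feature indicators, but this page does not give the formula. As a rough illustration only, the sketch below applies the standard entropy weight method, which is an assumption and may differ from the authors' procedure. The function name `entropy_weights` and the toy indicator matrix are hypothetical.

```python
import math


def entropy_weights(matrix):
    """Standard entropy weight method (assumed, not taken from the paper).

    matrix: list of rows, one row per node, one column per indicator.
    All values must be non-negative and there must be at least two rows.
    Returns one weight per indicator; the weights sum to 1.
    """
    n = len(matrix)
    m = len(matrix[0])
    divergences = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        if total == 0:
            # An all-zero indicator carries no information.
            divergences.append(0.0)
            continue
        probs = [x / total for x in column]
        # Normalized Shannon entropy in [0, 1].
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        # Clamp to guard against tiny negative values from rounding.
        divergences.append(max(0.0, 1.0 - entropy))
    total_div = sum(divergences)
    if total_div == 0:
        # Every indicator is uniform: fall back to equal weights.
        return [1.0 / m] * m
    return [d / total_div for d in divergences]


# Hypothetical toy data: 3 nodes, 2 indicators.
# The first indicator is identical for all nodes, the second varies.
toy = [
    [1.0, 1.0],
    [1.0, 2.0],
    [1.0, 7.0],
]
weights = entropy_weights(toy)
print(weights)
```

An indicator that takes the same value on every node has maximal entropy and therefore receives zero weight, while indicators that separate nodes more sharply receive larger weights.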
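Received: 2019-06-24
CLC number: TP391
Funding: * This work was supported by the National Natural Science Foundation of China project "Research on Adaptive Topic Tracking Methods for Online Public Opinion Based on Evolving Ontology" (Grant No. 71373123) and by the Fundamental Research Funds for the Central Universities, forward-looking development strategy research project "Research on Government Management Paradigms for Cross-border E-commerce Based on Big Data Technology" (Grant No. NW2018004).
Corresponding author: Jing Ma     E-mail: majing5525@126.com
Cite this article:
Jiao Yan, Jing Ma, Kang Fang. Computing Text Semantic Similarity with Syntactic Network of Co-occurrence Distance *[J]. Data Analysis and Knowledge Discovery, 2019, 3(12): 93-100. DOI: 10.11925/infotech.2096-3467.2019.0737.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.0737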
Figure 1  Dependency parsing result
Figure 2  Syntactic network fused with co-occurrence distance
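Source              Target              Weight
网络 (network)      费用 (cost)         10
爱好者 (enthusiast) 面对 (face)         10
文坛 (literary circles) 领袖 (leader)   10
网络 (network)      时代 (era)          9.285714
诗人 (poet)         词汇 (vocabulary)   9.285714
艺术 (art)          人类 (humanity)     8.571429
时刻 (moment)       爱好 (hobby)        8.571429
艺术 (art)          社会 (society)      8.571429
艺术 (art)          描写 (describe)     8.571429
科技 (technology)   挑战 (challenge)    8.571429
口号 (slogan)       看待 (regard)       8.571429
艺术 (art)          活动 (activity)     7.857143
艺术 (art)          主体 (subject)      7.857143
活动 (activity)     类型 (type)         7.857143
Table 1  Selected node pairs and edge weights in the syntactic network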
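To make the edge list in Table 1 concrete, the following sketch loads the listed node pairs into a weighted graph and computes each node's degree and strength (the sum of its incident edge weights). Treating the edges as undirected and using degree and strength as the indicators are assumptions made for illustration; the page does not say which indicators the authors compute. The helper name `degree_and_strength` is hypothetical.

```python
from collections import defaultdict

# Node pairs and edge weights copied from Table 1 (tokens kept in Chinese).
EDGES = [
    ("网络", "费用", 10.0),        # network - cost
    ("爱好者", "面对", 10.0),      # enthusiast - face
    ("文坛", "领袖", 10.0),        # literary circles - leader
    ("网络", "时代", 9.285714),    # network - era
    ("诗人", "词汇", 9.285714),    # poet - vocabulary
    ("艺术", "人类", 8.571429),    # art - humanity
    ("时刻", "爱好", 8.571429),    # moment - hobby
    ("艺术", "社会", 8.571429),    # art - society
    ("艺术", "描写", 8.571429),    # art - describe
    ("科技", "挑战", 8.571429),    # technology - challenge
    ("口号", "看待", 8.571429),    # slogan - regard
    ("艺术", "活动", 7.857143),    # art - activity
    ("艺术", "主体", 7.857143),    # art - subject
    ("活动", "类型", 7.857143),    # activity - type
]


def degree_and_strength(edges):
    """Return (degree, strength) dicts, treating every edge as undirected."""
    degree = defaultdict(int)
    strength = defaultdict(float)
    for source, target, weight in edges:
        for node in (source, target):
            degree[node] += 1
            strength[node] += weight
    return dict(degree), dict(strength)


degree, strength = degree_and_strength(EDGES)
strongest = max(strength, key=strength.get)
print(strongest, degree[strongest], round(strength[strongest], 6))
```

Within this excerpt, 艺术 (art) has the most connections, which is the kind of signal a network-based feature extraction step can use to rank candidate feature words.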
Figure 3  Effect of the parameter $\alpha$ on similarity-based classification performance
Fold   Precision   Recall   F1
1      89.2        88.3     88.3
2      89.8        87.5     87.4
3      89.7        89.2     89.2
4      91.2        90.8     90.8
5      90.9        90.4     90.4
6      86.8        86.3     86.3
7      80.5        80.4     80.4
8      91.6        91.3     91.3
9      93.7        93.8     93.8
10     86.3        85.8     85.8
Table 2  Evaluation metrics for each fold of ten-fold cross-validation (%)
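The per-fold values in Table 2 can be summarized with a mean and a sample standard deviation. The sketch below does this with Python's standard `statistics` module; the variable names are illustrative, and the summary statistics are computed here rather than reported on this page.

```python
from statistics import mean, stdev

# Per-fold results copied from Table 2 (percent).
precision = [89.2, 89.8, 89.7, 91.2, 90.9, 86.8, 80.5, 91.6, 93.7, 86.3]
recall = [88.3, 87.5, 89.2, 90.8, 90.4, 86.3, 80.4, 91.3, 93.8, 85.8]
f1 = [88.3, 87.4, 89.2, 90.8, 90.4, 86.3, 80.4, 91.3, 93.8, 85.8]

summary = {
    name: (mean(values), stdev(values))
    for name, values in (("precision", precision), ("recall", recall), ("f1", f1))
}

for name, (avg, spread) in summary.items():
    print(f"{name}: mean={avg:.2f}, sample std={spread:.2f}")
```

The mean F1 over the ten folds comes to about 88.4, which agrees with the average reported for the proposed algorithm in Table 3. Fold 7 is the clear outlier at 80.4.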
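Category      Proposed algorithm   Syntactic network + TF-IDF   Co-occurrence network + semantics
Art           86.7                 83.1                         86.5
History       74.8                 62.7                         73.9
Computer      93.5                 95.3                         93.1
Environment   88.8                 84.8                         89.7
Agriculture   92.8                 83.9                         90.6
Economy       89.1                 81.3                         83.3
Politics      88.3                 80.9                         85.4
Sports        92.9                 88.6                         91.7
Average       88.4                 82.6                         86.8
Table 3  F1 values of the three experiments by category (%)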
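The improvements quoted in the abstract can be checked directly against Table 3. The sketch below computes the per-category F1 differences between the proposed algorithm and each baseline. The dictionary layout and the names `gains_vs_syntactic` and `gains_vs_cooccurrence` are illustrative choices, not part of the original work.

```python
# F1 values copied from Table 3 (percent):
# category -> (proposed, syntactic network + TF-IDF, co-occurrence network + semantics)
F1_BY_CATEGORY = {
    "Art": (86.7, 83.1, 86.5),
    "History": (74.8, 62.7, 73.9),
    "Computer": (93.5, 95.3, 93.1),
    "Environment": (88.8, 84.8, 89.7),
    "Agriculture": (92.8, 83.9, 90.6),
    "Economy": (89.1, 81.3, 83.3),
    "Politics": (88.3, 80.9, 85.4),
    "Sports": (92.9, 88.6, 91.7),
}

# Differences in percentage points, rounded to one decimal like the table.
gains_vs_syntactic = {
    cat: round(p - s, 1) for cat, (p, s, _) in F1_BY_CATEGORY.items()
}
gains_vs_cooccurrence = {
    cat: round(p - c, 1) for cat, (p, _, c) in F1_BY_CATEGORY.items()
}

best_syn = max(gains_vs_syntactic, key=gains_vs_syntactic.get)
best_co = max(gains_vs_cooccurrence, key=gains_vs_cooccurrence.get)
worse_than_syn = [c for c, g in gains_vs_syntactic.items() if g < 0]
worse_than_co = [c for c, g in gains_vs_cooccurrence.items() if g < 0]

# Column averages, rounded to one decimal.
n = len(F1_BY_CATEGORY)
averages = [
    round(sum(row[i] for row in F1_BY_CATEGORY.values()) / n, 1) for i in range(3)
]

print(best_syn, gains_vs_syntactic[best_syn])
print(best_co, gains_vs_cooccurrence[best_co])
print(worse_than_syn, worse_than_co, averages)
```

The largest gains are 12.1 points (History, against syntactic network + TF-IDF) and 5.8 points (Economy, against co-occurrence network + semantics), matching the abstract. The same comparison also shows that the proposed algorithm is not best everywhere: it trails the syntactic network + TF-IDF method on Computer and the co-occurrence network + semantics method on Environment.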
Figure 4  F1 values of the three experiments by category
Figure 5  Average F1 values of the three experiments