Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (5): 48-58     https://doi.org/10.11925/infotech.2096-3467.2018.0007
Research Paper
Computing Text Similarity Based on Concept Vector Space
Li Lin1, Li Hui2
1School of Foreign Studies, Anhui University, Hefei 230601, China
2Department of Electronics Engineering and Information Science, University of Science and Technology of China, Hefei 230027, China
Abstract

[Objective] This paper models a text as a concept vector space and proposes a method for computing text similarity under this model. [Methods] We parse each text with a dependency parser, extract its key concept words, and use word embeddings to build a vector space that represents the text. We then define a similarity between vector spaces to quantify how similar two texts are. The measure is evaluated on a standard benchmark for short-text similarity and is used to build a classification algorithm for long documents. [Results] The similarity defined on concept vector spaces effectively measures the semantic similarity between texts, and classification accuracy exceeds 92% on the long-document classification datasets. [Limitations] The method depends on the quality of the dependency parse and of the word embedding vectors. [Conclusions] Combining linguistic knowledge with word embedding techniques effectively measures the similarity between texts at low computational complexity. The method can be applied to document classification, text clustering, and automatic question answering systems.

Keywords: Text Similarity; Word Embedding; Dependency Parsing; Document Classification
Received: 2018-01-03      Published: 2018-06-20
CLC number: TP391; G35
Cite this article:
Li Lin, Li Hui. Computing Text Similarity Based on Concept Vector Space[J]. Data Analysis and Knowledge Discovery, 2018, 2(5): 48-58.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.0007      or      http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2018/V2/I5/48
No.   Current node (dependency label)   Parent node (dependency label)   POS tag
1     nsubj|nsubjpass                   any                              NOUN|PROPN
2     dobj|attr|oprd|iobj               any                              NOUN|PROPN
3     appos|nmod|npadvmod               any                              NOUN|PROPN
4     amod|acomp|compound               any                              ADJ
5     pobj                              prep                             NOUN|PROPN
6     conj                              pobj|nsubj|nsubjpass|dobj        NOUN|PROPN
7     nmod                              any                              VERB|NOUN|PROPN|ADJ
  Table: Grammar rules for extracting concept words
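
The rule table can be read as a filter over parsed tokens. The following is a minimal sketch of that reading, not the authors' implementation: the `Token` type, the function names, and the hand-annotated example parse are all hypothetical, and in practice the labels would come from a dependency parser.

```python
from typing import List, NamedTuple

class Token(NamedTuple):
    lemma: str     # base form of the word
    dep: str       # dependency label of this token (current node)
    head_dep: str  # dependency label of the token's parent node
    pos: str       # coarse part-of-speech tag

ANY = None  # wildcard meaning "any parent label"

# (current-node labels, parent-node labels or ANY, allowed POS tags),
# transcribed row by row from the rule table.
RULES = [
    ({"nsubj", "nsubjpass"}, ANY, {"NOUN", "PROPN"}),
    ({"dobj", "attr", "oprd", "iobj"}, ANY, {"NOUN", "PROPN"}),
    ({"appos", "nmod", "npadvmod"}, ANY, {"NOUN", "PROPN"}),
    ({"amod", "acomp", "compound"}, ANY, {"ADJ"}),
    ({"pobj"}, {"prep"}, {"NOUN", "PROPN"}),
    ({"conj"}, {"pobj", "nsubj", "nsubjpass", "dobj"}, {"NOUN", "PROPN"}),
    ({"nmod"}, ANY, {"VERB", "NOUN", "PROPN", "ADJ"}),
]

def is_concept(tok: Token) -> bool:
    """Return True if the token satisfies at least one rule."""
    for deps, parents, tags in RULES:
        parent_ok = parents is ANY or tok.head_dep in parents
        if tok.dep in deps and tok.pos in tags and parent_ok:
            return True
    return False

def extract_concepts(tokens: List[Token]) -> List[str]:
    """Keep the lowercased lemmas of all tokens that match a rule."""
    return [t.lemma.lower() for t in tokens if is_concept(t)]

# Hand-annotated (assumed) parse of "The dollar has hit its highest level".
example = [
    Token("the", "det", "nsubj", "DET"),
    Token("dollar", "nsubj", "ROOT", "NOUN"),
    Token("have", "aux", "ROOT", "AUX"),
    Token("hit", "ROOT", "ROOT", "VERB"),
    Token("its", "poss", "dobj", "PRON"),
    Token("high", "amod", "dobj", "ADJ"),
    Token("level", "dobj", "ROOT", "NOUN"),
]
print(extract_concepts(example))  # ['dollar', 'high', 'level']
```

Keeping the rules as data rather than as nested conditionals makes it easy to compare the code against the table row by row.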
Sentence: The species is classified in the genus Panthera with the lion, leopard, jaguar and snow leopard.
Concept words: species, genus, panthera, lion, leopard, jaguar, snow, leopard

Sentence: The dollar has hit its highest level against the euro in almost three months after the Federal Reserve head said the US trade deficit is set to stabilise.
Concept words: dollar, high, level, euro, month, federal reserve, head, US, trade, deficit

Sentence: Wayne Rooney made a winning return to Everton as Manchester United cruised into the FA Cup quarter-finals.
Concept words: wayne, rooney, win, return, everton, FA, cup, quarter, finals
  Table: Examples of concept word extraction
  Figure: Example of computing the text similarity between two sentences
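
This page does not reproduce the paper's similarity formula, so the sketch below is only a generic stand-in: it compares two sets of word vectors by averaging each vector's best cosine match in the other set and then symmetrizing. The function names and the toy 2-dimensional vectors are assumptions for illustration and should not be read as the authors' measure.

```python
import math
from typing import List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def directed(A: List[Sequence[float]], B: List[Sequence[float]]) -> float:
    """Mean, over vectors in A, of the best cosine match found in B."""
    return sum(max(cosine(a, b) for b in B) for a in A) / len(A)

def set_similarity(A: List[Sequence[float]], B: List[Sequence[float]]) -> float:
    """Symmetric best-match average (an assumed, generic measure)."""
    if not A or not B:
        return 0.0
    return 0.5 * (directed(A, B) + directed(B, A))

# Toy stand-ins for the embeddings of two sentences' concept words.
sent_a = [[1.0, 0.0], [0.0, 1.0]]
sent_b = [[1.0, 0.0]]
print(set_similarity(sent_a, sent_b))  # 0.75
```

Averaging both directions keeps the score symmetric, so a short sentence whose concepts are all covered by a long one still loses credit for the long sentence's unmatched concepts.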
Source     Train    Dev      Test     Total
News       3,299    500      500      4,299
Captions   2,000    625      625      3,250
Forum      450      375      254      1,079
Total      5,749    1,500    1,379    8,628
  Table: Sources and splits of the STS Benchmark
  Figure: Processing workflow for the STS Benchmark dataset
Method          Dev      Test
BOW             0.403    0.294
BOW+Word2Vec    0.653    0.532
Concept VS      0.725    0.642
Word2Vec        0.700    0.565
PV-DBOW         0.722    0.649
  Table: Semantic similarity results on the STS Benchmark dataset (Pearson correlation coefficient ρ)
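
For reference, the Pearson correlation coefficient reported above can be computed as follows. This is a minimal sketch in plain Python; the gold and predicted scores are made-up illustrative values, not data from the paper.

```python
import math
from typing import Sequence

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of std. deviations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0.0 or dev_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)

# Hypothetical human ratings and system similarity scores for five sentence pairs.
gold = [0.0, 1.5, 2.5, 4.0, 5.0]
predicted = [0.10, 0.35, 0.40, 0.80, 0.95]
print(round(pearson(gold, predicted), 3))
```

Because the coefficient is invariant to shifting and positive scaling, system scores do not need to be mapped onto the same range as the human ratings before evaluation.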
BBC                    BBC Sport           Reuters             Classic
Category       Docs    Category    Docs    Category    Docs    Category    Docs
Business       510     Athletics   101     Earn        3,735   CACM        1,480
Entertainment  386     Cricket     124     Acq         2,142   CRAN        1,393
Politics       417     Football    265     Crude       375     CISI        1,397
Sport          511     Rugby       147     Interest    369     MED         1,011
Technology     401     Tennis      100     Trade       366
                                           money-fx    259
                                           ship        256
                                           wheat       162
                                           sugar       149
                                           coffee      123
  Table: Document statistics for the four datasets
  Figure: Processing workflow for the text classification datasets
Method        BBC      BBC Sport    Reuters    Classic
BOW           0.686    0.841        0.833      0.703
TF-IDF        0.653    0.532        0.722      0.689
Concept VS    0.957    0.973        0.925      0.958
WMD[27]       -        0.954        0.965      0.972
  Table: Results on the text classification datasets (classification accuracy)
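
This page does not describe the classifier behind these accuracies. As one plausible setup, a text similarity can drive a k-nearest-neighbor classifier, with accuracy measured as the fraction of correctly labeled test documents. The sketch below assumes that setup; the Jaccard overlap is only a stand-in for the concept vector space similarity, and the tiny corpus is invented for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Doc = Sequence[str]              # a document as a list of concept words
Labeled = Tuple[Doc, str]        # (document, class label)
Similarity = Callable[[Doc, Doc], float]

def jaccard(a: Doc, b: Doc) -> float:
    """Stand-in similarity: overlap of the two concept word sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def knn_predict(query: Doc, train: List[Labeled], similarity: Similarity, k: int = 3) -> str:
    """Majority vote among the k training documents most similar to the query."""
    ranked = sorted(train, key=lambda item: similarity(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def accuracy(test: List[Labeled], train: List[Labeled], similarity: Similarity, k: int = 3) -> float:
    """Fraction of test documents whose predicted label equals the true label."""
    correct = sum(knn_predict(doc, train, similarity, k) == label for doc, label in test)
    return correct / len(test)

# Invented toy corpus of concept words.
train_docs = [
    (["goal", "match", "league"], "sport"),
    (["bank", "profit", "market"], "business"),
    (["team", "coach", "match"], "sport"),
    (["share", "market", "deal"], "business"),
]
test_docs = [
    (["match", "team", "goal"], "sport"),
    (["market", "profit", "share"], "business"),
]
print(accuracy(test_docs, train_docs, jaccard, k=3))  # 1.0
```

Passing the similarity as a function keeps the classifier independent of how documents are compared, so any document-level measure can be swapped in without changing the voting logic.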
  Figure: Document distribution in the BBC, BBC Sport, and Classic datasets
References
[1] Chen Erjing, Jiang Enbo. Review of Studies on Text Similarity Measures[J]. Data Analysis and Knowledge Discovery, 2017, 1(6): 1-11. (in Chinese)
[2] Salton G, Wong A, Yang C S. A Vector Space Model for Automatic Indexing[J]. Communications of the ACM, 1975, 18(11): 613-620. doi: 10.1145/361219.361220
[3] Salton G, Buckley C. Term-Weighting Approaches in Automatic Text Retrieval[J]. Information Processing & Management, 1988, 24(5): 513-523.
[4] Landauer T K, Foltz P W, Laham D. An Introduction to Latent Semantic Analysis[J]. Discourse Processes, 1998, 25(2-3): 259-284. doi: 10.1080/01638539809545028
[5] Hofmann T. Probabilistic Latent Semantic Analysis[C]// Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence. 1999: 289-296.
[6] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet Allocation[J]. Journal of Machine Learning Research, 2003, 3: 993-1022.
[7] Miller G A. WordNet: A Lexical Database for English[J]. Communications of the ACM, 1995, 38(11): 39-41.
[8] Dong Zhendong, Dong Qiang, Hao Changling. Theoretical Findings of HowNet[J]. Journal of Chinese Information Processing, 2007, 21(4): 3-9. (in Chinese)
[9] Mei Jiaju, Zhu Yiming, Gao Yunqi, et al. Tongyici Cilin[M]. Shanghai: Shanghai Lexicographical Publishing House, 1983. (in Chinese)
[10] Pedersen T, Patwardhan S, Michelizzi J. WordNet::Similarity - Measuring the Relatedness of Concepts[C]// Proceedings of the 19th National Conference on Artificial Intelligence. 2004: 38-41.
[11] Jiang Min, Xiao Shibin, Wang Hongwei, et al. An Improved Word Similarity Computing Method Based on HowNet[J]. Journal of Chinese Information Processing, 2008, 22(5): 84-89. (in Chinese)
[12] Tian Jiule, Zhao Wei. Words Similarity Algorithm Based on Tongyici Cilin in Semantic Web Adaptive Learning System[J]. Journal of Jilin University: Information Science Edition, 2010, 28(6): 602-608. (in Chinese) doi: 10.3969/j.issn.1671-5896.2010.06.011
[13] Gabrilovich E, Markovitch S. Computing Semantic Relatedness Using Wikipedia-Based Explicit Semantic Analysis[C]// Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI'07). 2007: 1606-1611.
[14] Peng Lizhen, Wu Yangyang. Semantic Similarity Computing Based on Community Mining of Wikipedia[J]. Computer Science, 2016, 43(4): 45-49. (in Chinese)
[15] Zhan Zhijian, Liang Lina, Yang Xiaoping. Word Similarity Measurement Based on BaiduBaike[J]. Computer Science, 2013, 40(6): 199-202. (in Chinese) doi: 10.3969/j.issn.1002-137X.2013.06.043
[16] Mikolov T, Sutskever I, Chen K, et al. Distributed Representations of Words and Phrases and Their Compositionality[C]// Advances in Neural Information Processing Systems (NIPS 2013). 2013: 3111-3119.
[17] Shao Y. HCTI at SemEval-2017 Task 1: Use Convolutional Neural Network to Evaluate Semantic Textual Similarity[C]// Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). 2017: 130-133.
[18] Tai K S, Socher R, Manning C D. Improved Semantic Representations from Tree-Structured Long Short-Term Memory Networks[C]// Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics. 2015: 1556-1566.
[19] Kim H K, Kim H, Cho S. Bag-of-Concepts: Comprehending Document Representation Through Clustering Words in Distributed Representation[J]. Neurocomputing, 2017, 266: 336-352.
[20] Li Feng, Hou Jiaying, Zeng Rongren, et al. Research on Multi-feature Sentence Similarity Computing Method with Word Embedding[J]. Journal of Frontiers of Computer Science and Technology, 2017, 11(4): 608-618. (in Chinese) doi: 10.3778/j.issn.1673-9418.1604029
[21] Li Xiao, Xie Hui, Li Lijie. Research on Sentence Semantic Similarity Calculation Based on Word2Vec[J]. Computer Science, 2017, 44(9): 256-260. (in Chinese) doi: 10.11896/j.issn.1002-137X.2017.09.048
[22] Liu Haitao. Dependency Grammar from Theory to Practice[M]. Beijing: Science Press, 2009. (in Chinese)
[23] Choi J D, Palmer M. Guidelines for the Clear Style Constituent to Dependency Conversion[R]. University of Colorado Boulder, 2012.
[24] Cer D, Diab M, Agirre E, et al. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation[C]// Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). 2017: 1-14.
[25] Mikolov T, Chen K, Corrado G, et al. Efficient Estimation of Word Representations in Vector Space[OL]. arXiv Preprint, arXiv:1301.3781, 2013.
[26] Le Q, Mikolov T. Distributed Representations of Sentences and Documents[C]// Proceedings of the 31st International Conference on Machine Learning (ICML-14). 2014: 1188-1196.
[27] Kusner M, Sun Y, Kolkin N, et al. From Word Embeddings to Document Distances[C]// Proceedings of the 32nd International Conference on Machine Learning. 2015: 957-966.
[28] van der Maaten L, Hinton G. Visualizing Data Using t-SNE[J]. Journal of Machine Learning Research, 2008, 9: 2579-2605.