Data Analysis and Knowledge Discovery, 2016, Vol. 32, Issue 12: 9-16. DOI: 10.11925/infotech.1003-3513.2016.12.02
Research Paper
A New Text Clustering Method Based on Semantic Similarity*
Qiang Bi1, Jian Liu1, Yulai Bao1,2
1School of Management, Jilin University, Changchun 130022, China
2Inner Mongolia University Library, Hohhot 010021, China
Abstract

[Objective] Traditional text clustering cannot fully exploit the semantic information in text resources, and its similarity matrices are high-dimensional and sparse. To address these problems and further improve clustering quality, this paper proposes a text clustering method based on semantic similarity. [Methods] We compute the semantic similarity of words with the Tongyici Cilin Extended Edition (an extended Chinese synonym thesaurus) and derive a text semantic similarity matrix from it. We then apply spectral clustering to this matrix to group the texts into clusters. [Results] We ran both traditional clustering algorithms and the proposed algorithm on texts from the Fudan University text corpus and the Sogou text corpus. With the number of clusters set to 10, the proposed algorithm reached its highest accuracy, and its Purity value exceeded that of the traditional clustering algorithms. [Limitations] The Tongyici Cilin Extended Edition does not fully cover domain terms, so some similarity results had to be adjusted manually. [Conclusions] The method takes the semantic relations between words into account, mines latent information in the texts, and improves clustering quality, offering a new approach to text clustering and recommendation.

Key words: Tongyici Cilin Extended Edition; Semantic similarity; Spectral clustering; Text mining
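
The pipeline summarized in the abstract (a text semantic similarity matrix, then spectral clustering, then evaluation by Purity) can be illustrated with a small sketch. It is not the authors' implementation. It assumes NumPy is available, uses a normalized spectral clustering variant in the style of Ng, Jordan and Weiss (the abstract does not say which variant the paper uses), and runs on a hypothetical 6-document similarity matrix with made-up class labels. The function names `spectral_clusters` and `purity` are illustrative only.

```python
import numpy as np
from collections import Counter


def spectral_clusters(S, k, n_iter=50):
    """Group items into k clusters from a symmetric similarity matrix S.

    Assumption: normalized spectral clustering (Ng-Jordan-Weiss style);
    the variant used in the paper is not stated in the abstract.
    """
    A = np.asarray(S, dtype=float).copy()
    np.fill_diagonal(A, 0.0)                      # ignore self-similarity
    d = A.sum(axis=1)                             # node degrees
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]   # D^{-1/2} A D^{-1/2}
    _, vecs = np.linalg.eigh(L)                   # eigenvalues ascending
    X = vecs[:, -k:]                              # k leading eigenvectors
    X = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)

    # Plain k-means on the embedded rows; farthest-point initialization
    # keeps the sketch deterministic.
    centers = [X[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(dist))])
    centers = np.array(centers)
    for _ in range(n_iter):
        sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(sq, axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels.tolist()


def purity(labels, truth):
    """Fraction of items that belong to the majority class of their cluster."""
    total = 0
    for c in set(labels):
        members = [t for l, t in zip(labels, truth) if l == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(truth)


# Hypothetical text semantic similarity matrix: documents 0-2 and 3-5
# are mutually similar, with weak similarity across the two groups.
S = [
    [1.00, 0.90, 0.80, 0.10, 0.10, 0.10],
    [0.90, 1.00, 0.85, 0.10, 0.10, 0.10],
    [0.80, 0.85, 1.00, 0.10, 0.10, 0.10],
    [0.10, 0.10, 0.10, 1.00, 0.90, 0.80],
    [0.10, 0.10, 0.10, 0.90, 1.00, 0.85],
    [0.10, 0.10, 0.10, 0.80, 0.85, 1.00],
]
truth = ["sports"] * 3 + ["finance"] * 3   # made-up class labels
labels = spectral_clusters(S, k=2)
print(labels, purity(labels, truth))
```

Working from a precomputed similarity matrix, as in this sketch, is what lets a thesaurus-based similarity measure replace the usual bag-of-words vectors: the clustering step never sees the individual terms, only the pairwise document similarities.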
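Received: 2016-09-12
Funding: *This work is one of the research outcomes of the National Natural Science Foundation of China project "Research on Multi-dimensional Aggregation and Visualization of Digital Library Resources in the Semantic Network Environment" (Grant No. 71273111).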
Cite this article:
Qiang Bi, Jian Liu, Yulai Bao. A New Text Clustering Method Based on Semantic Similarity[J]. Data Analysis and Knowledge Discovery, 2016, 32(12): 9-16. DOI: 10.11925/infotech.1003-3513.2016.12.02.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.12.02