Data Analysis and Knowledge Discovery  2016, Vol. 32 Issue (12): 9-16    DOI: 10.11925/infotech.1003-3513.2016.12.02
Original Article
A New Text Clustering Method Based on Semantic Similarity
Qiang Bi1, Jian Liu1, Yulai Bao1,2
1School of Management, Jilin University, Changchun 130022, China
2Inner Mongolia University Library, Hohhot 010021, China
Abstract  

[Objective] This paper proposes a text clustering algorithm based on semantic similarity that extracts more information from textual resources. [Methods] First, we calculated the semantic similarity of words using the Tongyici Cilin Extended Edition (an extended dictionary of synonyms) and built a semantic similarity matrix. Second, we clustered the texts based on this matrix. [Results] We evaluated the proposed algorithm on text corpora from Fudan University and the search engine Sogou. Compared with traditional methods, it achieved the highest precision and purity values when the number of clusters was 10. [Limitations] Some similarity calculation results were adjusted manually because of the incomplete coverage of the Tongyici Cilin Extended Edition. [Conclusions] The proposed algorithm extracts more latent information from texts and is an effective method for clustering and recommending textual documents.
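The two ends of this pipeline, building a text-level similarity matrix from word-level similarities and scoring a clustering by purity, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `WORD_SIM` table and its scores are invented stand-ins for values that would come from the Tongyici Cilin Extended Edition, and the best-match averaging in `text_similarity` is an assumed aggregation rule that the abstract does not specify. The clustering step itself is omitted; the resulting matrix would serve as the affinity input to a clustering routine.

```python
from collections import Counter

# Hypothetical word-pair similarities in [0, 1]. In the paper these would be
# derived from the Tongyici Cilin Extended Edition; the values here are made up.
WORD_SIM = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("engine", "motor")): 0.8,
}


def word_similarity(a, b):
    """Look up the similarity of two words; unknown pairs score 0.0."""
    if a == b:
        return 1.0
    return WORD_SIM.get(frozenset((a, b)), 0.0)


def text_similarity(doc_a, doc_b):
    """Symmetric average of best-match word similarities (assumed rule)."""
    if not doc_a or not doc_b:
        return 0.0

    def directed(src, dst):
        # For each word in src, take its best match in dst, then average.
        return sum(max(word_similarity(w, v) for v in dst) for w in src) / len(src)

    return 0.5 * (directed(doc_a, doc_b) + directed(doc_b, doc_a))


def similarity_matrix(docs):
    """Pairwise text similarities as a list of lists (n x n, symmetric)."""
    n = len(docs)
    return [[text_similarity(docs[i], docs[j]) for j in range(n)] for i in range(n)]


def purity(cluster_labels, true_labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    if len(cluster_labels) != len(true_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("at least one document is required")
    per_cluster = {}
    for cluster, truth in zip(cluster_labels, true_labels):
        per_cluster.setdefault(cluster, Counter())[truth] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(true_labels)


# Toy documents, already tokenized into words.
docs = [["car", "engine"], ["automobile", "motor"], ["stock", "market"]]
matrix = similarity_matrix(docs)
```

In this toy setup the first two documents share no surface words, yet they receive a high similarity through the synonym scores, which is the kind of latent information the abstract refers to. A purity of 1.0 means every cluster contains documents from a single class.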

Key words: Tongyici Cilin Extended Edition; Semantic similarity; Spectral clustering; Text mining
Received: 12 September 2016      Published: 22 January 2017

Cite this article:

Qiang Bi, Jian Liu, Yulai Bao. A New Text Clustering Method Based on Semantic Similarity. Data Analysis and Knowledge Discovery, 2016, 32(12): 9-16.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2016.12.02     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2016/V32/I12/9
