[Objective] This paper proposes a text clustering algorithm based on semantic similarity to extract more information from textual resources. [Methods] First, we calculated the semantic similarity of words using the Tongyici Cilin Extended Edition (an extended dictionary of synonyms) and built a semantic similarity matrix. Second, we clustered the texts based on this matrix. [Results] The proposed algorithm was evaluated on text corpora from Fudan University and the search engine Sogou. Compared with the traditional methods, the proposed algorithm achieved the highest precision and purity values (number of clusters = 10). [Limitations] Some similarity calculation results were manually adjusted because of the incomplete coverage of the Tongyici Cilin Extended Edition. [Conclusions] The proposed algorithm extracts more latent information from texts and is an effective method for clustering and recommending textual documents.
毕强, 刘健, 鲍玉来. 基于语义相似度的文本聚类研究*[J]. 数据分析与知识发现, 2016, 32(12): 9-16.
Qiang Bi, Jian Liu, Yulai Bao. A New Text Clustering Method Based on Semantic Similarity. Data Analysis and Knowledge Discovery, 2016, 32(12): 9-16.
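The two steps summarized in the abstract (word-level similarity feeding a text-level similarity matrix, then clustering on that matrix) can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the `WORD_SIM` scores are hypothetical stand-ins for values that would come from a synonym dictionary, the best-match averaging in `text_similarity` is one simple way to lift word similarity to text similarity, and average-linkage agglomerative clustering stands in for whichever clustering step the authors actually used.

```python
from itertools import combinations

# Hypothetical word-pair scores standing in for dictionary-derived similarity.
# These numbers are invented for the example, not taken from the paper.
WORD_SIM = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("car", "truck")): 0.7,
    frozenset(("automobile", "truck")): 0.7,
    frozenset(("stock", "share")): 0.9,
    frozenset(("stock", "bond")): 0.6,
    frozenset(("share", "bond")): 0.6,
}


def word_similarity(a, b):
    """Return 1.0 for identical words, a table score if known, else 0.0."""
    if a == b:
        return 1.0
    return WORD_SIM.get(frozenset((a, b)), 0.0)


def text_similarity(doc_a, doc_b):
    """Symmetric average of each word's best match in the other text."""
    def directed(src, dst):
        return sum(max(word_similarity(w, v) for v in dst) for w in src) / len(src)
    return (directed(doc_a, doc_b) + directed(doc_b, doc_a)) / 2


def similarity_matrix(docs):
    """Build the full text-by-text semantic similarity matrix."""
    n = len(docs)
    return [[text_similarity(docs[i], docs[j]) for j in range(n)] for i in range(n)]


def cluster(sim, k):
    """Average-linkage agglomerative clustering down to k clusters."""
    clusters = [[i] for i in range(len(sim))]

    def linkage(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        # Merge the pair of clusters with the highest average similarity.
        x, y = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return sorted(sorted(c) for c in clusters)


# Toy tokenized texts: two about vehicles, two about finance.
docs = [
    ["car", "truck"],
    ["automobile", "truck"],
    ["stock", "bond"],
    ["share", "bond"],
]
matrix = similarity_matrix(docs)
labels = cluster(matrix, k=2)
print(labels)  # [[0, 1], [2, 3]]
```

In this toy input the first two texts share no word other than "truck", yet they score 0.95 because "car" and "automobile" are near-synonyms in the lookup table. A plain bag-of-words overlap would miss that connection, which is the kind of latent information the abstract refers to.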