New Technology of Library and Information Service  2009, Vol. Issue (10): 67-70    DOI: 10.11925/infotech.1003-3513.2009.10.12
Research on the Application of WordNet in Text Clustering
Rao Yanghui (1,3), Ye Liang (2), Cheng Jie (2)
(1) National Science Library, Chinese Academy of Sciences, Beijing 100190, China
(2) Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
(3) Graduate University of Chinese Academy of Sciences, Beijing 100049, China
Abstract  

To address the "curse of dimensionality", cluster identification, and scalability problems that arise when text clustering algorithms are applied, a parallel text clustering method is proposed and implemented. The method uses WordNet to reduce the dimensionality of the word list and performs stemming based on POS tagging and WordNet. Compared with the Porter stemming method, the experimental results show that this method substantially reduces the dimension of the word list, improves the accuracy and recall of the clustering, and makes each cluster easier to interpret.
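
The contrast between suffix stripping and POS-aware, dictionary-based stemming can be illustrated with a minimal sketch. It is not the authors' implementation: the `TOY_LEMMAS` table is a hypothetical stand-in for WordNet lookups, the hand-tagged token list stands in for a POS tagger, and `strip_suffix` is a deliberately crude stripper rather than the Porter algorithm. It only shows why mapping tagged tokens to dictionary lemmas can shrink the word list more than stripping suffixes does.

```python
# Hypothetical stand-in for WordNet: (surface form, POS tag) -> lemma.
# A real system would query WordNet instead of this hand-written table.
TOY_LEMMAS = {
    ("ran", "v"): "run",
    ("running", "v"): "run",
    ("runs", "v"): "run",
    ("mice", "n"): "mouse",
    ("better", "a"): "good",
    ("meeting", "n"): "meeting",  # the noun keeps its own lemma
    ("meeting", "v"): "meet",     # the verb form reduces to "meet"
}


def lemmatize(token, pos):
    """Return the lemma of a POS-tagged token, or the token if unknown."""
    return TOY_LEMMAS.get((token, pos), token)


def strip_suffix(token):
    """Crude suffix stripper for contrast only (NOT the Porter algorithm)."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def vocabulary(terms):
    """Distinct terms in sorted order, i.e. the dimensions of the word list."""
    return sorted(set(terms))


# Hand-tagged tokens standing in for the output of a POS tagger.
tagged = [
    ("ran", "v"), ("running", "v"), ("runs", "v"), ("run", "v"),
    ("mice", "n"), ("mouse", "n"),
    ("better", "a"), ("good", "a"),
    ("meeting", "n"), ("meeting", "v"),
]

lemma_vocab = vocabulary(lemmatize(tok, pos) for tok, pos in tagged)
strip_vocab = vocabulary(strip_suffix(tok) for tok, _ in tagged)

print(len(strip_vocab), strip_vocab)
print(len(lemma_vocab), lemma_vocab)
```

On this toy input the suffix stripper leaves eight distinct terms, while the tagged lookup leaves five, because irregular forms such as "ran", "mice", and "better" are merged with their base forms and the noun and verb readings of "meeting" are kept apart.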

Key words: WordNet; POS tagging; Text clustering; Parallel K-Means
Received: 07 September 2009      Published: 25 October 2009
CLC number (ZTFLH): TP311
Corresponding author: Rao Yanghui, E-mail: raoyh@mail.las.ac.cn
About the authors: Rao Yanghui, Ye Liang, Cheng Jie

Cite this article:

Rao Yanghui, Ye Liang, Cheng Jie. Research on the Application of WordNet in Text Clustering. New Technology of Library and Information Service, 2009, (10): 67-70.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2009.10.12     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2009/V/I10/67

