Research on the Application of WordNet in Text Clustering
Rao Yanghui 1,3, Ye Liang 2, Cheng Jie 2
1 (National Science Library, Chinese Academy of Sciences, Beijing 100190, China)
2 (Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China)
3 (Graduate University of Chinese Academy of Sciences, Beijing 100049, China)
Abstract: To address the “curse of dimensionality”, cluster identification, and large-scale data problems that arise when text clustering algorithms are applied, a parallel text clustering method is proposed and implemented. The method uses WordNet to reduce the dimensionality of the word list, and performs stemming based on POS tagging and WordNet. Experimental results show that, compared with the Porter stemming method, this method substantially reduces the dimension of the word list, improves the accuracy and recall of the clustering, and makes each cluster easier to interpret.
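The idea of POS-based stemming against a lexical database can be illustrated with a minimal sketch. This is not the authors' implementation: the `TOY_LEXICON` table, the tag names, and the helper functions below are hypothetical stand-ins for a real POS tagger and a real WordNet lookup, used only to show how mapping inflected forms to a shared base form shrinks the word list.

```python
from collections import Counter

# Hypothetical stand-in for a WordNet lookup: (surface form, POS) -> base form.
# A real system would query WordNet itself; these entries are illustrative only.
TOY_LEXICON = {
    ("ran", "VERB"): "run",
    ("running", "VERB"): "run",
    ("runs", "VERB"): "run",
    ("better", "ADJ"): "good",
    ("mice", "NOUN"): "mouse",
    ("clusters", "NOUN"): "cluster",
}


def lemmatize(word, pos):
    """Return the base form for (word, pos), or the lowercased word if unknown."""
    return TOY_LEXICON.get((word.lower(), pos), word.lower())


def build_word_list(tagged_tokens):
    """Count terms after POS-aware lemmatization of (word, pos) pairs."""
    return Counter(lemmatize(word, pos) for word, pos in tagged_tokens)


# Tokens as a POS tagger might emit them (the tags here are assumed, not computed).
tagged = [
    ("Ran", "VERB"), ("running", "VERB"), ("runs", "VERB"),
    ("mice", "NOUN"), ("clusters", "NOUN"), ("cluster", "NOUN"),
    ("better", "ADJ"),
]

raw_vocab = {word.lower() for word, _ in tagged}
word_list = build_word_list(tagged)

# Vocabulary size before and after merging inflected forms.
print(len(raw_vocab), "->", len(word_list))
```

Irregular forms such as "ran" and "mice" are the cases where a dictionary lookup keyed on part of speech differs most from rule-based suffix stripping, which has no rule that maps them to "run" and "mouse".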
Received: 07 September 2009
Published: 25 October 2009
Corresponding Authors:
Rao Yanghui
E-mail: raoyh@mail.las.ac.cn
About the authors: Rao Yanghui, Ye Liang, Cheng Jie