Research on the Application of WordNet in Text Clustering*

doi:10.11925/infotech.1003-3513.2009.10.12

New Technology of Library and Information Service

2009, Issue (10): 67-70. https://doi.org/10.11925/infotech.1003-3513.2009.10.12

Information Analysis and Research



Research on the Application of WordNet in Text Clustering

Rao Yanghui¹,³, Ye Liang², Cheng Jie²

¹(National Science Library, Chinese Academy of Sciences, Beijing 100190, China)
²(Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China)
³(Graduate University of Chinese Academy of Sciences, Beijing 100049, China)





Abstract

To address the curse of dimensionality, cluster naming, and scalability problems that arise when text clustering algorithms are applied in practice, this paper uses the WordNet dictionary to reduce the dimensionality of the word list and to stem its entries, and proposes and implements a parallel text clustering method that combines POS tagging with WordNet. The method is then compared with a text clustering method based on Porter stemming. Experimental results show that it substantially reduces the dimension of the word list, improves the accuracy and recall of the clustering, and makes each cluster easier to interpret.

Key words: WordNet; POS tagging; Text clustering; Parallel K-Means
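
The abstract's central idea, mapping POS-tagged words to dictionary lemmas so that inflected forms share one dimension of the word list, can be illustrated with a minimal sketch. This is not the authors' implementation: the `TOY_LEMMAS` table is a hypothetical stand-in for a real WordNet lookup, the POS tags are supplied by hand rather than by a tagger, and all function names are invented for illustration.

```python
from collections import Counter

# Hypothetical stand-in for a WordNet lookup: (surface form, POS) -> lemma.
# A real system would query WordNet itself; this table is for illustration only.
TOY_LEMMAS = {
    ("clusters", "n"): "cluster",
    ("clustering", "v"): "cluster",
    ("clustered", "v"): "cluster",
    ("documents", "n"): "document",
    ("meeting", "n"): "meeting",  # noun reading keeps its form
    ("meeting", "v"): "meet",     # verb reading reduces to the base verb
    ("met", "v"): "meet",
}


def lemmatize(word, pos):
    """Return the lemma of a POS-tagged word, or the lowercased word if unknown."""
    w = word.lower()
    return TOY_LEMMAS.get((w, pos), w)


def build_vocabulary(tagged_docs, normalise):
    """Collect the distinct terms that `normalise` yields over all documents."""
    return {normalise(word, pos) for doc in tagged_docs for word, pos in doc}


def term_vector(doc, vocab_index, normalise):
    """Count normalised terms of one document along a fixed vocabulary order."""
    counts = Counter(normalise(word, pos) for word, pos in doc)
    return [counts.get(term, 0) for term in vocab_index]


# Hand-tagged toy corpus (assumed input; a POS tagger would produce this).
TAGGED_DOCS = [
    [("Clusters", "n"), ("documents", "n"), ("clustered", "v")],
    [("meeting", "n"), ("documents", "n"), ("met", "v")],
    [("clustering", "v"), ("meeting", "v"), ("cluster", "n")],
]

# Word list without lemmatization (lowercasing only) versus with it.
raw_vocab = build_vocabulary(TAGGED_DOCS, lambda w, p: w.lower())
lemma_vocab = build_vocabulary(TAGGED_DOCS, lemmatize)

vocab_index = sorted(lemma_vocab)
vectors = [term_vector(doc, vocab_index, lemmatize) for doc in TAGGED_DOCS]

print(len(raw_vocab), "raw terms ->", len(lemma_vocab), "lemmas")
print(vocab_index)
print(vectors)
```

The POS tag matters for a word such as "meeting", which stays a noun lemma in one document and reduces to the verb "meet" in another, and because the surviving terms are dictionary words, cluster labels built from them stay readable. The resulting vectors could then be passed to any K-Means implementation, parallel or otherwise.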

Received: 2009-09-07 Published: 2009-10-25

TP311

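Funding:

*This paper is one of the research outcomes of the Chinese Academy of Sciences planning and strategy research project "Research on Frontier Trends in 21st-Century Science and Technology Development" (Project No. KACX1-YW-0733).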

Corresponding author: Rao Yanghui E-mail: raoyh@mail.las.ac.cn

About the authors: Rao Yanghui, Ye Liang, Cheng Jie

Cite this article:

Rao Yanghui, Ye Liang, Cheng Jie. Research on the Application of WordNet in Text Clustering[J]. New Technology of Library and Information Service, 2009, (10): 67-70.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2009.10.12 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2009/V/I10/67

