Research on Short Text Clustering Algorithm for User Generated Content
Zhao Hui, Liu Huailiang
School of Economics & Management, Xidian University, Xi’an 710071, China
Abstract: Short texts in user generated content have weak semantic description ability, and the traditional K-means algorithm for document clustering is sensitive to the choice of initial cluster centers. To address both problems, this paper proposes supplementing the semantic feature information of short texts through feature extension based on the concepts, link structure and category system of Wikipedia. A weighted complex network of the short text set is then built from the semantic relations between texts, and clustering is achieved by partitioning the network nodes into communities with a K-means algorithm whose initial cluster centers are chosen according to the synthetic characteristics of the network nodes. Experimental results show that the proposed algorithm improves the effectiveness of short text clustering.
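The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: the toy vectors in `docs` stand in for short texts that have already been extended with Wikipedia-derived features, cosine similarity stands in for the semantic relation between texts, and weighted node degree (strength) stands in for the "synthetic characteristics" of network nodes, which the abstract does not define. All function names (`build_network`, `pick_seeds`, `kmeans`) and the `threshold` parameter are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_network(vectors, threshold=0.1):
    """Weighted adjacency matrix; an edge carries the similarity if it reaches the threshold."""
    n = len(vectors)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine(vectors[i], vectors[j])
            if s >= threshold:
                adj[i][j] = adj[j][i] = s
    return adj

def pick_seeds(adj, k):
    """Choose k seed nodes by descending strength, skipping nodes linked to an earlier seed.

    Assumption: node strength (sum of edge weights) is used as the node score.
    """
    strength = [sum(row) for row in adj]
    order = sorted(range(len(adj)), key=lambda i: -strength[i])
    seeds = []
    for i in order:
        if len(seeds) < k and all(adj[i][s] == 0.0 for s in seeds):
            seeds.append(i)
    for i in order:  # fallback: fill up with the strongest remaining nodes
        if len(seeds) < k and i not in seeds:
            seeds.append(i)
    return seeds

def mean_vector(members):
    """Component-wise mean of sparse vectors."""
    total = Counter()
    for v in members:
        total.update(v)  # adds the weights term by term
    return {t: w / len(members) for t, w in total.items()}

def kmeans(vectors, seeds, iterations=10):
    """K-means on sparse vectors with cosine similarity and the given seed nodes."""
    centers = [dict(vectors[s]) for s in seeds]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(len(centers)), key=lambda c: cosine(v, centers[c]))
                  for v in vectors]
        for c in range(len(centers)):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centers[c] = mean_vector(members)
    return labels

# Toy short texts, assumed to be already extended with related concepts.
docs = [
    {"football": 1.0, "sport": 1.0},
    {"goal": 1.0, "sport": 1.0, "football": 0.5},
    {"match": 1.0, "sport": 1.0},
    {"stock": 1.0, "finance": 1.0},
    {"bank": 1.0, "finance": 1.0, "stock": 0.5},
    {"loan": 1.0, "finance": 1.0},
]
adj = build_network(docs)
seeds = pick_seeds(adj, k=2)
labels = kmeans(docs, seeds)
print(seeds, labels)
```

Choosing seeds from well-connected but mutually unlinked nodes is one simple way to avoid the random initialization that makes standard K-means unstable; the node score actually used in the paper may combine several network measures.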
Received: 02 July 2013
Published: 27 September 2013