New Technology of Library and Information Service  2013, Vol. 29 Issue (9): 88-92    DOI: 10.11925/infotech.1003-3513.2013.09.14
Research on Short Text Clustering Algorithm for User Generated Content
Zhao Hui, Liu Huailiang
School of Economics & Management, Xidian University, Xi’an 710071, China
Abstract  Short text features in user generated content have weak semantic descriptive power, and the traditional K-means algorithm for document clustering is sensitive to the choice of initial cluster centers. To address both problems, this paper first supplements the semantic feature information of short texts through feature extension based on the concepts, link structure, and category system of Wikipedia. It then builds a weighted complex network over the short text set from the semantic relations between texts, and clusters the texts by partitioning the network nodes into communities with a K-means algorithm whose initial cluster centers are chosen according to the composite characteristics of the network nodes. Experimental results show that the proposed algorithm improves the quality of short text clustering.
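The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: the Wikipedia-based feature extension is skipped and each text is assumed to be already represented by an extended feature vector; cosine similarity with a hypothetical `threshold` stands in for the semantic relation between texts; and weighted degree (node strength) stands in for the composite node characteristics, which the abstract does not specify. All function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def build_network(vectors, threshold=0.3):
    """Weighted adjacency matrix: an edge links two texts whose similarity
    reaches the threshold (an assumed cut-off), weighted by that similarity."""
    n = len(vectors)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine(vectors[i], vectors[j])
            if s >= threshold:
                weights[i][j] = weights[j][i] = s
    return weights


def pick_initial_centers(weights, k):
    """Choose k seed nodes: strongest nodes first, skipping any node that is
    directly linked to an already chosen seed. Node strength is an assumed
    stand-in for the paper's composite node characteristics."""
    strength = [sum(row) for row in weights]
    order = sorted(range(len(weights)), key=lambda i: -strength[i])
    centers = []
    for i in order:
        if len(centers) == k:
            break
        if all(weights[i][c] == 0.0 for c in centers):
            centers.append(i)
    # Fallback: if too few mutually unlinked nodes exist, fill by strength.
    for i in order:
        if len(centers) == k:
            break
        if i not in centers:
            centers.append(i)
    return centers


def kmeans(vectors, center_ids, iterations=20):
    """K-means with cosine similarity, seeded from the chosen network nodes."""
    centers = [list(vectors[i]) for i in center_ids]
    labels = None
    for _ in range(iterations):
        new_labels = [
            max(range(len(centers)), key=lambda c: cosine(v, centers[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy feature vectors for six short texts forming two obvious groups.
texts = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.2, 0.0],
    [0.0, 0.0, 1.0], [0.0, 0.1, 0.9], [0.1, 0.0, 0.9],
]
network = build_network(texts)
seeds = pick_initial_centers(network, k=2)
labels = kmeans(texts, seeds)
print(seeds, labels)
```

Seeding from strong, mutually unlinked nodes places each initial center in a dense and distinct region of the network, which is one plausible way to reduce the sensitivity to initialization that the abstract attributes to standard K-means.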
Key words  Short text clustering      Feature extension      Complex network      K-means algorithm      User generated content
Received: 02 July 2013      Published: 27 September 2013
CLC Number:  G350

Cite this article:

Zhao Hui, Liu Huailiang. Research on Short Text Clustering Algorithm for User Generated Content. New Technology of Library and Information Service, 2013, 29(9): 88-92.
