New Technology of Library and Information Service  2014, Vol. 30 Issue (11): 45-52    DOI: 10.11925/infotech.1003-3513.2014.11.07
Using Content and Tags for Web Text Clustering
Gu Xiaoxue, Zhang Chengzhi
School of Economics & Management, Nanjing University of Science & Technology, Nanjing 210094, China

[Objective] This paper explores how combining social tags with text content influences Web text clustering. [Methods] Taking English and Chinese blogs as examples, the study uses TF×IDF, TextRank and TextRank×IDF as text feature extraction methods, combines tag similarity with content similarity using two weighting methods (linear and Sigmoid), and clusters the samples with the AP clustering algorithm. [Results] The results show that [...] performs best in clustering among the three feature extraction methods. Weighting content with tags improves the clustering of English blogs to varying degrees, but does not improve the clustering of Chinese blogs under the Sigmoid method. Of the two similarity weighting methods, the linear method performs better than the Sigmoid method. [Limitations] The authors could not find the best weight coefficients for tag similarity and content similarity. The AP clustering algorithm cannot be applied to big data, and the large number of resulting clusters interferes with visualization. [Conclusions] Weighting the similarity of social tags and text content can improve Web text clustering.
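
The similarity weighting described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything here is an assumption made for illustration: the function names, the weight `alpha`, the Sigmoid `slope`, the shift of 0.5, and the toy term weights are all hypothetical and not taken from the paper. The sketch only shows the general idea of merging a content similarity and a tag similarity into one score, either linearly or after passing each score through a Sigmoid.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


def linear_combine(content_sim, tag_sim, alpha=0.5):
    """Linear weighting: alpha is a hypothetical content weight in [0, 1]."""
    return alpha * content_sim + (1.0 - alpha) * tag_sim


def sigmoid(x, slope=5.0):
    """Sigmoid centred at 0.5 (assumed form), so mid-range scores stay at 0.5."""
    return 1.0 / (1.0 + math.exp(-slope * (x - 0.5)))


def sigmoid_combine(content_sim, tag_sim, alpha=0.5, slope=5.0):
    """Sigmoid weighting: squash each similarity first, then mix linearly."""
    return (alpha * sigmoid(content_sim, slope)
            + (1.0 - alpha) * sigmoid(tag_sim, slope))


# Toy blog posts: hypothetical term weights for content and for tags.
doc_a = {"content": {"cluster": 0.8, "text": 0.6},
         "tags": {"nlp": 1.0, "clustering": 1.0}}
doc_b = {"content": {"cluster": 0.6, "text": 0.8},
         "tags": {"nlp": 1.0}}

content_sim = cosine(doc_a["content"], doc_b["content"])
tag_sim = cosine(doc_a["tags"], doc_b["tags"])

linear_score = linear_combine(content_sim, tag_sim, alpha=0.5)
sigmoid_score = sigmoid_combine(content_sim, tag_sim, alpha=0.5)
```

In a full pipeline, scores like these would be computed for every pair of documents and the resulting similarity matrix passed to the clustering step. Choosing `alpha` is exactly the open question the abstract lists under its limitations.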

Key words: Social tag, Feature selection, Text clustering
Received: 07 May 2014      Published: 18 December 2014
CLC number: G250

Cite this article:

Gu Xiaoxue, Zhang Chengzhi. Using Content and Tags for Web Text Clustering. New Technology of Library and Information Service, 2014, 30(11): 45-52.


