New Technology of Library and Information Service  2014, Vol. 30 Issue (11): 45-52    DOI: 10.11925/infotech.1003-3513.2014.11.07
Using Content and Tags for Web Text Clustering
Gu Xiaoxue, Zhang Chengzhi
School of Economics & Management, Nanjing University of Science & Technology, Nanjing 210094, China
Abstract  

[Objective] This paper explores how combining social tags with text content influences Web text clustering. [Methods] Taking English and Chinese blogs as examples, the study uses TF×IDF, TextRank and TextRank×IDF as text feature extraction methods, combines tag similarity with content similarity under two weighting methods (linear and Sigmoid), and clusters the samples with the AP (affinity propagation) clustering algorithm. [Results] The results identify the best-performing of the three feature extraction methods for clustering. Weighting content with tags improves the clustering of English blogs to varying degrees, but with the Sigmoid method it does not improve the clustering of Chinese blogs. Of the two similarity weighting methods, the linear method performs better than the Sigmoid method. [Limitations] The authors could not determine the best weight coefficients for tag similarity and content similarity. The AP clustering algorithm does not scale to large data sets, and the large number of resulting clusters hampers visualization. [Conclusions] Weighting the similarity of social tags together with that of text content can improve the clustering of Web text.
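
The two weighting schemes named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything here is an assumption: `alpha` is a hypothetical weight on tag similarity, the linear form is a plain weighted sum, and the Sigmoid form passes each similarity through a logistic function centred at 0.5 (with a hypothetical `steepness` parameter) before the weighted sum. The function names and the toy term weights are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def linear_combine(sim_content, sim_tag, alpha=0.5):
    """Assumed linear weighting: alpha on tags, (1 - alpha) on content."""
    return alpha * sim_tag + (1.0 - alpha) * sim_content


def sigmoid(x, steepness=5.0):
    """Logistic function; steepness is a hypothetical tuning parameter."""
    return 1.0 / (1.0 + math.exp(-steepness * x))


def sigmoid_combine(sim_content, sim_tag, alpha=0.5, steepness=5.0):
    """Assumed Sigmoid weighting: squash each similarity around 0.5,
    then take the same weighted sum as in the linear case."""
    return (alpha * sigmoid(sim_tag - 0.5, steepness)
            + (1.0 - alpha) * sigmoid(sim_content - 0.5, steepness))


# Toy blog posts: content term weights (e.g. from TF×IDF) and tag weights.
content_a = {"cluster": 0.8, "tag": 0.3, "blog": 0.5}
content_b = {"cluster": 0.6, "blog": 0.4, "weight": 0.7}
tags_a = {"clustering": 1.0, "web": 1.0}
tags_b = {"clustering": 1.0, "nlp": 1.0}

sim_content = cosine(content_a, content_b)
sim_tag = cosine(tags_a, tags_b)

print("content similarity:", round(sim_content, 3))
print("tag similarity:", round(sim_tag, 3))
print("linear:", round(linear_combine(sim_content, sim_tag, alpha=0.4), 3))
print("sigmoid:", round(sigmoid_combine(sim_content, sim_tag, alpha=0.4), 3))
```

Computing such a combined score for every pair of documents yields a similarity matrix, which is the kind of input an affinity propagation implementation that accepts precomputed similarities can cluster directly.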

Key words: Social tag; Feature selection; Text clustering
Received: 07 May 2014      Published: 18 December 2014
PACS:  G250  

Cite this article:

Gu Xiaoxue, Zhang Chengzhi. Using Content and Tags for Web Text Clustering. New Technology of Library and Information Service, 2014, 30(11): 45-52.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2014.11.07     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2014/V30/I11/45

