New Technology of Library and Information Service  2014, Vol. 30 Issue (11): 45-52    DOI: 10.11925/infotech.1003-3513.2014.11.07
Using Content and Tags for Web Text Clustering
Gu Xiaoxue, Zhang Chengzhi
School of Economics & Management, Nanjing University of Science & Technology, Nanjing 210094, China
[Objective] This paper explores how combining social tags with text content influences Web text clustering. [Methods] Taking English and Chinese blogs as examples, the authors extract text features with TF×IDF, TextRank and TextRank×IDF, combine tag similarity with content similarity using two weighting methods (linear and Sigmoid), and cluster the samples with the AP (affinity propagation) clustering algorithm. [Results] The results show which of the three feature extraction methods performs best in clustering. Weighting content with tags improves the clustering of English blogs to varying degrees, but with the Sigmoid method it does not improve the clustering of Chinese blogs. Of the two similarity weighting methods, the linear method performs better than the Sigmoid method. [Limitations] The authors could not find the optimal weight coefficients for tag similarity and content similarity. The AP clustering algorithm does not scale to large data sets, and the large number of resulting clusters interferes with visualization. [Conclusions] Weighting the similarity of social tags and text content can improve Web text clustering.
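
The linear weighting of tag similarity and content similarity described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes cosine similarity over TF×IDF vectors for content and Jaccard overlap for tags (the abstract does not name the tag similarity measure), and the function names and the coefficient `alpha` are hypothetical. The Sigmoid weighting is omitted because the abstract does not give its form.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build a TF x IDF weight dict for each tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def jaccard(a, b):
    """Jaccard overlap of two tag lists (assumed tag similarity measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def combined_similarity(docs, tags, alpha=0.5):
    """Linear weighting: alpha * content similarity + (1 - alpha) * tag similarity.

    alpha is a hypothetical coefficient; the paper reports that no optimal
    value was found.
    """
    vecs = tfidf_vectors(docs)
    n = len(docs)
    return [
        [
            alpha * cosine(vecs[i], vecs[j]) + (1 - alpha) * jaccard(tags[i], tags[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy blog posts (tokens) and their social tags, invented for illustration.
docs = [
    ["python", "code", "tutorial"],
    ["python", "code", "tips"],
    ["pasta", "recipe", "dinner"],
]
tags = [["programming", "python"], ["programming"], ["food", "cooking"]]
sim = combined_similarity(docs, tags, alpha=0.5)
```

A similarity matrix of this kind could then be passed to an affinity propagation implementation that accepts precomputed similarities, for example scikit-learn's `AffinityPropagation(affinity="precomputed")`.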

Key words: Social tag; Feature selection; Text clustering
Received: 07 May 2014      Published: 18 December 2014
PACS:  G250  

Gu Xiaoxue, Zhang Chengzhi. Using Content and Tags for Web Text Clustering. New Technology of Library and Information Service, 2014, 30(11): 45-52.

