Clustering Blog Posts with Co-occurrence Analysis
Gong Kaile, Cheng Ying, Sun Jianjun
School of Information Management, Nanjing University, Nanjing 210023, China
Abstract [Objective] This study investigates the co-occurrence of blog comment contributors and explores its role in clustering blog posts. [Methods] We developed a two-step clustering method. First, we constructed a co-occurrence matrix of comment contributors across blog posts, transformed it into a correlation matrix, and completed the first-step clustering with the Affinity Propagation (AP) algorithm. Second, we calculated the position weights of terms based on the AP cluster centers and completed the second-step clustering of blog post content with the K-means algorithm. [Results] The average precision and recall of the proposed method were 0.66 and 0.57, respectively, both significantly higher than those of the traditional methods. [Limitations] Contributor co-occurrence improved clustering quality, but it is of limited value for blog posts with few comments. [Conclusions] The proposed method improves the quality of blog post clustering by combining term information with contributor co-occurrence. The two-step approach also offers a better way to select the initial cluster centers for the K-means algorithm.
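The first step described in the abstract (co-occurrence matrix, then correlation matrix) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that the matrix is post-by-post, that each off-diagonal cell counts the comment contributors two posts share, that the diagonal holds each post's own contributor count, and that Pearson correlation between rows is the transformation; the abstract specifies none of these details. The toy data and the function names `cooccurrence_matrix`, `pearson`, and `correlation_matrix` are hypothetical.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy data: post id -> set of comment contributors.
commenters = {
    "p1": {"ann", "bob", "cat"},
    "p2": {"ann", "bob"},
    "p3": {"dan", "eve"},
    "p4": {"dan", "eve", "bob"},
}


def cooccurrence_matrix(commenters):
    """Build a symmetric post-by-post matrix of shared contributor counts."""
    posts = sorted(commenters)
    n = len(posts)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        shared = len(commenters[posts[i]] & commenters[posts[j]])
        matrix[i][j] = matrix[j][i] = shared
    # Assumption: the diagonal is the post's own contributor count.
    for i in range(n):
        matrix[i][i] = len(commenters[posts[i]])
    return posts, matrix


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


def correlation_matrix(matrix):
    """Correlate every pair of rows of the co-occurrence matrix."""
    n = len(matrix)
    return [[pearson(matrix[i], matrix[j]) for j in range(n)] for i in range(n)]


posts, cooc = cooccurrence_matrix(commenters)
corr = correlation_matrix(cooc)

for post, row in zip(posts, corr):
    print(post, [round(value, 2) for value in row])
```

In this toy data, p1 and p2 correlate positively because they share two contributors, while p1 and p3 correlate negatively because they share none. A correlation matrix of this kind could then serve as the similarity input for an AP implementation that accepts precomputed similarities, for example scikit-learn's `AffinityPropagation(affinity="precomputed")`.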
Received: 04 May 2016
Published: 23 November 2016