New Technology of Library and Information Service  2016, Vol. 32 Issue (10): 50-58    DOI: 10.11925/infotech.1003-3513.2016.10.06
Original Article
Clustering Blog Posts with Co-occurrence Analysis
Gong Kaile, Cheng Ying, Sun Jianjun
School of Information Management, Nanjing University, Nanjing 210023, China
Abstract  

[Objective] This study investigates the co-occurrence of blog comment contributors and explores its role in clustering blog posts. [Methods] We developed a two-step clustering method. First, we constructed the co-occurrence matrix of comment contributors across blog posts, transformed it into a correlation matrix, and completed the first-step clustering with the Affinity Propagation (AP) algorithm. Second, we calculated the position weights of terms based on the AP cluster centers and completed the second-step clustering of blog post content with the K-means algorithm. [Results] The average precision and recall of the proposed method were 0.66 and 0.57, respectively, both significantly higher than those of the traditional methods. [Limitations] Contributor co-occurrence improved clustering quality, but it is of limited value for blog posts with few comments. [Conclusions] The proposed method improves the quality of blog post clustering by combining terms with contributor co-occurrence. The two-step method is also a better option for selecting the initial cluster centers of the K-means algorithm.
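The data flow described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the co-occurrence matrix is assumed to be post-by-post (counting comment contributors shared by two posts, with each post's own contributor count on the diagonal), the correlation is assumed to be Pearson's r between matrix rows, the AP step is replaced by hand-picked exemplar indices, and the position weighting of terms is omitted. All function names and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cooccurrence_matrix(commenters_by_post):
    """Post-by-post matrix: entry (i, j) counts contributors shared by posts i and j."""
    sets = [set(c) for c in commenters_by_post]
    n = len(sets)
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        m[i][i] = len(sets[i])  # assumption: diagonal holds the post's own contributor count
    for i, j in combinations(range(n), 2):
        m[i][j] = m[j][i] = len(sets[i] & sets[j])
    return m


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_matrix(m):
    """Turn a co-occurrence matrix into a row-by-row correlation matrix."""
    return [[pearson(ri, rj) for rj in m] for ri in m]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def seeded_kmeans(vectors, initial_centers, iters=20):
    """Plain K-means that starts from the given centers instead of random ones."""
    centers = [list(c) for c in initial_centers]
    labels = []
    for _ in range(iters):
        # Assign every vector to its nearest center.
        labels = [min(range(len(centers)), key=lambda k: sq_dist(v, centers[k]))
                  for v in vectors]
        # Move each center to the mean of its members (empty clusters stay put).
        for k in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy data (hypothetical): comment contributors per post and term weights per post.
commenters = [{"ann", "bob", "cat"}, {"ann", "bob"}, {"dan", "eve"}, {"dan", "eve", "fay"}]
term_vectors = [[1.0, 0.9, 0.0], [0.8, 1.0, 0.1], [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]

cooc = cooccurrence_matrix(commenters)
corr = correlation_matrix(cooc)

# Stand-in for the AP step: exemplar posts chosen by hand for this toy example.
exemplars = [0, 3]
labels = seeded_kmeans(term_vectors, [term_vectors[i] for i in exemplars])
```

In a fuller version, the exemplars would come from running an Affinity Propagation implementation on `corr` as a precomputed similarity matrix, so that the second-step K-means starts from data-driven centers rather than random ones.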

Key words: Co-occurrence analysis; Text clustering; Blog comments contributor; Initial cluster centers
Received: 04 May 2016      Published: 23 November 2016

Cite this article:

Gong Kaile, Cheng Ying, Sun Jianjun. Clustering Blog Posts with Co-occurrence Analysis. New Technology of Library and Information Service, 2016, 32(10): 50-58.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2016.10.06     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2016/V32/I10/50
