[Objective] This study investigates the co-occurrence of blog comment contributors and explores its role in clustering blog posts. [Methods] We developed a two-step clustering method. First, we constructed a co-occurrence matrix of the contributors across different blog posts, transformed it into a correlation matrix, and completed the first-step clustering with the Affinity Propagation (AP) algorithm. Second, we calculated the position weights of terms based on the centers of the AP clusters and completed the second-step clustering of blog post content with the K-means algorithm. [Results] The average precision and recall of the proposed method were 0.66 and 0.57, respectively, significantly higher than those of traditional methods. [Limitations] Contributor co-occurrence improved clustering quality, but it is of limited value for blog posts with few comments. [Conclusions] The proposed method improves the quality of blog post clustering by combining terms with contributor co-occurrence. The two-step approach is also a better way to select the initial cluster centers for the K-means algorithm.
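The first step of the method (co-occurrence matrix, then correlation matrix) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the matrix is laid out or which correlation measure is used, so the post-by-post layout, the diagonal holding each post's own commenter count, the choice of Pearson correlation, and all function names and toy data here are assumptions made for illustration.

```python
from math import sqrt


def cooccurrence_matrix(commenters_by_post):
    """Post-by-post matrix: entry [i][j] counts commenters shared by posts i and j.

    Assumption: the diagonal holds each post's own commenter count.
    """
    return [[len(a & b) for b in commenters_by_post] for a in commenters_by_post]


def pearson(x, y):
    """Pearson correlation of two equal-length rows; 0.0 if either row is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlation_matrix(matrix):
    """Correlate every pair of rows of the co-occurrence matrix."""
    return [[pearson(row_i, row_j) for row_j in matrix] for row_i in matrix]


# Hypothetical toy data: each set lists the commenters on one blog post.
posts = [
    {"ann", "bob", "cat"},
    {"ann", "bob"},
    {"dan", "eve"},
    {"dan", "eve", "fay"},
]

counts = cooccurrence_matrix(posts)
similarity = correlation_matrix(counts)
```

In this toy data, posts 0 and 1 share two commenters and correlate strongly, while posts 0 and 2 share none and correlate negatively. A matrix of this kind is the sort of input that an Affinity Propagation implementation accepting precomputed similarities could cluster in the first step.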
Gong Kaile, Cheng Ying, Sun Jianjun. Clustering Blog Posts with Co-occurrence Analysis[J]. New Technology of Library and Information Service, 2016, 32(10): 50-58.