New Technology of Library and Information Service  2016, Vol. 32, Issue 10: 50-58     https://doi.org/10.11925/infotech.1003-3513.2016.10.06
Research Paper
Clustering Blog Posts with Co-occurrence Analysis*
Gong Kaile, Cheng Ying, Sun Jianjun
School of Information Management, Nanjing University, Nanjing 210023, China
Abstract

[Objective] This study treats the co-occurrence of blog post participants (such as comment contributors) as a feature and examines its value for clustering blog posts. [Methods] We propose a two-step clustering method. First, we construct the co-occurrence matrix of participants across different blog posts, convert it to a correlation matrix, and complete the first-step clustering with the Affinity Propagation (AP) algorithm. Second, we take the centroids of the AP clusters as the initial cluster centers, apply position weighting to the terms, and complete the second-step clustering of blog post content with the K-means algorithm. [Results] The algorithm that combines participant co-occurrence with position-weighted terms reached an average precision of 0.66 and a purity of 0.57, significantly better than the comparison experiments. [Limitations] The main contribution of this study is the use of participant co-occurrence as a feature, so the method has limited value for blog posts with few participants. [Conclusions] Clustering that integrates term features with participant features significantly improves clustering quality, and the two-step method offers a feasible way to select the initial cluster centers of the K-means algorithm.

Key words: Co-occurrence analysis; Text clustering; Blog post participants; Initial cluster centers
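
The two-step procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the co-occurrence matrix is taken to be post-by-post (shared participant counts, with each post's own participant count on the diagonal), the correlation is Pearson's r between matrix rows, and `title_weight=3.0` is a hypothetical position weight. The AP step itself is not implemented here; `first_step_labels` is a hand-written placeholder for the grouping that AP would produce from the correlation matrix.

```python
from itertools import combinations
from math import sqrt


def cooccurrence_matrix(participants):
    """Post-by-post matrix: entry (i, j) counts participants shared by posts i and j."""
    n = len(participants)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        # Assumption: the diagonal holds the post's own participant count.
        matrix[i][i] = len(participants[i])
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = len(participants[i] & participants[j])
    return matrix


def pearson(x, y):
    """Pearson correlation of two equal-length rows (0.0 if either row is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_matrix(matrix):
    """Convert a co-occurrence matrix into a row-by-row correlation matrix."""
    return [[pearson(r1, r2) for r2 in matrix] for r1 in matrix]


def position_weighted_vector(title, body, vocabulary, title_weight=3.0):
    """Term-frequency vector in which a title term counts title_weight times."""
    counts = dict.fromkeys(vocabulary, 0.0)
    for term in title:
        if term in counts:
            counts[term] += title_weight
    for term in body:
        if term in counts:
            counts[term] += 1.0
    return [counts[term] for term in vocabulary]


def centroids_from_labels(vectors, labels):
    """Mean vector of each first-step cluster, ordered by cluster label."""
    groups = {}
    for vec, label in zip(vectors, labels):
        groups.setdefault(label, []).append(vec)
    return [[sum(col) / len(members) for col in zip(*members)]
            for _, members in sorted(groups.items())]


def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans(vectors, centers, max_iter=100):
    """Lloyd's algorithm started from the supplied centers instead of random ones."""
    centers = [list(c) for c in centers]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centers)), key=lambda k: squared_distance(v, centers[k]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the centers are too
        assignment = new_assignment
        for k in range(len(centers)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:  # keep the previous center if a cluster becomes empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


# Toy data (hypothetical): four posts, their participants, and their terms.
participants = [{"ann", "bob", "cat"}, {"ann", "bob"}, {"dan", "eve"}, {"dan", "eve", "fay"}]
corr = correlation_matrix(cooccurrence_matrix(participants))

first_step_labels = [0, 0, 1, 1]  # placeholder for the AP clustering of `corr`

vocabulary = ["cluster", "kmeans", "recipe", "soup"]
posts = [
    (["cluster"], ["kmeans", "cluster"]),
    (["kmeans"], ["cluster"]),
    (["soup"], ["recipe"]),
    (["recipe"], ["soup", "soup"]),
]
vectors = [position_weighted_vector(title, body, vocabulary) for title, body in posts]
seeds = centroids_from_labels(vectors, first_step_labels)
labels, centers = kmeans(vectors, seeds)
print(labels)
```

Seeding K-means with centroids from the first step removes the dependence on random initial centers, which is the design point the abstract highlights. In practice the placeholder labels would come from an AP implementation run on the correlation matrix.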
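Received: 2016-05-04      Published: 2016-11-23
Funding: *This work is one of the research outcomes of the National Natural Science Foundation of China General Program project "An Integrated Framework for Link Analysis Theory from the Perspective of Fused Paradigms and Its Empirical Study" (Grant No. 71273125) and of a cooperative research project of the Institute of Scientific and Technical Information of China.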
Cite this article:
Gong Kaile, Cheng Ying, Sun Jianjun. Clustering Blog Posts with Co-occurrence Analysis[J]. New Technology of Library and Information Service, 2016, 32(10): 50-58.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.10.06      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2016/V32/I10/50