Data Analysis and Knowledge Discovery, 2018, Vol. 2, Issue 12: 52-59     https://doi.org/10.11925/infotech.2096-3467.2018.0415
Research Paper
Identifying Trending Topics in Q&A Community with CART Decision Tree*
Cheng Xiufeng1, Zhang Xinyi2, Wang Ning2
1Institute of Scientific and Technical Information of China, Beijing 100038, China
2School of Information Management, Central China Normal University, Wuhan 430079, China
Abstract

[Objective] This paper aims to help decision-making agencies monitor and manage online public opinion by detecting trending topics that may become the focus of that opinion. [Methods] First, we propose criteria, and the rationale behind them, for identifying trending topics in online Q&A communities. Then, we conduct an empirical study of the identification process on the Zhihu Q&A community using the CART decision tree algorithm. [Results] For online Q&A communities, the CART decision tree shows good accuracy and applicability in identifying and predicting trending topics. [Limitations] The experimental data cover only a small portion of the topic sections on Zhihu. The data set needs to be expanded to further validate the method. [Conclusions] The proposed method based on the CART decision tree can effectively predict trending topics in online Q&A communities and can serve as a reference for their hot-topic screening mechanisms.

Key words: Decision Tree; Q&A Community; Trending Topics
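Received: 2018-04-13      Published: 2019-01-16
CLC number (ZTFLH):  G25
Funding: *This paper is one of the research outcomes of the National Natural Science Foundation of China Young Scientists Fund project "QSIM-Based Simulation of Library Mobile User Group Behavior and Guidance of Learning Interest" (Grant No. 7150309) and the Ministry of Education Humanities and Social Sciences Research Youth Fund project "Library User Behavior Discovery and Knowledge Recommendation in the Mobile Environment" (Grant No. 14YJC870004).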
Cite this article:
Cheng Xiufeng, Zhang Xinyi, Wang Ning. Identifying Trending Topics in Q&A Community with CART Decision Tree[J]. Data Analysis and Knowledge Discovery, 2018, 2(12): 52-59.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.0415      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2018/V2/I12/52
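Feature                    First-level criterion   Second-level criterion
① Strong appeal            A Question focus        A1 Number of views
② High participation                               A2 Number of followers
③ Large influence                                  A3 Number of answers
④ Content diversity        B Question cohesion     B1 Answer similarity
⑤ Short time intervals                             B2 Granularity feature value
⑥ Presence of key nodes    C Question impact       C1 User criticality
⑦ Fast propagation
  Correspondence between the features of the trending-topic identification pattern and the identification criteria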
Title           Focus       Cohesion   Impact    Is   Order
Tree1.Topic1    1,379,560   8.99       502,643   1    25
Tree1.Topic2    9,495       9.81       83,591    1    49
Tree1.Topic3    356,204     8.04       522,595   1    55
Tree1.Topic4    51,740      8.23       89,478    1    63
Tree1.Topic5    347,185     9.28       994,162   1    93
Tree1.Topic6    3,496       1.98       8,874     1    94
Tree1.Topic7    8,538       3.64       597       1    96
Tree1.Topic8    4,361       4.41       10,818    1    99
Tree1.Topic9    56,159      3.33       93,735    1    110
Tree1.Topic10   35,877      1.82       21,288    1    115
Tree1.Topic11   6,600       5.86       5,318     1    118
Tree1.Topic12   403,128     8.66       97,756    1    121
Tree1.Topic13   4,249       1.89       1,108     1    124
Tree1.Topic14   703,195     15.20      52,308    1    128
Tree1.Topic15   1,327       4.31       2,760     1    136
  Preprocessed question data T1
Title           Focus       Cohesion   Impact    Is   Order
Tree1.Topic16   109,452     15.91      109,622   0    137
Tree1.Topic17   95          0          0         0    139
Tree1.Topic18   648         7.18       457       0    145
Tree1.Topic19   5,068       3.49       111       0    149
Tree1.Topic20   950         3.27       11,670    0    153
Tree1.Topic21   801         1.53       46        0    159
Tree1.Topic22   1,472       1.97       44        0    163
Tree1.Topic23   791         2.37       586       0    164
Tree1.Topic24   426         1.83       12,650    0    173
Tree1.Topic25   281         1.85       68        0    180
Tree1.Topic26   871         3.39       5,181     0    203
Tree1.Topic27   1,196       2.11       144,588   0    207
Tree1.Topic28   576         3.13       9,949     0    209
Tree1.Topic29   408         1.95       16,350    0    213
Tree1.Topic30   463         2.46       465       0    234
  Preprocessed question data T2
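The page does not reproduce how the trees were grown, so the following is only a minimal sketch of the step that CART classification usually relies on: choosing the binary split that minimizes the weighted Gini impurity. It assumes that the `Is` column is a 0/1 label marking a trending topic, and it uses only eight rows copied from the two Tree1 tables. The names `ROWS`, `gini`, and `best_split` are hypothetical, and a single split is not a full tree.

```python
from typing import List, Optional, Sequence, Tuple

# One record: (focus, cohesion, impact, label). The label is assumed to be
# the "Is" column of the tables (1 = trending, 0 = not trending).
Row = Tuple[float, float, float, int]
FEATURES = ("Focus", "Cohesion", "Impact")

# A small subset of rows copied from the Tree1 tables, to keep the sketch short.
ROWS: List[Row] = [
    (1379560, 8.99, 502643, 1),  # Tree1.Topic1
    (9495, 9.81, 83591, 1),      # Tree1.Topic2
    (3496, 1.98, 8874, 1),       # Tree1.Topic6
    (1327, 4.31, 2760, 1),       # Tree1.Topic15
    (95, 0.0, 0, 0),             # Tree1.Topic17
    (801, 1.53, 46, 0),          # Tree1.Topic21
    (281, 1.85, 68, 0),          # Tree1.Topic25
    (463, 2.46, 465, 0),         # Tree1.Topic30
]


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a collection of 0/1 labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_split(rows: Sequence[Row]) -> Optional[Tuple[float, int, float]]:
    """Return (weighted_gini, feature_index, threshold) for the best split
    of the form `feature <= threshold`, or None if no split is possible."""
    best: Optional[Tuple[float, int, float]] = None
    n = len(rows)
    for j in range(len(FEATURES)):
        values = sorted({r[j] for r in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r[3] for r in rows if r[j] <= t]
            right = [r[3] for r in rows if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, t)
    return best


if __name__ == "__main__":
    result = best_split(ROWS)
    if result is not None:
        score, j, t = result
        print(f"split on {FEATURES[j]} <= {t} (weighted Gini {score})")
```

On this eight-row subset the search settles on `Focus <= 1064.0`, which separates the two classes with a weighted Gini of 0. That is a property of the hand-picked subset, not a result reported for the full data.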
Title           Focus       Cohesion   Impact    Is   Order
Tree2.Topic1    109,452     15.91      109,622   1    1
Tree2.Topic2    403,128     8.66       97,756    1    8
Tree2.Topic3    14,593      5.26       2,347     1    15
Tree2.Topic4    2,357       3.28       36,327    1    22
Tree2.Topic5    217         4.29       2,751     1    29
Tree2.Topic6    233         3.92       1,178     1    36
Tree2.Topic7    165         4.00       700       1    43
Tree2.Topic8    82          4.36       1,182     1    50
Tree2.Topic9    3,496       1.98       8,874     1    57
Tree2.Topic10   151         2.77       1,156     1    64
Tree2.Topic11   170         3.03       390       1    71
Tree2.Topic12   426         1.82       12,650    1    78
Tree2.Topic13   294         3.46       59        1    85
Tree2.Topic14   135         2.98       246       1    92
Tree2.Topic15   141         2.54       322       1    99
  Preprocessed question data T3
Title           Focus       Cohesion   Impact    Is   Order
Tree2.Topic16   156         1.82       8,982     0    106
Tree2.Topic17   102         3.78       51        0    113
Tree2.Topic18   309         1.64       1,141     0    120
Tree2.Topic19   865         1.34       2,178     0    127
Tree2.Topic20   161         2.68       39        0    134
Tree2.Topic21   75          3.26       54        0    141
Tree2.Topic22   187         2.68       27        0    148
Tree2.Topic23   87          2.04       169       0    155
Tree2.Topic24   57          1.81       1,350     0    162
Tree2.Topic25   117         2.03       47        0    169
Tree2.Topic26   56          1.78       405       0    176
Tree2.Topic27   31          1.91       1,091     0    183
Tree2.Topic28   130         1.99       18        0    190
Tree2.Topic29   59          1.76       239       0    197
Tree2.Topic30   93          1.71       64        0    204
  Preprocessed question data T4
  Decision tree Tree1
  Decision tree Tree2
  Comparison of the accuracy of the two decision trees in predicting trending topics
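The figures themselves are not recoverable from this page. As a rough illustration of how an accuracy comparison between two classifiers can be computed, the sketch below scores two one-split rules on a few rows copied from the T3 and T4 tables. Both rules and their thresholds are hypothetical stand-ins chosen for illustration, not Tree1 or Tree2 from the paper. The `Is` column is assumed to be a 0/1 trending label.

```python
from typing import Callable, List, Sequence, Tuple

# One record: (focus, cohesion, impact, label), where the label is assumed
# to be the "Is" column (1 = trending, 0 = not trending).
Row = Tuple[float, float, float, int]

# A few rows copied from the T3 and T4 tables, used as an evaluation sample.
SAMPLE: List[Row] = [
    (14593, 5.26, 2347, 1),  # Tree2.Topic3
    (217, 4.29, 2751, 1),    # Tree2.Topic5
    (294, 3.46, 59, 1),      # Tree2.Topic13
    (156, 1.82, 8982, 0),    # Tree2.Topic16
    (102, 3.78, 51, 0),      # Tree2.Topic17
    (93, 1.71, 64, 0),       # Tree2.Topic30
]


def accuracy(predict: Callable[[Row], int], rows: Sequence[Row]) -> float:
    """Fraction of rows whose predicted label equals the recorded label."""
    correct = sum(1 for row in rows if predict(row) == row[3])
    return correct / len(rows)


def rule_focus(row: Row) -> int:
    """Hypothetical rule: predict trending when Focus exceeds 1000."""
    return 1 if row[0] > 1000 else 0


def rule_cohesion(row: Row) -> int:
    """Hypothetical rule: predict trending when Cohesion exceeds 3.0."""
    return 1 if row[1] > 3.0 else 0


if __name__ == "__main__":
    for name, rule in (("focus rule", rule_focus), ("cohesion rule", rule_cohesion)):
        print(f"{name}: accuracy = {accuracy(rule, SAMPLE):.2f}")
```

The same `accuracy` function would apply unchanged to any fitted tree that maps a row to a 0/1 prediction. Only the `predict` argument changes.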