Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (10): 70-79    DOI: 10.11925/infotech.2096-3467.2020.0361
Detecting News Topics Based on Equalized Paragraph and Sub-topic Vector
Wei Jiaze1, Dong Cheng1, He Yanqing1, Liu Zhihui1, Peng Keyun2
1Institute of Scientific and Technical Information of China, Beijing 100038, China
2Science and Technology Bureau of Ganzi Prefecture, Kangding 626000, China
Abstract  

[Objective] This paper proposes a model for detecting the topics of trending news stories, aiming to improve the news-reading experience. [Methods] We modified the TF-IDF method by weighting terms over balanced paragraphs (WTF-IDF). We also improved the K-means clustering model by using sub-topic vectors in hierarchical clustering. Finally, we extracted high-frequency words from titles to describe each topic. [Results] With three extracted keywords, the F1 value of our model was 5.4% higher than that of the TF-IDF method. The accuracy of hierarchical clustering based on WTF-IDF and sub-topic vectors was 3.1% higher than that of single-layer K-means clustering. [Limitations] Our model does not include a phrase extraction method, and the hierarchical clustering method is complex. [Conclusions] The proposed method can effectively detect the topics of trending news reports.
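
The abstract does not give the WTF-IDF formula, so the following is only a minimal sketch of what a paragraph-balanced weighting could look like, not the authors' implementation. It assumes that each paragraph contributes a length-normalized term frequency (so that one long paragraph cannot dominate), that the title is treated as an extra paragraph with a hypothetical `title_weight`, and that a smoothed IDF is used. All function and parameter names are illustrative.

```python
import math
from collections import Counter


def weighted_tf(title, paragraphs, title_weight=2.0):
    """Sum length-normalized term frequencies over the title and paragraphs.

    Assumption: every paragraph gets an equal share of the total weight,
    and the title gets `title_weight` shares (a hypothetical parameter).
    """
    tf = Counter()
    if title:
        for term, count in Counter(title).items():
            tf[term] += title_weight * count / len(title)
    for para in paragraphs:
        if not para:
            continue  # skip empty paragraphs to avoid division by zero
        for term, count in Counter(para).items():
            tf[term] += count / len(para)
    return tf


def idf(corpus):
    """Smoothed inverse document frequency over a corpus of token lists."""
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    return {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}


def top_keywords(title, paragraphs, idf_table, n=3):
    """Return the n highest-scoring terms; ties are broken alphabetically."""
    tf = weighted_tf(title, paragraphs)
    scores = {t: w * idf_table.get(t, 1.0) for t, w in tf.items()}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:n]]


# Toy, already-tokenized input (made up for illustration).
title = ["huawei", "sanctions"]
paragraphs = [
    ["huawei", "chip", "supply", "ban"],
    ["the", "ban", "affects", "huawei", "chip", "orders",
     "and", "the", "supply", "chain", "the", "most"],
]
corpus = [
    title + [t for p in paragraphs for t in p],
    ["the", "fire", "damaged", "the", "cathedral"],
    ["the", "talks", "on", "trade"],
]
idf_table = idf(corpus)
keywords = top_keywords(title, paragraphs, idf_table, n=3)
print(keywords)
```

In this toy input the word "the" occurs three times in the long second paragraph, yet it ranks below the title terms because its share is normalized by paragraph length and its IDF is low.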

Key words: Equalized Paragraph; Sub-topic Vector; Hot Topic Detection; Hierarchical Clustering
Received: 27 April 2020      Published: 09 November 2020
ZTFLH:  TP391  
Corresponding Authors: He Yanqing     E-mail: heyq@istic.ac.cn

Cite this article:

Wei Jiaze,Dong Cheng,He Yanqing,Liu Zhihui,Peng Keyun. Detecting News Topics Based on Equalized Paragraph and Sub-topic Vector. Data Analysis and Knowledge Discovery, 2020, 4(10): 70-79.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.0361     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I10/70

Hot Topic Detection
Hierarchical Clustering
Keyword Extraction Effect of Three Methods
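Table: Title and Balanced Paragraph Effect (P = precision, R = recall; N = number of extracted keywords)

| Setting | P (N=3) | R (N=3) | F1 (N=3) | P (N=5) | R (N=5) | F1 (N=5) | P (N=7) | R (N=7) | F1 (N=7) | P (N=10) | R (N=10) | F1 (N=10) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P1 | 0.367 | 0.367 | 0.367 | 0.303 | 0.425 | 0.351 | 0.250 | 0.487 | 0.327 | 0.199 | 0.552 | 0.290 |
| P2 | 0.392 | 0.392 | 0.392 | 0.325 | 0.458 | 0.377 | 0.266 | 0.519 | 0.348 | 0.210 | 0.583 | 0.306 |
| P3 | 0.408 | 0.408 | 0.408 | 0.323 | 0.454 | 0.374 | 0.259 | 0.505 | 0.339 | 0.200 | 0.556 | 0.291 |
| P4 | 0.392 | 0.392 | 0.392 | 0.305 | 0.430 | 0.354 | 0.244 | 0.477 | 0.320 | 0.190 | 0.528 | 0.276 |
| P5 | 0.402 | 0.402 | 0.402 | 0.326 | 0.458 | 0.377 | 0.263 | 0.511 | 0.344 | 0.208 | 0.577 | 0.303 |
| P6 | 0.394 | 0.394 | 0.394 | 0.324 | 0.456 | 0.375 | 0.263 | 0.512 | 0.344 | 0.208 | 0.578 | 0.303 |
| P7 | 0.384 | 0.384 | 0.384 | 0.305 | 0.433 | 0.355 | 0.252 | 0.495 | 0.331 | 0.199 | 0.555 | 0.290 |
| P8 | 0.363 | 0.363 | 0.363 | 0.289 | 0.411 | 0.336 | 0.238 | 0.467 | 0.312 | 0.188 | 0.524 | 0.274 |
| P9 | 0.415 | 0.415 | 0.415 | 0.325 | 0.459 | 0.377 | 0.263 | 0.514 | 0.344 | 0.209 | 0.579 | 0.304 |
| P10 | 0.421 | 0.421 | 0.421 | 0.336 | 0.474 | 0.390 | 0.269 | 0.527 | 0.353 | 0.213 | 0.592 | 0.310 |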
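
For readers who want to reproduce metrics of the kind reported in the P, R, and F1 columns above, the sketch below computes precision, recall, and F1 for the top N extracted keywords of each document and then averages them. The averaging scheme is an assumption (per-document macro-averaging), since this page does not state how the reported values were aggregated, and the function names and toy data are illustrative only.

```python
def prf_at_n(predicted, gold, n):
    """Precision, recall, and F1 of the top-n predicted keywords for one document."""
    top = predicted[:n]
    hits = len(set(top) & set(gold))
    p = hits / len(top) if top else 0.0
    r = hits / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1


def macro_prf(predictions, golds, n):
    """Average the per-document scores (assumed macro-averaging)."""
    rows = [prf_at_n(p, g, n) for p, g in zip(predictions, golds)]
    if not rows:
        return 0.0, 0.0, 0.0
    return tuple(sum(col) / len(rows) for col in zip(*rows))


# Toy data: ranked keyword predictions and manually assigned keywords.
predictions = [["a", "b", "c", "d", "e"], ["x", "y", "z", "w", "v"]]
golds = [["a", "c", "q"], ["y", "k", "m"]]

scores_at_3 = macro_prf(predictions, golds, n=3)
scores_at_5 = macro_prf(predictions, golds, n=5)
print(scores_at_3, scores_at_5)
```

When every document has exactly N gold keywords, precision and recall coincide at that N, which matches the pattern of equal P, R, and F1 values in the N=3 columns. Raising N can only keep or increase the number of hits, so recall tends to rise while precision falls, as the table shows.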
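Table: Number of News by Topic

| Topic | Number of news articles |
|---|---|
| Notre-Dame de Paris fire | 44 |
| Mercedes-Benz oil leak incident | 31 |
| Boeing 737 crash | 42 |
| Sanctions on Huawei | 185 |
| Visual China copyright controversy | 100 |
| Sri Lanka serial bombings | 93 |
| Conference on Dialogue of Asian Civilizations | 176 |
| Brexit | 57 |
| Zhai Tianlin academic credentials incident | 102 |
| China-US trade war | 232 |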
Clustering Effect
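Table: Topic Description Effect

| Manual topic description | Automatic topic description |
|---|---|
| Notre-Dame de Paris fire | Notre-Dame fire warning; Notre-Dame de Paris; Notre-Dame laser modeling |
| Mercedes-Benz oil leak incident | Female Mercedes-Benz owner defends her rights; When will the chaos of auto financing service fees end |
| Boeing 737 crash | Boeing CEO open letter; 737MAX |
| Sanctions on Huawei | Huawei HiSilicon president late at night; China's chip breakout battle; US chips |
| Visual China copyright controversy | Visual China copyright incident; Black hole photo copyright under attack |
| Sri Lanka serial bombings | Sri Lanka serial bombing attacks; Serial bombing suspects |
| Conference on Dialogue of Asian Civilizations | Opening ceremony of the Conference on Dialogue of Asian Civilizations; Conference on Dialogue of Asian Civilizations; Keynote at the opening ceremony of the Civilizations Dialogue Conference |
| Brexit | Brexit; Brexit succeeds; Domestic gold futures fall |
| Zhai Tianlin academic credentials incident | Zhai Tianlin incident ferments again; Academic misconduct requires reform |
| China-US trade war | US containment policy toward China; Tariff stick harms others and oneself; China-US trade war heats up |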
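
The abstract says that topic descriptions are built from high-frequency words in news titles. The exact procedure is not given on this page, so the following is only a rough sketch under simple assumptions: titles are already tokenized, each word is counted once per title, and the `k` most frequent words form the description. The function name, the `stopwords` parameter, and the toy titles are hypothetical.

```python
from collections import Counter


def describe_topic(tokenized_titles, k=3, stopwords=frozenset()):
    """Return the k words that appear in the most titles of one cluster.

    Each word is counted once per title; ties are broken alphabetically
    so that the output is deterministic.
    """
    counts = Counter()
    for tokens in tokenized_titles:
        counts.update({t for t in tokens if t not in stopwords})
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]


# Toy cluster of tokenized titles (made up for illustration).
cluster_titles = [
    ["brexit", "vote", "delayed"],
    ["brexit", "deal", "vote"],
    ["gold", "futures", "fall", "after", "brexit", "vote"],
]
description = describe_topic(cluster_titles, k=2, stopwords=frozenset({"after"}))
print(description)
```

A word-level count like this yields single words, whereas the automatic descriptions in the table above read as short phrases, which is consistent with the limitation noted in the abstract that phrase extraction is not part of the model.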