1. Institute of Scientific and Technical Information of China, Beijing 100038, China
2. Science and Technology Bureau of Ganzi Prefecture, Kangding 626000, China
[Objective] This paper proposes a model for detecting the topics of trending news stories, with the aim of improving the news-reading experience. [Methods] We modified the TF-IDF method by adding balanced paragraph weighting (WTF-IDF). We also improved the K-means clustering model by using sub-topic vectors in hierarchical clustering. Finally, we used the new model to extract high-frequency words from titles. [Results] With three extracted keywords, the F1 value of our model was 5.4% higher than that of the TF-IDF method. The accuracy of hierarchical clustering based on WTF-IDF and sub-topic vectors was 3.1% higher than that of single-layer K-means clustering. [Limitations] Our model does not include a phrase extraction method, and the hierarchical clustering method is complex. [Conclusions] The proposed method can effectively detect the topics of trending news reports.
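The abstract does not spell out the WTF-IDF weighting scheme, so the following is only a minimal sketch of the general idea of paragraph-weighted TF-IDF, not the authors' method. It assumes each document is already tokenized into paragraphs, that a term occurrence is weighted by the paragraph it appears in, and that IDF uses a common smoothed form. The function names (`lead_weight`, `weighted_tf_idf`, `top_keywords`) and the rule that the lead paragraph counts double are hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter


def lead_weight(index, n_paragraphs):
    """Hypothetical paragraph weight: the lead paragraph counts double."""
    return 2.0 if index == 0 else 1.0


def weighted_tf_idf(docs, para_weight=lead_weight):
    """Score terms with a paragraph-weighted TF and a smoothed IDF.

    docs: list of documents; each document is a list of paragraphs;
          each paragraph is a list of tokens.
    Returns one {token: score} dict per document.
    """
    n_docs = len(docs)

    # Document frequency: number of documents containing each token.
    df = Counter()
    for doc in docs:
        df.update({tok for para in doc for tok in para})

    all_scores = []
    for doc in docs:
        # Weighted term frequency: each occurrence adds its paragraph weight.
        wtf = Counter()
        for i, para in enumerate(doc):
            w = para_weight(i, len(doc))
            for tok in para:
                wtf[tok] += w
        total = sum(wtf.values()) or 1.0

        # Smoothed IDF (assumed form): log((1 + N) / (1 + df)) + 1.
        all_scores.append({
            tok: (count / total) * (math.log((1 + n_docs) / (1 + df[tok])) + 1.0)
            for tok, count in wtf.items()
        })
    return all_scores


def top_keywords(scores, k=3):
    """Return the k highest-scoring tokens; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:k]]


# Toy corpus (invented for illustration): two short "news stories".
docs = [
    [["quake", "hits", "city"], ["rescue", "teams", "arrive"]],
    [["market", "falls"], ["rescue", "plan", "announced"]],
]
scores = weighted_tf_idf(docs)
print(top_keywords(scores[0], k=3))
```

Passing `para_weight=lambda i, n: 1.0` recovers an unweighted TF-IDF in this sketch, which makes it easy to compare the keyword rankings with and without paragraph weighting on the same corpus.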