Detecting News Topics Based on Equalized Paragraph and Sub-topic Vector
Wei Jiaze1, Dong Cheng1, He Yanqing1, Liu Zhihui1, Peng Keyun2
1 Institute of Scientific and Technical Information of China, Beijing 100038, China
2 Science and Technology Bureau of Ganzi Prefecture, Kangding 626000, China

Abstract [Objective] This paper proposes a model for detecting the topics of trending news stories, aiming to improve the news-reading experience for users. [Methods] We modified the TF-IDF method by adding a weighting based on balanced paragraphs (WTF-IDF). We also improved the K-means clustering model by using sub-topic vectors in hierarchical clustering. Finally, we extracted high-frequency words from titles with the new model. [Results] With three extracted keywords, the F1 value of our model was 5.4% higher than that of the TF-IDF method. The accuracy of hierarchical clustering based on WTF-IDF and sub-topic vectors was 3.1% higher than that of single-layer K-means clustering. [Limitations] Our model does not include a phrase extraction method, and the hierarchical clustering method is complex. [Conclusions] The proposed method can effectively detect the topics of trending news reports.
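
The F1 comparison reported in the abstract depends on how the extracted keywords are scored against reference keywords. The abstract does not state the matching rule, so the following is only a minimal sketch that assumes exact-match scoring of the top-k extracted keywords against a reference set. The function name `keyword_f1` and the toy data are hypothetical and are not taken from the paper.

```python
def keyword_f1(extracted, gold, k=3):
    """Return (precision, recall, F1) for the top-k extracted keywords.

    Assumption: a keyword counts as correct only if it exactly matches
    a reference (gold) keyword; the paper may use a different rule.
    """
    top_k = list(extracted)[:k]          # keep only the k highest-ranked keywords
    gold_set = set(gold)
    hits = len(set(top_k) & gold_set)    # correctly extracted keywords

    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0

    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy example: ranked keywords from one article vs. its reference keywords
    ranked = ["election", "weather", "senate", "budget"]
    reference = ["election", "senate", "vote", "campaign"]
    print(keyword_f1(ranked, reference, k=3))
```

With k=3, two of the three extracted keywords in the toy data match the four reference keywords, so precision is 2/3 and recall is 1/2.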

Received: 27 April 2020
Published: 09 November 2020

Corresponding Author: He Yanqing
E-mail: heyq@istic.ac.cn