Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (11): 72-78    DOI: 10.11925/infotech.2096-3467.2022.0115
Selecting Optimal LDA Numbers to Identify News Topics
Yang Yang, Jiang Kaizhong, Yuan Mingjun, Hui Lanxin
School of Mathematics and Statistics, Shanghai University of Engineering Science, Shanghai 201620, China
Abstract  

[Objective] This paper proposes an adaptive method for determining the optimal number of topics for the LDA model, aiming to identify news topics effectively. [Methods] First, we extracted the required data from news texts from the semantic and time-series perspectives and constructed the corresponding feature vectors. Then, we used the Co-DPSC algorithm to collaboratively train the two views and obtained a semantic feature matrix that incorporates timing effects. Finally, we reduced the dimension of the matrix and applied density peak clustering to its rows, which yielded the optimal number of topics. [Results] Compared with the perplexity-based method, the precision and F value of the proposed method improved by 35.09 and 15.39 percentage points, respectively. [Limitations] We only clustered keywords from news, and the method still needs to be examined with datasets from other fields. [Conclusions] The proposed method can provide a better number of topics for the LDA model.
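
The final step of the method, density peak clustering, can be illustrated with a minimal sketch. It follows the general idea of Rodriguez and Laio's algorithm (a point is a cluster center when it has high local density and lies far from any denser point) and is not the authors' implementation: the Co-DPSC co-training and the dimension reduction are omitted, and the function name `density_peak_centers` and the parameters `d_c` and `gamma_threshold` are hypothetical names chosen for illustration. Under these assumptions, the number of selected centers would play the role of the topic number.

```python
import numpy as np


def density_peak_centers(points, d_c, gamma_threshold):
    """Pick density-peak cluster centers among the rows of `points`.

    d_c and gamma_threshold are illustrative parameters (assumptions),
    not values taken from the paper.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)

    # Pairwise Euclidean distances between all rows.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Local density rho_i: number of other rows closer than the cutoff d_c
    # (subtract 1 to exclude the row itself, whose distance is 0).
    rho = (dist < d_c).sum(axis=1) - 1

    # delta_i: distance to the nearest row of higher density. Ties in rho
    # are broken by row order; the densest row gets its maximum distance.
    order = np.argsort(-rho, kind="stable")
    delta = np.zeros(n)
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = dist[i].max()
        else:
            delta[i] = dist[i, order[:rank]].min()

    # Rows with both high density and large delta stand out in the
    # decision graph; gamma = rho * delta summarizes that in one score.
    gamma = rho * delta
    centers = np.where(gamma > gamma_threshold)[0]
    return centers, rho, delta


# Toy data: two tight groups of five rows each, far apart from each other.
offsets = [(0, 0), (0.1, 0), (0, 0.1), (-0.1, 0), (0, -0.1)]
toy = [(x, y) for x, y in offsets] + [(5 + x, 5 + y) for x, y in offsets]
centers, rho, delta = density_peak_centers(toy, d_c=0.12, gamma_threshold=1.0)
print("number of centers:", len(centers))
```

In practice the threshold on `gamma` is usually chosen by inspecting the decision graph (density against distance) rather than fixed in advance.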

Key words: LDA Model; News Topics; Multi-View Clustering
Received: 14 February 2022      Published: 13 January 2023
ZTFLH: TP393; G250
Fund: National Statistical Science Research Project of China (2020LY080)
Corresponding Authors: Yang Yang     E-mail: yy_5ten8@126.com

Cite this article:

Yang Yang, Jiang Kaizhong, Yuan Mingjun, Hui Lanxin. Selecting Optimal LDA Numbers to Identify News Topics. Data Analysis and Knowledge Discovery, 2022, 6(11): 72-78.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.0115     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I11/72

Flow Chart
Clustering Decision Graph
Visualization of Clustering Results
Perplexity for the Number of Different Topics
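Topic | Keywords
Topic1 | company, match, player, fund, market
Topic2 | automobile, market, Beijing, consumer, company
Topic3 | market, growth, sales volume, development, year-on-year
Topic4 | company, time, match, Russia, market
Topic5 | company, car model, time, market, situation
Topic6 | match, car model, plan, European Championship, interest rate
Topic7 | match, automobile, market, development, time
Topic8 | match, Syria, situation, impact, bank
Topic9 | market, company, price, fund, automobile
Topic10 | Spain, match, training, India, construction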
Extraction Results Based on Multi-View Clustering
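Topic | Keywords
Topic1 | wood carving, market, situation, blood, match
Topic2 | men, teacher-appreciation banquet, hotel, international, development
Topic3 | march, department, public, project, funding
Topic4 | consumer, price, peanut oil, society, transaction
Topic5 | car model, market, renminbi, bank, student
Topic6 | officers and soldiers, market, escort, navy, kindergarten teacher
Topic7 | kindergarten, market, product, interest rate, parent
Topic8 | match, Monroe, player, European Championship, impact
Topic9 | automobile, India, bread, car model, Beijing
Topic10 | missile, United States, interest rate, company, occur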
Extraction Results Based on Perplexity
Method | T_extract | T_correct | T_standard | Precision P/% | Recall R/% | F value/%
Semantics- and time-series-based method | 6 | 4 | 7 | 66.67 | 57.14 | 61.54
Perplexity-based method | 19 | 6 | 7 | 31.58 | 85.71 | 46.15
Method Performance
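
The figures in the performance table can be reproduced from its three topic counts. The sketch below assumes the usual definitions, which the page does not state explicitly but which match the reported numbers: precision is the number of correct topics over the number of extracted topics, recall is the number of correct topics over the number of standard topics, and the F value is their harmonic mean. The function name `evaluate` is a hypothetical helper.

```python
def evaluate(t_extract, t_correct, t_standard):
    """Return (precision, recall, F value) in percent, rounded to 2 decimals.

    Assumed definitions (inferred from the table, not stated in the source):
      precision = t_correct / t_extract
      recall    = t_correct / t_standard
      F value   = harmonic mean of precision and recall
    """
    precision = t_correct / t_extract
    recall = t_correct / t_standard
    f_value = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * v, 2) for v in (precision, recall, f_value))


# Counts taken from the two rows of the table.
print(evaluate(6, 4, 7))    # semantics- and time-series-based method
print(evaluate(19, 6, 7))   # perplexity-based method
```

The gap between the two methods in precision (66.67 against 31.58) and F value (61.54 against 46.15) corresponds to the 35.09 and 15.39 reported in the abstract.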