Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (3): 109-120    DOI: 10.11925/infotech.2096-3467.2020.0765
Topic Recognition of News Reports with Imbalanced Contents
Wang Hongbin1,2, Wang Jianxiong1,2, Zhang Yafei1,2, Yang Heng3
1Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
2Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China
3Yun Nan Wei Heng Ji Ye Co., Ltd., Kunming 650000, China
Abstract  

[Objective] This paper proposes a topic recognition method for news datasets in which the number of reports varies widely across topics, aiming to address the inaccurate topic recognition of the traditional LDA model on such data. [Methods] First, we modified the LDA model with three feature detection methods: independence detection, variance detection, and information entropy detection. Then, we identified news topics with the modified model. [Results] We evaluated the model on a dataset of 10,000 news reports. Compared with the traditional LDA topic recognition method, the recall, precision, and F1 values of the proposed method improved by 0.2121, 0.0407, and 0.1520, respectively. [Limitations] Because the corpus contains many new words, word segmentation accuracy was limited, which affected the performance of news topic recognition. [Conclusions] The proposed method can effectively identify news topics from reports with imbalanced contents.

Key words: Topic Imbalanced; News Text Data Set; Topic Recognition; Latent Dirichlet Allocation (LDA)
Received: 05 August 2020      Published: 12 April 2021
ZTFLH: TP393; G250
Fund: National Natural Science Foundation of China (61966020); National Natural Science Foundation of China (61762056); Yunnan Provincial Major Science and Technology Special Plan Projects (2018ZF019)
Corresponding Authors: Zhang Yafei     E-mail: zyfeimail@163.com

Cite this article:

Wang Hongbin, Wang Jianxiong, Zhang Yafei, Yang Heng. Topic Recognition of News Reports with Imbalanced Contents. Data Analysis and Knowledge Discovery, 2021, 5(3): 109-120.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.0765     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I3/109

Topic Recognition Framework for Topic Imbalanced News Text Data
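Topic word | Probability
山东省 (Shandong Province) | 0.00914
疫苗 (vaccine) | 0.00714
长生 (Changsheng) | 0.00304
合格 (qualified) | 0.00208
疾控中心 (disease control center) | 0.00108
长春 (Changchun) | 0.00108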
Examples of High-Quality Topics
Topic word | Probability
服务 (service) | 0.03175
滴滴 (Didi) | 0.00139
疫苗 (vaccine) | 0.00138
长春 (Changchun) | 0.00133
顺风车 (hitch ride) | 0.00132
失信 (dishonesty) | 0.00077
Examples of Non-High-Quality Topics
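Topic | Probability | Representative words
Topic 1 (Changsheng Biotech fraud) | 0.3329 | 长生 (Changsheng), 账户 (account), 上市公司 (listed company), 关税 (tariff), 疫苗 (vaccine)
Topic 2 (Beijing rent increase) | 0.0293 | 租赁 (leasing), 房租 (rent), 租金 (rental fee), 北京 (Beijing), 上涨 (rise)
Topic 3 (noise topic) | 0.2364 | 项目 (project), 副董事长 (vice chairman), 产业园 (industrial park), 证券 (securities), 质押 (pledge)
Topic 4 (US tariff increase) | 0.0138 | 美国 (United States), 关税 (tariff), 特朗普 (Trump), 公安分局 (public security sub-bureau), 钢铁 (steel)
Topic 5 (noise topic) | 0.0085 | 洗衣机 (washing machine), 追加 (additional), 千亿美元 (hundred billion US dollars), 电子 (electronics), 美国政府 (US government)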
Determine the Final Topics of a Text
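Topic | Count | Topic | Count
Mo Huanjing executed | 38 | Release of the film 《大轰炸》 cancelled | 19
"Hongmao Medicinal Liquor incident" draws attention | 25 | Zhao Liying and Feng Shaofeng announce marriage | 53
US stocks plunge, more than 8 trillion in market value wiped out | 144 | China-US trade tariffs | 32
Hurricane Michael, the strongest US hurricane of the year, makes landfall in Florida | 6 | Rent price increase | 24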
Some Corpus Topics and the Number of Topic Texts
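Topic | Count | Topic | Count
Mo Huanjing executed | 19 | Didi hitch ride | 23
China-US trade tariffs | 19 | Tonghai earthquake, Yunnan | 3
Rent price increase | 17 | Changsheng Biotech vaccine fraud incident | 35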
A Few Corpus Topics and the Number of Topic Texts
Independence threshold interval | Precision | Recall | F1
(1×10^-5, 1×10^-1) | 0.9241 | 0.5172 | 0.6632
(1×10^-4, 1×10^-1) | 0.9241 | 0.5241 | 0.6689
(1×10^-4, 1×10^-2) | 0.8966 | 0.5448 | 0.6781
(1×10^-3, 1×10^-2) | 0.8276 | 0.5379 | 0.6520
Topic Independence Test
Variance threshold interval | Precision | Recall | F1
(0.01, 1.00) | 0.9793 | 0.4542 | 0.6206
(0.03, 0.50) | 0.9517 | 0.4828 | 0.6406
(0.05, 0.50) | 0.8966 | 0.5172 | 0.6560
(0.08, 0.50) | 0.8620 | 0.5103 | 0.6410
Topic Variance Test
Information entropy value | Precision | Recall | F1
0.8 | 0.9103 | 0.6690 | 0.7712
1.0 | 0.8690 | 0.7586 | 0.8101
1.1 | 0.8276 | 0.8138 | 0.8184
1.2 | 0.7793 | 0.8413 | 0.8091
Topic Information Entropy Test
Identify the Optimal Number of Topics Based on Perplexity
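The caption above refers to choosing the number of topics by perplexity. The following is a minimal sketch of that kind of selection, assuming the standard definition of perplexity (the exponential of the negative mean per-token log-likelihood). The helper names `perplexity` and `best_topic_count` and all numbers in `candidates` are hypothetical and are not taken from the paper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-(sum of per-token log-likelihoods) / token count)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def best_topic_count(perplexity_by_k):
    """Return the candidate topic count with the lowest perplexity."""
    return min(perplexity_by_k, key=perplexity_by_k.get)


# Hypothetical held-out perplexities for a few candidate topic counts.
# These values are made up for illustration only.
candidates = {5: 910.0, 10: 802.0, 15: 826.0, 20: 871.0}
best_k = best_topic_count(candidates)
print("selected number of topics:", best_k)

# Sanity check of the definition: four tokens, each with probability 0.25,
# give a perplexity of 4 (up to floating-point error).
print(perplexity([math.log(0.25)] * 4))
```

In practice the per-token log-likelihoods would come from a fitted LDA model evaluated on held-out text; that step is omitted here.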
Topic | Topic words (probability)
Topic 1 | 美国 (United States) 0.00551; 关税 (tariff) 0.00464; 特朗普 (Trump) 0.00375; 钢铁 (steel) 0.00297
Topic 2 | 滴滴 (Didi) 0.00634; 莫焕晶 (Mo Huanjing) 0.00451; 死刑 (death penalty) 0.00284; 顺风 (hitch ride) 0.00269
Topic 3 | 长春 (Changchun) 0.00271; 长生 (Changsheng) 0.00253; 疫苗 (vaccine) 0.00182; 公安分局 (public security sub-bureau) 0.00149
Topic 4 | 疫苗 (vaccine) 0.00385; 长生 (Changsheng) 0.00220; 地震 (earthquake) 0.00199; 账户 (account) 0.00159
Topic 5 | 租赁 (leasing) 0.00422; 房租 (rent) 0.00382; 租金 (rental fee) 0.00313; 上涨 (rise) 0.00268
Topic 6 | 办法 (measures) 0.00217; 判决 (verdict) 0.00217; 发表声明 (issue a statement) 0.00217; 发生冲突 (clash) 0.00217
Topic 7 | 莫焕晶 (Mo Huanjing) 0.00675; 死刑 (death penalty) 0.00423; 放火 (arson) 0.00349; 保姆 (nanny) 0.00248
Topic 8 | 疫苗 (vaccine) 0.00438; 我省 (our province) 0.00185; 长生 (Changsheng) 0.00185; 接种 (vaccination) 0.00182
Topic Recognition Results
Topic | Information entropy | Topic | Information entropy
Topic 1 | 1.13264 | Topic 5 | 1.47356
Topic 2 | 0.47996 | Topic 6 | 0.61804
Topic 3 | 1.30088 | Topic 7 | 1.29357
Topic 4 | 0.77411 | Topic 8 | 1.54274
Topic Information Entropy
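The entropy values above can be checked against the threshold of 1.1, which has the highest F1 in the information entropy test table. The sketch below assumes that a topic is retained when its entropy is at least the threshold. That rule is inferred from the tables on this page and is not stated here by the authors; `split_by_entropy` is a hypothetical helper name.

```python
# Entropy values copied from the "Topic Information Entropy" table.
topic_entropy = {
    1: 1.13264, 2: 0.47996, 3: 1.30088, 4: 0.77411,
    5: 1.47356, 6: 0.61804, 7: 1.29357, 8: 1.54274,
}

# Assumption: 1.1 is the threshold with the best F1 in the entropy test
# table, and the "keep if entropy >= threshold" rule is an inference.
ENTROPY_THRESHOLD = 1.1


def split_by_entropy(entropy_by_topic, threshold):
    """Split topic ids into (kept, dropped) by comparing entropy to threshold."""
    kept = sorted(t for t, h in entropy_by_topic.items() if h >= threshold)
    dropped = sorted(t for t, h in entropy_by_topic.items() if h < threshold)
    return kept, dropped


kept, dropped = split_by_entropy(topic_entropy, ENTROPY_THRESHOLD)
print("kept:", kept)
print("dropped:", dropped)
```

Under this assumption topics 1, 3, 5, 7, and 8 are kept. Their topic words are the same as those of the first five entries in the Final Topics table, which is consistent with the inferred rule.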
Evaluation item | Text 1 | Text 2 | Text 3 | Text 4
Final topic (probability) | Topic 1 (0.3283) | Topic 3 (0.3641) | Topic 3 (0.2923) | Topic 5 (0.0731)
Co-occurrence degree | 0.8439 | 0.2136 | 0.7855 | 0.7781
Identifying Whether a Text Is Fully Represented by LDA
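Topic | Representative words (probability)
Topic 1 | 万元 (ten thousand yuan) 0.00200; 公告 (announcement) 0.00192; 证券 (securities) 0.00192; 账户 (account) 0.00189; 项目 (project) 0.00187
Topic Recognition Results of the Second Round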
Topic | Information entropy
Topic 1 | 1.61396
Topic 2 | 0.71605
Topic 3 | 1.30088
Topic 4 | 0.50724
Topic 5 | 1.13015
Topic Information Entropy of the Second Round
Topic | Topic words (probability)
Topic 1 | 疫苗 (vaccine) 0.00415; 账户 (account) 0.00270; 项目 (project) 0.00260; 失信 (dishonesty) 0.00240
Topic 2 | 截图 (screenshot) 0.00284; 证券 (securities) 0.00237; 微博 (Weibo) 0.00226; 上市公司 (listed company) 0.00223
Topic 3 | 疫苗 (vaccine) 0.00415; 账户 (account) 0.00270; 项目 (project) 0.00260; 失信 (dishonesty) 0.00240
Topic 4 | 钟元 (Zhong Yuan) 0.00283; 补助 (subsidy) 0.00251; 监委 (supervisory commission) 0.00233; 长春 (Changchun) 0.00219
Topic 5 | 地震 (earthquake) 0.00593; 通海县 (Tonghai County) 0.00420; 云南省 (Yunnan Province) 0.00233; 发生 (occur) 0.00203
Topic Recognition Results of the Third Round
Topic | Topic words (probability)
Topic 1 | 美国 (United States) 0.00551; 关税 (tariff) 0.00464; 特朗普 (Trump) 0.00375; 钢铁 (steel) 0.00297
Topic 2 | 长春 (Changchun) 0.00271; 长生 (Changsheng) 0.00253; 疫苗 (vaccine) 0.00182; 公安分局 (public security sub-bureau) 0.00149
Topic 3 | 租赁 (leasing) 0.00422; 房租 (rent) 0.00382; 租金 (rental fee) 0.00313; 上涨 (rise) 0.00268
Topic 4 | 莫焕晶 (Mo Huanjing) 0.00675; 死刑 (death penalty) 0.00423; 放火 (arson) 0.00349; 保姆 (nanny) 0.00248
Topic 5 | 疫苗 (vaccine) 0.00438; 我省 (our province) 0.00185; 长生 (Changsheng) 0.00184; 接种 (vaccination) 0.00182
Topic 6 | 疫苗 (vaccine) 0.00415; 账户 (account) 0.00270; 项目 (project) 0.00260; 失信 (dishonesty) 0.00240
Topic 7 | 疫苗 (vaccine) 0.00415; 账户 (account) 0.00270; 项目 (project) 0.00260; 失信 (dishonesty) 0.00240
Topic 8 | 地震 (earthquake) 0.00593; 通海县 (Tonghai County) 0.00420; 云南省 (Yunnan Province) 0.00233; 发生 (occur) 0.00203
Topic 9 | 万元 (ten thousand yuan) 0.00200; 公告 (announcement) 0.00192; 证券 (securities) 0.00192; 账户 (account) 0.00189
Final Topics
Method | Precision | Recall | F1
LDA | 0.6250 | 0.5000 | 0.5556
LDA addressing the imbalance problem | 0.8889 | 1.0000 | 0.9412
Comparison of LDA Topic Extraction Effects by Different Methods
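The F1 column in this comparison can be reproduced from the precision and recall columns. The sketch below uses the standard harmonic-mean definition of F1. The function name `f1_score` and the dictionary keys are chosen here for illustration, with "proposed" standing for the LDA variant that addresses the imbalance problem.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the comparison table above.
rows = {
    "LDA": (0.6250, 0.5000),
    "proposed": (0.8889, 1.0000),
}

f1_by_method = {m: round(f1_score(p, r), 4) for m, (p, r) in rows.items()}
for method, f1 in f1_by_method.items():
    print(f"{method}: F1 = {f1:.4f}")
```

Rounded to four decimals this gives 0.5556 for LDA and 0.9412 for the proposed method, in agreement with the table.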
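Number of texts | Method | Precision | Recall | F1
100 | LDA | 0.8621 | 0.4552 | 0.5958
100 | LDA addressing the imbalance problem | 0.9517 | 0.5379 | 0.6873
500 | LDA | 0.7000 | 0.5333 | 0.6054
500 | LDA addressing the imbalance problem | 0.7941 | 0.7333 | 0.7637
5,000 | LDA | 0.8800 | 0.7619 | 0.8167
5,000 | LDA addressing the imbalance problem | 0.9136 | 0.8572 | 0.8845
10,000 | LDA | 0.8291 | 0.5455 | 0.6578
10,000 | LDA addressing the imbalance problem | 0.8698 | 0.7576 | 0.8098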
Comparison of LDA Topic Extraction Effects by Different Methods
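The improvements quoted in the abstract correspond to the 10,000-text rows of this table. A short check, with the two rows copied from the table and the variable names chosen here for illustration:

```python
# Precision, recall, and F1 for the 10,000-text rows of the table above.
baseline = {"precision": 0.8291, "recall": 0.5455, "f1": 0.6578}
proposed = {"precision": 0.8698, "recall": 0.7576, "f1": 0.8098}

# Absolute gain of the proposed method over plain LDA, rounded to 4 decimals.
gains = {metric: round(proposed[metric] - baseline[metric], 4)
         for metric in baseline}

for metric, gain in gains.items():
    print(f"{metric}: +{gain:.4f}")
```

The differences are 0.0407 for precision, 0.2121 for recall, and 0.1520 for F1, the same figures reported in the abstract.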