Topic Recognition of News Reports with Imbalanced Contents
Wang Hongbin1,2, Wang Jianxiong1,2, Zhang Yafei1,2, Yang Heng3
1 Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
2 Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China
3 Yun Nan Wei Heng Ji Ye Co., Ltd., Kunming 650000, China
[Objective] This paper proposes a topic recognition method for news datasets with an imbalanced number of reports across topics, aiming to address the inaccurate topic recognition of the traditional LDA model. [Methods] First, we modified the LDA model with three feature detection methods: independence detection, variance detection, and information entropy detection. Then, we identified news topics with the modified model. [Results] We evaluated the model on a dataset of 10,000 news reports. Compared with the traditional LDA topic recognition method, the proposed method improved recall, precision, and F1 by 0.2121, 0.0407, and 0.1520, respectively. [Limitations] Because the corpus contains a large number of new words, word segmentation accuracy was unsatisfactory, which reduced the performance of news topic recognition. [Conclusions] The proposed method can effectively identify news topics from reports with imbalanced contents.
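The abstract names variance detection and information entropy detection without defining them. As an illustration only, the sketch below shows one common way such per-word scores could be computed over a tokenized corpus: the Shannon entropy of a word's occurrences across documents, and the variance of its relative frequency across documents. The function names `word_doc_entropy` and `word_freq_variance`, the toy corpus, and the exact definitions are assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def word_doc_entropy(docs):
    """Entropy (bits) of each word's occurrence distribution over documents.

    Assumed definition, not the paper's: a word spread evenly over many
    documents scores high; a word confined to one document scores 0.
    """
    per_doc = [Counter(doc) for doc in docs]
    vocab = set().union(*docs)
    scores = {}
    for w in vocab:
        counts = [c[w] for c in per_doc if c[w] > 0]
        total = sum(counts)
        scores[w] = sum(-(n / total) * math.log2(n / total) for n in counts)
    return scores


def word_freq_variance(docs):
    """Population variance of each word's relative frequency over documents.

    Assumed definition, not the paper's. Documents must be non-empty.
    """
    per_doc = [Counter(doc) for doc in docs]
    vocab = set().union(*docs)
    scores = {}
    for w in vocab:
        freqs = [c[w] / len(doc) for c, doc in zip(per_doc, docs)]
        mean = sum(freqs) / len(freqs)
        scores[w] = sum((f - mean) ** 2 for f in freqs) / len(freqs)
    return scores


# Toy corpus (hypothetical): "the" appears everywhere, "election" only once.
docs = [
    ["election", "vote", "the"],
    ["match", "goal", "the"],
    ["vote", "poll", "the"],
]
entropy = word_doc_entropy(docs)
variance = word_freq_variance(docs)
```

In this toy corpus, a word that appears uniformly in every document gets the maximum entropy and zero variance, which marks it as a weak topic indicator, while a word confined to a single document gets zero entropy and non-zero variance. How the paper combines such scores with LDA is not stated in the abstract.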