Topic Recognition of News Reports with Imbalanced Contents
Wang Hongbin1,2, Wang Jianxiong1,2, Zhang Yafei1,2, Yang Heng3
1 Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
2 Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China
3 Yun Nan Wei Heng Ji Ye Co., Ltd., Kunming 650000, China
Abstract [Objective] This paper proposes a topic recognition method for news datasets in which the number of reports varies widely across topics, addressing the inaccurate topic recognition of the traditional LDA model on such data. [Methods] First, we modified the LDA model with three feature detection methods: independence detection, variance detection, and information entropy detection. Then, we identified news topics with the modified model. [Results] We evaluated the model on a dataset of 10,000 news reports. Compared with the traditional LDA topic recognition method, the proposed method improved recall, precision, and F1 by 0.2121, 0.0407, and 0.1520, respectively. [Limitations] Because the reports contain many new words, word segmentation accuracy was unsatisfactory, which degraded topic recognition performance. [Conclusions] The proposed method can effectively identify news topics from reports with imbalanced contents.
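The recall, precision, and F1 values reported above follow the standard definitions. As a minimal sketch of how these three quantities relate, the snippet below computes them from true-positive, false-positive, and false-negative counts for a single topic. The function name and the counts are hypothetical, chosen only for illustration; they are not taken from the paper's experiments.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from per-topic counts.

    tp: reports correctly assigned to the topic
    fp: reports wrongly assigned to the topic
    fn: reports of the topic that were assigned elsewhere
    """
    # Guard against empty denominators so a topic with no predictions
    # or no true members yields 0.0 instead of raising an error.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for one topic (illustration only, not paper data).
p, r, f = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={p:.4f} recall={r:.4f} F1={f:.4f}")
```

On an imbalanced dataset, a topic with few reports tends to have a large false-negative count, which lowers recall first. This is one plausible reading of why recall shows the largest reported gain.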
Received: 05 August 2020
Published: 12 April 2021
Fund: National Natural Science Foundation of China (61966020); National Natural Science Foundation of China (61762056); Yunnan Provincial Major Science and Technology Special Plan Projects (2018ZF019)
Corresponding Authors:
Zhang Yafei
E-mail: zyfeimail@163.com