Data Analysis and Knowledge Discovery  0, Vol. Issue (): 1-    DOI: 10.11925/infotech.2096-3467. 2020.0765
Topic Recognition Research on Topic Imbalanced News Text Data Set
Wang Hongbin, Wang Jianxiong, Zhang Yafei, Yang Heng
(Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China)
(Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China)
(YUN NAN WEI HENG JI YE Co., Ltd., Kunming 650000, China)
Abstract  

[Objective] The traditional LDA model recognizes text topics inaccurately when the number of texts per topic in a news text data set is imbalanced. [Methods] This paper proposes a topic recognition method for imbalanced news text data sets that builds on the traditional LDA model and combines three feature detection methods: independence detection, variance detection, and information entropy detection. [Results] In experiments on 10,000 news texts, the proposed method improves recall by 0.2121, precision by 0.0407, and F1 value by 0.152 compared with the traditional LDA topic recognition method. [Limitations] News text contains many new words, which lowers the accuracy of the word segmentation tools used in the experiments; because topic recognition depends on segmentation accuracy, its performance suffers accordingly. [Conclusions] The experimental results show that the proposed method alleviates, to a certain extent, the difficulty LDA has in recognizing topics when the number of texts differs across topics in a news text data set.
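
The abstract names variance detection and information entropy detection but does not define them. As a rough illustration only, the Python sketch below assumes one plausible reading: each word has a weight under every topic, and a word whose weights are concentrated in one topic (low entropy, high variance) is more topic-discriminative than a word spread evenly across topics. The function names `topic_entropy` and `topic_variance` and the toy weights are hypothetical and are not taken from the paper.

```python
import math


def topic_entropy(word_topic_weights):
    """Shannon entropy (in bits) of one word's normalized weights over topics."""
    total = sum(word_topic_weights)
    probs = [w / total for w in word_topic_weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)


def topic_variance(word_topic_weights):
    """Population variance of one word's normalized weights over topics."""
    total = sum(word_topic_weights)
    probs = [w / total for w in word_topic_weights]
    mean = sum(probs) / len(probs)
    return sum((p - mean) ** 2 for p in probs) / len(probs)


# Hypothetical toy data: weight of each word under three topics.
word_weights = {
    "election": [0.90, 0.05, 0.05],  # concentrated in a single topic
    "today": [0.34, 0.33, 0.33],     # spread almost evenly over topics
}

# Map each word to its (entropy, variance) pair.
scores = {
    word: (topic_entropy(weights), topic_variance(weights))
    for word, weights in word_weights.items()
}

for word, (entropy, variance) in scores.items():
    print(f"{word}: entropy={entropy:.4f}, variance={variance:.4f}")
```

Under this assumed reading, "election" scores lower entropy and higher variance than "today", so a filter built on either statistic would keep the former as a topic feature. How the paper actually combines these detectors with independence detection is not stated in the abstract.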

Key words: Topic imbalanced; News text data set; Text topic recognition; Latent Dirichlet Allocation (LDA)
Published: 11 November 2020
ZTFLH:  TP393,G250  

Cite this article:

Wang Hongbin, Wang Jianxiong, Zhang Yafei, Yang Heng. Topic Recognition Research on Topic Imbalanced News Text Data Set . Data Analysis and Knowledge Discovery, 0, (): 1-.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467. 2020.0765     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y0/V/I/1

[1] Wang Hongbin, Wang Jianxiong, Zhang Yafei, Yang Heng. Topic Recognition of News Reports with Imbalanced Contents[J]. Data Analysis and Knowledge Discovery, 2021, 5(3): 109-120.