News Classification with Semi-Supervised and Active Learning
Chen Guo1,2, Ye Chao1
1 School of Economics & Management, Nanjing University of Science & Technology, Nanjing 210094, China
2 Jiangsu Science and Technology Collaborative Innovation Center of Social Public Safety, Nanjing 210094, China
[Objective] This paper proposes a news classification scheme that combines semi-supervised learning and active learning, aiming to improve intelligence monitoring based on news mining. [Methods] First, we learned representations of the news texts, clustered them with K-means, and selected a small number of representative samples from each cluster for manual judgment. The resulting categories were merged and adjusted to form the sub-field categories. Then, we used the representative samples as the training set for several ensemble classification algorithms to train the initial classifier. Finally, we used active learning to optimize the initial classifier. [Results] We tested the method on news about tanks and armored vehicles. After active learning, precision, recall and F1 reached 83.68%, 83.35% and 83.17%, increases of 2.71%, 2.52% and 2.81% respectively. [Limitations] To reduce the manual labeling workload, we conducted only two iterations of active learning. [Conclusions] The proposed method can effectively classify news with little corpus annotation and no pre-trained classifier. It could also be applied to other fields.
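The steps listed under [Methods] can be illustrated with a minimal Python sketch. It is not the authors' implementation and makes several assumptions: the 2-D points are toy stand-ins for learned news representations, the labels "A" and "B" and the `truth` list (playing the human annotator) are hypothetical, a nearest-centroid classifier is swapped in for the ensemble classifiers, and margin-based uncertainty sampling is one possible active learning query strategy, which the abstract does not specify.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans(points, k, iters=20):
    # Deterministic farthest-point initialisation (a simplification).
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(dist(p, c) for c in centroids))
        centroids.append(list(far))
    assign = []
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = mean(members)
    return centroids, assign

def representatives(points, centroids, assign, per_cluster=2):
    # Pick the samples closest to each cluster centre for manual labelling.
    picked = []
    for c, centre in enumerate(centroids):
        members = [i for i, a in enumerate(assign) if a == c]
        members.sort(key=lambda i: dist(points[i], centre))
        picked.extend(members[:per_cluster])
    return picked

def fit(points, labelled):
    # Nearest-centroid classifier: one mean vector per label (assumed stand-in).
    by_label = {}
    for i, y in labelled.items():
        by_label.setdefault(y, []).append(points[i])
    return {y: mean(vs) for y, vs in by_label.items()}

def predict(model, p):
    return min(model, key=lambda y: dist(p, model[y]))

def margin(model, p):
    # Small margin between the two nearest classes means high uncertainty.
    d = sorted(dist(p, c) for c in model.values())
    return d[1] - d[0] if len(d) > 1 else float("inf")

def active_learning(points, labelled, oracle, rounds=2, batch=1):
    labelled = dict(labelled)
    for _ in range(rounds):
        model = fit(points, labelled)
        pool = [i for i in range(len(points)) if i not in labelled]
        pool.sort(key=lambda i: margin(model, points[i]))
        for i in pool[:batch]:
            labelled[i] = oracle(i)  # ask the annotator about the least certain samples
    return fit(points, labelled), labelled

# Toy data: two dense groups plus two samples lying between them.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0), (10.5, 10.5),
          (5.0, 5.0), (7.0, 7.0)]
truth = ["A"] * 5 + ["B"] * 5 + ["A", "B"]  # hypothetical annotator answers

centroids, assign = kmeans(points, k=2)
seed = {i: truth[i] for i in representatives(points, centroids, assign)}
model, labelled = active_learning(points, seed, oracle=lambda i: truth[i])
print("manually labelled samples:", sorted(labelled))
print("predictions:", [predict(model, p) for p in points])
```

The two active learning rounds mirror the two iterations mentioned under [Limitations]; the `rounds` and `batch` parameters are illustrative and would be set according to the available labelling budget.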
Chen Guo, Ye Chao. News Classification with Semi-Supervised and Active Learning[J]. Data Analysis and Knowledge Discovery, 2022, 6(4): 28-38.