Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (4): 28-38    DOI: 10.11925/infotech.2096-3467.2021.0545
News Classification with Semi-Supervised and Active Learning
Chen Guo1,2, Ye Chao1
1School of Economics & Management, Nanjing University of Science & Technology, Nanjing 210094, China
2Jiangsu Science and Technology Collaborative Innovation Center of Social Public Safety, Nanjing 210094, China
Abstract  

[Objective] This paper proposes a news classification scheme that combines semi-supervised learning and active learning, aiming to improve intelligence monitoring based on news mining. [Methods] First, we learned vector representations of the news texts, ran K-means clustering on them, and selected a small number of representative samples from each cluster for manual judgment. The clusters were then merged and adjusted to form the sub-field categories. Next, we used the representative samples as the training set for an ensemble of several classification algorithms and trained the initial classifier. Finally, we used active learning to optimize the initial classifier. [Results] We tested the method on news about tanks and armored vehicles. After active learning, the precision, recall and F1 value reached 83.68%, 83.35% and 83.17%, increases of 2.71, 2.52 and 2.81 percentage points respectively over the initial classifier. [Limitations] To reduce the manual labeling workload, we conducted only two iterations of active learning. [Conclusions] The proposed method can classify news effectively with little corpus annotation and no pre-trained classifier, and it could also be applied to other fields.
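The clustering and sample-selection step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how representative samples are chosen, so the sketch assumes the samples nearest each cluster centroid are picked, and the function names `kmeans` and `representatives`, the toy 2-D vectors, and the first-k-points initialization are all hypothetical.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; centroids start at the first k points (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def representatives(points, labels, centroids, per_cluster=2):
    """Return, per cluster, the indices of the points closest to its centroid."""
    chosen = {}
    for c, centre in enumerate(centroids):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        idx.sort(key=lambda i: math.dist(points[i], centre))
        chosen[c] = idx[:per_cluster]
    return chosen

# Toy stand-ins for document vectors (real ones would come from a text representation model).
docs = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
labels, centroids = kmeans(docs, k=2)
reps = representatives(docs, labels, centroids, per_cluster=1)
print(labels, reps)
```

The indices in `reps` would be the documents handed to a human annotator; every other document in the cluster stays unlabeled until the later active learning rounds.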

Key words: Semi-Supervised Learning; Active Learning; Text Classification; Ensemble Learning
Received: 01 June 2021      Published: 12 May 2022
ZTFLH:  G350  
Fund: Youth Foundation of Social Science and Humanity, China Ministry of Education (21YJC870003); Social Science Fund of Jiangsu Province (21TQC002)
Corresponding Author: Chen Guo, ORCID: 0000-0003-2873-1051, E-mail: dephi1987@qq.com

Cite this article:

Chen Guo, Ye Chao. News Classification with Semi-Supervised and Active Learning. Data Analysis and Knowledge Discovery, 2022, 6(4): 28-38.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0545     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I4/28

Flow Chart of News Classification in Subdivided Fields Based on Clustering and Active Learning
Active Learning Process
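One common way to decide which unlabeled news items go to the annotator in each active learning round is least-confidence sampling. The sketch below assumes that strategy purely for illustration; the page does not state which query strategy the authors used, and the function name `least_confident` and the probability values are hypothetical.

```python
def least_confident(probabilities, batch_size):
    """Return indices of the samples whose top class probability is lowest."""
    ranked = sorted(range(len(probabilities)), key=lambda i: max(probabilities[i]))
    return ranked[:batch_size]

# Hypothetical class probabilities from the current classifier for four unlabeled items.
pool_probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.55, 0.30, 0.15],
    [0.34, 0.33, 0.33],
]
to_label = least_confident(pool_probs, batch_size=2)
print(to_label)
```

The selected items are labeled manually, added to the training set, and the classifier is retrained before the next round.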
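Topic category                         Count
Military operations and deployment     387
Weapons and equipment trade            185
Military exercises                     317
New equipment technology               311
Noise (irrelevant news)                687
Total                                  1,887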
Statistics of News Topics in Tank and Armored Vehicle Field
Base classifier    Initial weight
                   Training subset 1    Training subset 2    Training subset 3
Random forest      0.68                 0.70                 0.70
SVM                0.89                 0.84                 0.86
Softmax            0.90                 0.87                 0.87
Weight of Initial Base Classifier
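To show how per-classifier weights like those in the table could enter an ensemble decision, here is a minimal weighted-vote sketch. The combination rule is an assumption: the page lists the weights but does not say how they are applied. The weights are the training subset 1 column from the table; the function name `weighted_vote` and the example predictions are hypothetical.

```python
def weighted_vote(predictions, weights):
    """Add each base classifier's weight to the label it predicts; return the top label and all scores."""
    scores = {}
    for name, label in predictions.items():
        scores[label] = scores.get(label, 0.0) + weights[name]
    best = max(scores, key=scores.get)
    return best, scores

# Initial weights for training subset 1, taken from the table above.
weights = {"random_forest": 0.68, "svm": 0.89, "softmax": 0.90}
# Hypothetical predictions of the three base classifiers for one news item.
predictions = {
    "random_forest": "military exercises",
    "svm": "noise",
    "softmax": "military exercises",
}
label, scores = weighted_vote(predictions, weights)
print(label, scores)
```

Under this rule a single high-weight classifier can be outvoted when the other two agree, as happens here with the SVM.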
Confusion Matrix Analysis Based on Training Set
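Classification model                      Precision/%    Recall/%    F1/%
Initial classification model              80.97          80.83       80.36
After first round of active learning      83.38          83.00       82.51
After second round of active learning     83.68          83.35       83.17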
Classification Results of Weapon Equipment News Based on Active Learning
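The improvements quoted in the abstract are differences in percentage points between the initial model and the model after the second round. A short check using only the figures in the table (the variable names are illustrative):

```python
# Values copied from the results table: initial model vs. after the second round.
initial = {"precision": 80.97, "recall": 80.83, "f1": 80.36}
final = {"precision": 83.68, "recall": 83.35, "f1": 83.17}

# Gain in percentage points for each metric.
gains = {metric: round(final[metric] - initial[metric], 2) for metric in initial}
print(gains)
```

This reproduces the 2.71, 2.52 and 2.81 point gains reported in the abstract.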
Confusion Matrix Analysis Based on Test Set