News Classification with Semi-Supervised and Active Learning
Chen Guo1,2, Ye Chao1
1 School of Economics & Management, Nanjing University of Science & Technology, Nanjing 210094, China
2 Jiangsu Science and Technology Collaborative Innovation Center of Social Public Safety, Nanjing 210094, China

Abstract [Objective] This paper proposes a news classification scheme that combines semi-supervised learning and active learning, aiming to improve intelligence monitoring based on news mining. [Methods] First, we learned news text representations, ran K-means clustering on them, and selected a small number of representative samples from each cluster for manual judgment. The resulting categories were merged and adjusted to form the sub-field categories. Then, we used the representative samples as the training set for a variety of integrated classification algorithms to train the initial classifier. Finally, we used active learning to optimize the initial classifier. [Results] We tested the proposed model on news about tanks and armored vehicles. After active learning, the precision, recall, and F1 value reached 83.68%, 83.35%, and 83.17%, increases of 2.71%, 2.52%, and 2.81%, respectively. [Limitations] To reduce the manual labeling work, we conducted only 2 iterations. [Conclusions] The proposed method can effectively classify news with little corpus annotation and no pre-trained classifier. It could also be used in other fields.
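
The workflow outlined in the abstract (cluster, label a few representatives, train an initial classifier, then refine it with active learning) can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation and rests on several assumptions: toy 2-D tuples stand in for learned news text representations, a nearest-centroid classifier replaces the integrated classification algorithms, the query strategy is margin-based uncertainty sampling (the abstract does not name one), and the `true_labels` list plays the role of the human annotator. All function names are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(points, k, iters=20):
    """Lloyd's k-means with deterministic farthest-first initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    assign = []
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        centroids = [
            # An empty cluster keeps its previous centroid.
            mean([p for p, a in zip(points, assign) if a == c] or [centroids[c]])
            for c in range(k)
        ]
    return centroids, assign


def representatives(points, centroids, assign, per_cluster=2):
    """Indices of the points closest to each centroid, to be labeled by hand."""
    chosen = []
    for c, centre in enumerate(centroids):
        members = [i for i, a in enumerate(assign) if a == c]
        members.sort(key=lambda i: dist(points[i], centre))
        chosen.extend(members[:per_cluster])
    return chosen


def train(points, labels):
    """Nearest-centroid classifier: one mean vector per label."""
    by_label = {}
    for p, y in zip(points, labels):
        by_label.setdefault(y, []).append(p)
    return {y: mean(ps) for y, ps in by_label.items()}


def predict(model, p):
    """Label whose class centroid is nearest to p."""
    return min(model, key=lambda y: dist(p, model[y]))


def most_uncertain(model, points, pool, n=2):
    """Pool indices with the smallest gap between the two nearest class centroids."""
    def margin(i):
        d = sorted(dist(points[i], centre) for centre in model.values())
        return d[1] - d[0]  # assumes the model already knows at least two labels
    return sorted(pool, key=margin)[:n]


# Toy 2-D "document vectors"; indices 4 and 9 sit between the two groups.
docs = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2),
        (5, 5), (5, 6), (6, 5), (6, 6), (4, 4)]
true_labels = ["A"] * 5 + ["B"] * 5  # stands in for the human annotator

centroids, assign = kmeans(docs, k=2)
labeled = representatives(docs, centroids, assign)  # seed set to annotate
model = train([docs[i] for i in labeled], [true_labels[i] for i in labeled])

history = []
for _ in range(2):  # two query rounds, mirroring the 2 iterations in the abstract
    pool = [i for i in range(len(docs)) if i not in labeled]
    queries = most_uncertain(model, docs, pool)
    history.append(queries)
    labeled += queries
    model = train([docs[i] for i in labeled], [true_labels[i] for i in labeled])

accuracy = sum(predict(model, d) == t for d, t in zip(docs, true_labels)) / len(docs)
print(history, accuracy)
```

In this toy run the first query round selects the two in-between points, which is the behaviour active learning is meant to provide: annotation effort goes to the samples the current classifier is least sure about rather than to samples it already handles well.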

Received: 01 June 2021
Published: 12 May 2022

Fund: Youth Foundation of Social Science and Humanity, China Ministry of Education (21YJC870003); Social Science Fund of Jiangsu Province (21TQC002)
Corresponding Author: Chen Guo, ORCID: 0000-0003-2873-1051, E-mail: dephi1987@qq.com