Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (4): 28-38     https://doi.org/10.11925/infotech.2096-3467.2021.0545
Research Paper
News Classification with Semi-Supervised and Active Learning*
Chen Guo1,2, Ye Chao1
1School of Economics & Management, Nanjing University of Science & Technology, Nanjing 210094, China
2Jiangsu Science and Technology Collaborative Innovation Center of Social Public Safety, Nanjing 210094, China
Abstract

[Objective] This paper proposes a news classification scheme for subdivided domains that combines semi-supervised learning and active learning, aimed at open-source technical intelligence monitoring based on news text mining. [Methods] First, we ran K-Means clustering on learned representations of the news texts and selected a small number of representative samples from each cluster for manual labeling; the resulting labels were merged and adjusted to form the subdivided-domain categories. Second, we used the representative samples as the training set and trained an initial classifier as an ensemble of several classification algorithms. Finally, we applied active learning, guided by perplexity and the confusion matrix, to iteratively refine the initial classifier in a targeted way. [Results] On a dataset of news about tanks and armored vehicles, precision, recall, and F1 reached 83.68%, 83.35%, and 83.17% after active learning, gains of 2.71, 2.52, and 2.81 percentage points over the initial classifier. [Limitations] To limit the manual labeling workload, active learning was run for only two iterations. [Conclusions] Starting from an unlabeled corpus with no predefined subcategories, the proposed scheme yields a well-performing subdivided-domain news classifier at a small cost in human effort. It is cost-effective in practice and generalizes well to other domains.

Keywords: Semi-Supervised Learning; Active Learning; Text Classification; Ensemble Learning
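Received: 2021-06-01      Published: 2022-05-12
Chinese Library Classification (ZTFLH):  G350
Funding: *Youth Project of Humanities and Social Sciences Research of the Ministry of Education (21YJC870003); Youth Project of the Jiangsu Social Science Fund (21TQC002)
Corresponding author: Chen Guo, ORCID: 0000-0003-2873-1051     E-mail: dephi1987@qq.com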
Cite this article:
Chen Guo, Ye Chao. News Classification with Semi-Supervised and Active Learning. Data Analysis and Knowledge Discovery, 2022, 6(4): 28-38.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0545      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I4/28
Fig.1  Workflow for subdivided-domain news classification combining clustering and active learning
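The first stage of the workflow, as the abstract describes it, clusters document vectors with K-Means and passes a few representative samples per cluster to a human annotator. The following is a minimal sketch of that idea, not the authors' implementation. The toy 2-D vectors, the farthest-point seeding, and the choice of "closest to the centroid" as the criterion for a representative sample are all assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(vectors, k):
    """Deterministic seeding (an assumption, chosen for reproducibility):
    start at vectors[0], then repeatedly add the vector that is farthest
    from all centroids chosen so far."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def k_means(vectors, k, iterations=20):
    """Lloyd's algorithm; returns (centroids, cluster index per vector)."""
    centroids = farthest_point_init(vectors, k)
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members (unchanged if empty).
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


def representative_samples(vectors, centroids, assignment, per_cluster=1):
    """For each cluster, return the indices of the vectors nearest its centroid."""
    picks = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, a in enumerate(assignment) if a == c]
        members.sort(key=lambda i: squared_distance(vectors[i], centroid))
        picks[c] = members[:per_cluster]
    return picks


# Toy 2-D vectors standing in for learned news-text representations.
docs = [[0, 0], [2, 0], [1, 1], [10, 10], [12, 10], [11, 11]]
centroids, assignment = k_means(docs, k=2)
to_label = representative_samples(docs, centroids, assignment, per_cluster=1)
```

Here `to_label` maps each cluster to the document indices an annotator would inspect. In practice the vectors would come from a text representation model and `per_cluster` would be larger.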
Fig.2  Active learning workflow
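One step of an active learning loop is choosing which unlabeled samples to send for annotation. The abstract says this choice is guided by perplexity, but this page does not define that measure. The sketch below therefore substitutes Shannon entropy of the predicted class probabilities as a stand-in uncertainty score. The sample ids and probabilities are invented for illustration.

```python
import math


def entropy(probabilities):
    """Shannon entropy (in nats) of a predicted class distribution.
    Used here as an assumed stand-in for the paper's perplexity measure."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def select_for_labeling(pool, batch_size):
    """pool maps a sample id to its predicted class probabilities.
    Returns the ids of the batch_size most uncertain samples."""
    ranked = sorted(pool, key=lambda sid: entropy(pool[sid]), reverse=True)
    return ranked[:batch_size]


# Hypothetical classifier outputs for four unlabeled news items.
pool = {
    "news_a": [0.96, 0.02, 0.02],  # confident prediction
    "news_b": [0.40, 0.35, 0.25],  # fairly uncertain
    "news_c": [0.70, 0.20, 0.10],
    "news_d": [0.34, 0.33, 0.33],  # almost uniform, most uncertain
}
queried = select_for_labeling(pool, batch_size=2)
```

The queried samples would be labeled by hand, added to the training set, and the classifier retrained. The abstract reports two such iterations.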
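Topic category                        Count
Military operations and deployment    387
Weapons and equipment trade           185
Military exercises                    317
New equipment and technology          311
Noise                                 687
Total                                 1 887
Table 1  Topic categories of news in the tank and armored vehicle domain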
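As a quick consistency check on Table 1, the per-category counts can be summed and converted to shares of the dataset. This is only a worked calculation on the published counts. The English category names are translations.

```python
# Counts as published in Table 1.
counts = {
    "Military operations and deployment": 387,
    "Weapons and equipment trade": 185,
    "Military exercises": 317,
    "New equipment and technology": 311,
    "Noise": 687,
}

total = sum(counts.values())

# Share of each category in percent, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

largest = max(counts, key=counts.get)
```

The sum matches the published total, and the noise category is the largest single class, so the dataset is noticeably imbalanced.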
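Base classifier    Initial weight
                   Sub-training set 1    Sub-training set 2    Sub-training set 3
Random forest      0.68                  0.70                  0.70
SVM                0.89                  0.84                  0.86
Softmax            0.90                  0.87                  0.87
Table 2  Initial weights of the base classifiers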
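This page does not say how the weights in Table 2 enter the ensemble. One common way to use such weights is weighted majority voting, sketched below under that assumption. It uses only the weights for sub-training set 1, and the predictions are invented for illustration.

```python
from collections import defaultdict

# Initial weights from Table 2, sub-training set 1.
weights = {"random_forest": 0.68, "svm": 0.89, "softmax": 0.90}


def weighted_vote(predictions, weights):
    """predictions maps a base classifier name to its predicted label.
    Each classifier adds its weight to the label it predicts; the label
    with the highest total wins (ties broken alphabetically)."""
    scores = defaultdict(float)
    for name, label in predictions.items():
        scores[label] += weights[name]
    return max(sorted(scores), key=lambda label: scores[label])


# Hypothetical predictions for one news item.
predictions = {
    "random_forest": "Military exercises",
    "svm": "Noise",
    "softmax": "Noise",
}
label = weighted_vote(predictions, weights)
```

With these weights, any two base classifiers that agree outvote the third, so the weights only change the outcome when all three disagree.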
Fig.3  Confusion matrix analysis on the training set
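A confusion matrix shows which category pairs a classifier mixes up most often, which is one way active learning can be targeted at specific classes. The sketch below counts (true, predicted) pairs and ranks the off-diagonal cells. The labels and predictions are invented, and the ranking rule is an assumption, not the authors' procedure.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


def most_confused_pairs(matrix, top_n=1):
    """Return the top_n misclassified (true, predicted) pairs,
    most frequent first (ties broken by label order)."""
    errors = {pair: n for pair, n in matrix.items() if pair[0] != pair[1]}
    return sorted(errors, key=lambda pair: (-errors[pair], pair))[:top_n]


# Hypothetical labels for seven news items.
true_labels = ["exercise", "exercise", "exercise",
               "deployment", "deployment", "trade", "trade"]
predicted = ["deployment", "deployment", "exercise",
             "deployment", "exercise", "trade", "trade"]

matrix = confusion_counts(true_labels, predicted)
worst = most_confused_pairs(matrix, top_n=1)
```

In this toy case the classifier most often labels exercise news as deployment news, so additional labeled samples from those two categories would be the natural next request.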
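Classification model                  Precision/%    Recall/%    F1/%
Initial classification model          80.97          80.83       80.36
After first round of active learning  83.38          83.00       82.51
After second round of active learning 83.68          83.35       83.17
Table 3  Classification results for weapons and equipment news with active learning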
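The gains reported for active learning are differences in percentage points between rows of Table 3, not relative percentage increases. A short check on the published figures:

```python
# (precision, recall, F1) in percent, as published in Table 3.
results = {
    "initial": (80.97, 80.83, 80.36),
    "round_1": (83.38, 83.00, 82.51),
    "round_2": (83.68, 83.35, 83.17),
}


def gain_in_points(before, after):
    """Difference in percentage points for each metric."""
    return tuple(round(a - b, 2) for b, a in zip(before, after))


total_gain = gain_in_points(results["initial"], results["round_2"])
second_round_gain = gain_in_points(results["round_1"], results["round_2"])
```

`total_gain` reproduces the 2.71, 2.52, and 2.81 points quoted in the abstract. `second_round_gain` shows that most of the improvement came from the first round.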
Fig.4  Confusion matrix analysis on the test set