Data Analysis and Knowledge Discovery, 2020, Vol. 4, Issue (10): 58-69     https://doi.org/10.11925/infotech.2096-3467.2020.0219
Research Article
Semi-Supervised Method for Text Classification Based on DW-TCI
Yu Bengong 1,2, Ji Haomin 1
1School of Management, Hefei University of Technology, Hefei 230009, China
2Key Laboratory of Process Optimization & Intelligent Decision-Making, Ministry of Education, Hefei University of Technology, Hefei 230009, China
Abstract

[Objective] This paper proposes a new semi-supervised text classification method that classifies texts efficiently when only a small number of labeled samples are available. [Methods] The proposed DW-TCI method uses dual-channel feature extraction to produce two sets of feature input vectors for a group of base classifiers, and combines disagreement-based semi-supervised classification with ensemble learning. Unlabeled samples on which the base classifiers reach consensus are added, with their pseudo-labels, to the training set, and the label of a test text is obtained by equal-weighted voting over the base classifiers. [Results] When trained with 20% labeled samples on two different data sets, DW-TCI reached classification accuracies of 92.32% and 87.01%, at least 5.54% and 5.65% higher than the other semi-supervised classification methods compared. [Limitations] The method was evaluated on only two data sets and has not yet been validated more broadly. [Conclusions] DW-TCI greatly reduces the amount of labeling required for training samples and provides effective support for service providers to classify texts efficiently.
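The abstract compresses the training procedure into a few clauses; the Python sketch below spells out one way the consensus-based pseudo-labeling and equal-weighted voting could be wired together. The choice of base classifiers (LinearSVC, LogisticRegression, RandomForestClassifier), the all-classifiers-agree consensus rule, and the round count are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch of a dual-view, disagreement-based semi-supervised loop
# with equal-weighted voting. Classifier choices and the consensus rule
# are assumptions for illustration only.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def train_dw_tci_like(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab, rounds=5):
    """X1_*/X2_* are the two feature views from dual-channel extraction;
    y_lab holds non-negative integer-encoded class labels."""
    # Base classifier group: each feature channel feeds its own diverse learners.
    view1_clfs = [LinearSVC(), RandomForestClassifier(n_estimators=100)]
    view2_clfs = [LogisticRegression(max_iter=1000),
                  RandomForestClassifier(n_estimators=100)]

    for _ in range(rounds):
        for clf in view1_clfs:
            clf.fit(X1_lab, y_lab)
        for clf in view2_clfs:
            clf.fit(X2_lab, y_lab)
        if len(X1_unlab) == 0:
            break
        # Every base classifier predicts the unlabeled pool on its own channel.
        preds = np.array([c.predict(X1_unlab) for c in view1_clfs]
                         + [c.predict(X2_unlab) for c in view2_clfs])
        # Consensus rule (assumed): keep samples all classifiers agree on and
        # move them, with their pseudo-labels, into the labeled pool.
        agree = np.all(preds == preds[0], axis=0)
        if not agree.any():
            break
        X1_lab = np.vstack([X1_lab, X1_unlab[agree]])
        X2_lab = np.vstack([X2_lab, X2_unlab[agree]])
        y_lab = np.concatenate([y_lab, preds[0][agree]])
        X1_unlab, X2_unlab = X1_unlab[~agree], X2_unlab[~agree]

    # Final refit on the expanded labeled pool.
    for clf in view1_clfs:
        clf.fit(X1_lab, y_lab)
    for clf in view2_clfs:
        clf.fit(X2_lab, y_lab)
    return view1_clfs, view2_clfs


def predict_equal_weighted(view1_clfs, view2_clfs, X1, X2):
    """Equal-weighted voting: every base classifier gets one vote per sample."""
    votes = np.array([c.predict(X1) for c in view1_clfs]
                     + [c.predict(X2) for c in view2_clfs])
    # Majority label per sample (labels assumed non-negative integers).
    return np.array([np.bincount(col).argmax() for col in votes.T])
```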

Keywords: Semi-Supervised Classification; Sample Divergence; Classifier Divergence; Ensemble Learning
Received: 2020-03-19      Published: 2020-07-28
CLC Number: TP391
Funding: This work was supported by the National Natural Science Foundation of China project "Research on Product R&D Knowledge Integration and Service Mechanism Based on Manufacturing Big Data" (No. 71671057) and an open project of the Key Laboratory of Process Optimization & Intelligent Decision-Making, Ministry of Education.
Corresponding author: Ji Haomin, E-mail: 851405185@qq.com
Cite this article:
Yu Bengong, Ji Haomin. Semi-Supervised Method for Text Classification Based on DW-TCI. Data Analysis and Knowledge Discovery, 2020, 4(10): 58-69.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2020.0219      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I10/58
Fig.1  Structure of the DW-TCI model
Fig.2  Structures of CBOW and Skip-gram
Fig.3  Structure of the base classifier group
Fig.4  Classification workflow of the DW-TCI model
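As a companion to the dual-channel structure shown in Fig.2 and Fig.3, the snippet below sketches one plausible way to build the two feature views with gensim's Word2Vec, using CBOW for one channel and Skip-gram for the other and averaging word vectors into document vectors. The averaging step, the vector size, and the toy corpus are assumptions for illustration; the paper's exact feature construction is not reproduced here.

```python
# Illustrative dual-channel feature extraction: two Word2Vec variants
# (CBOW and Skip-gram) yield two document-vector views.
import numpy as np
import jieba                      # Chinese word segmentation
from gensim.models import Word2Vec

# Toy corpus standing in for the automobile-review / Sogou news texts.
corpus = ["这款车的油耗很低，性价比很高", "变速箱顿挫明显，驾驶体验一般"]
tokenized = [list(jieba.cut(doc)) for doc in corpus]

# Channel 1: CBOW (sg=0); Channel 2: Skip-gram (sg=1).
cbow = Word2Vec(tokenized, vector_size=100, sg=0, min_count=1)
skipgram = Word2Vec(tokenized, vector_size=100, sg=1, min_count=1)


def doc_vector(model, tokens):
    """Average the vectors of the tokens the model knows about."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)


X1 = np.array([doc_vector(cbow, toks) for toks in tokenized])      # feature view 1
X2 = np.array([doc_vector(skipgram, toks) for toks in tokenized])  # feature view 2
```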
Data item                    Automobile Reviews    Sogou News
Source                       Autohome              Sogou Labs open-source dataset
Number of classes            2                     5
Number of samples            8,334 / 9,195         2,000 / 2,000 / 2,000 / 2,000 / 2,000
Average length (characters)  45                    843
Minimum length (characters)  3                     30
Maximum length (characters)  1,519                 19,870
Table 1  Dataset information
Fig.5  Comparison of classification performance
Fig.6  Comparison of classification accuracy with different encoding methods
Fig.7  Comparison of semi-supervised text classification models