Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (10): 58-69    DOI: 10.11925/infotech.2096-3467.2020.0219
Semi-Supervised Method for Text Classification Based on DW-TCI
Yu Bengong1,2, Ji Haomin1
1School of Management, Hefei University of Technology, Hefei 230009, China
2Key Laboratory of Process Optimization & Intelligent Decision-Making, Ministry of Education, Hefei University of Technology, Hefei 230009, China
Abstract  

[Objective] This paper proposes a new semi-supervised method for text classification, aiming to classify texts efficiently when only a small number of samples are annotated.
[Methods] The proposed DW-TCI method uses double-channel feature extraction to obtain two sets of feature vectors as input to the base classifier group. It then combines divergence-based semi-supervised classification with ensemble learning. Finally, the model is trained with the unlabeled samples, and the class of each text to be predicted is obtained by equal-weight voting.
[Results] We evaluated the method on two data sets with 20% labeled samples. The classification accuracy reached 92.32% and 87.01%, at least 5.54% and 5.65% higher than that of similar methods.
[Limitations] The sample data sets need to be expanded.
[Conclusions] The proposed method can reduce the labeling workload for training samples and effectively support better text classification.
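
The voting and pseudo-labeling steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the exact selection rule, so the agreement rule below (a sample is pseudo-labeled when all peer classifiers agree, in the spirit of tri-training-style methods) is an assumption, and the function names `equal_weight_vote` and `pseudo_label_by_agreement` are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple

Label = Hashable


def equal_weight_vote(predictions: Sequence[Label]) -> Label:
    """Return the majority label; every base classifier has the same weight.

    Ties go to the label that appears first in `predictions`
    (an assumption; the abstract does not specify tie-breaking).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label
    raise AssertionError("unreachable")


def pseudo_label_by_agreement(
    peer_predictions: Sequence[Sequence[Label]],
) -> List[Tuple[int, Label]]:
    """Pick unlabeled samples on which all peer classifiers agree.

    `peer_predictions[k][i]` is peer classifier k's label for unlabeled
    sample i. Returns (sample index, agreed label) pairs that could be
    added to another classifier's training set. This rule is assumed,
    not taken from the paper.
    """
    selected = []
    for idx, votes in enumerate(zip(*peer_predictions)):
        if len(set(votes)) == 1:
            selected.append((idx, votes[0]))
    return selected


if __name__ == "__main__":
    # Three base classifiers vote on one text.
    print(equal_weight_vote(["positive", "negative", "positive"]))
    # Two peers label three unlabeled texts; they agree on samples 0 and 2.
    print(pseudo_label_by_agreement([["a", "b", "a"], ["a", "a", "a"]]))
```

In this sketch the final prediction for a text is simply the most frequent label among the base classifiers, which is what equal weighting reduces to when each classifier casts one vote.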

Key words: Semi-Supervised Classification; Sample Divergence; Classifier Divergence; Ensemble Learning
Received: 19 March 2020      Published: 28 July 2020
Chinese Library Classification (CLC) number: TP391
Corresponding Authors: Ji Haomin     E-mail: 851405185@qq.com

Cite this article:

Yu Bengong, Ji Haomin. Semi-Supervised Method for Text Classification Based on DW-TCI. Data Analysis and Knowledge Discovery, 2020, 4(10): 58-69.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.0219     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I10/58

Figure: DW-TCI Model Structure
Figure: CBOW and Skip-gram Structure
Figure: Structure of the Base Classifier Group
Figure: Classification Flowchart of DW-TCI Model
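Item                           Car reviews    Sogou news
Source                         Autohome       Sogou Labs open-source data set
Number of classes              2              5
Number of samples              8334/9195      2000/2000/2000/2000/2000
Average length (characters)    45             843
Minimum length (characters)    3              30
Maximum length (characters)    1,519          19,870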
Data Set
Classification Effect Evaluation
The Classification Accuracy of Different Encoding Methods
The Effects of Each Semi-Supervised Text Classification Model