Semi-Supervised Method for Text Classification Based on DW-TCI
Yu Bengong1,2, Ji Haomin1
1 School of Management, Hefei University of Technology, Hefei 230009, China
2 Key Laboratory of Process Optimization & Intelligent Decision-Making, Ministry of Education, Hefei University of Technology, Hefei 230009, China
[Objective] This paper proposes a new semi-supervised text classification method, DW-TCI, which aims to classify texts efficiently when only a small number of labeled samples is available. [Methods] The method uses dual-channel feature extraction to obtain two sets of feature vectors that serve as inputs to groups of base classifiers. It then combines divergence-based semi-supervised classification with the idea of ensemble learning, trains the model with the unlabeled samples, and obtains the classification result for a predicted text by equally weighted voting. [Results] We evaluated the method on two data sets, each with 20% labeled samples. Classification accuracy reached 92.32% and 87.01%, at least 5.54% and 5.65% higher than those of similar methods. [Limitations] The sample data sets need to be expanded. [Conclusions] The proposed method reduces the labeling workload for training samples and provides effective support for better text classification results.
Yu Bengong,Ji Haomin. Semi-Supervised Method for Text Classification Based on DW-TCI. Data Analysis and Knowledge Discovery, 2020, 4(10): 58-69.
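The pipeline outlined in the abstract can be illustrated with a minimal sketch. This is not the authors' code: it stands in for DW-TCI with a generic two-view, vote-based semi-supervised scheme in which two feature "channels" each feed a small group of base classifiers, unlabeled samples are pseudo-labeled by the group trained on the other view (the divergence between views is what makes the exchange informative), and the final label is an equally weighted vote over all base classifiers. The data, classifier choices, and single pseudo-labeling round are all assumptions made for the sake of a runnable example.

```python
# Hypothetical sketch of a dual-view, vote-based semi-supervised
# classifier in the spirit of DW-TCI; not the paper's implementation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20,
                           n_informative=10, random_state=42)
# Two "channels": in the paper these come from two different
# feature-extraction methods; here we simply split the columns.
views = (X[:, :10], X[:, 10:])

# Hold out 100 samples for testing; of the remaining 500
# training samples, only 20% keep their labels.
idx_lab, idx_unlab = train_test_split(np.arange(500),
                                      train_size=0.2, random_state=42)
X_test, y_test = X[500:], y[500:]

def make_group(seed):
    # A small, heterogeneous group of base classifiers per view.
    return [LogisticRegression(max_iter=1000),
            RandomForestClassifier(n_estimators=50, random_state=seed)]

groups = [make_group(s) for s in (0, 1)]  # one group per view

# Round 1: fit each group on the labeled pool of its own view.
for g, V in zip(groups, views):
    for clf in g:
        clf.fit(V[idx_lab], y[idx_lab])

# Pseudo-label the unlabeled pool by averaging each group's
# class probabilities, then refit each group using the labels
# produced by the *other* view's group.
pseudo = []
for g, V in zip(groups, views):
    votes = np.mean([clf.predict_proba(V[idx_unlab]) for clf in g], axis=0)
    pseudo.append(votes.argmax(axis=1))
for g, V, p in zip(groups, views, reversed(pseudo)):
    X_aug = np.vstack([V[idx_lab], V[idx_unlab]])
    y_aug = np.concatenate([y[idx_lab], p])
    for clf in g:
        clf.fit(X_aug, y_aug)

def predict(test_views):
    # Equally weighted vote over all base classifiers in both groups.
    probs = [clf.predict_proba(V)
             for g, V in zip(groups, test_views) for clf in g]
    return np.mean(probs, axis=0).argmax(axis=1)

y_pred = predict((X_test[:, :10], X_test[:, 10:]))
acc = (y_pred == y_test).mean()
```

In practice the two channels would be distinct text representations (for example, embedding-based versus statistical features), and the pseudo-labeling round would be repeated with a confidence threshold rather than applied once to the whole unlabeled pool.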