Semi-Supervised Method for Text Classification Based on DW-TCI
Yu Bengong1,2, Ji Haomin1
1 School of Management, Hefei University of Technology, Hefei 230009, China
2 Key Laboratory of Process Optimization & Intelligent Decision-Making, Ministry of Education, Hefei University of Technology, Hefei 230009, China

Abstract [Objective] This paper proposes a new semi-supervised method for text classification that aims to process texts efficiently when only a small share of them is annotated. [Methods] The proposed DW-TCI-based method uses double-channel feature extraction to obtain two sets of feature vectors as input to a group of base classifiers. It then combines disagreement-based semi-supervised classification with ensemble learning. Finally, the model is trained with the unlabeled samples, and the class of each predicted text is decided by equal-weight voting. [Results] We evaluated the method on two data sets in which 20% of the samples were labeled. Classification accuracy reached 92.32% and 87.01%, at least 5.54% and 5.65% higher than that of similar methods. [Limitations] The sample data set needs to be expanded. [Conclusions] The proposed method can reduce the labeling workload for training samples and provide effective support for better text classification results.
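
The voting step named in the abstract, together with the use of disagreement (divergence) among base classifiers, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `equal_weight_vote` and `pseudo_label`, the example labels, and the tie-breaking rule are assumptions made for illustration, and the pseudo-labeling rule shown (peers must agree before an unlabeled sample is accepted) is one common disagreement-based scheme rather than a detail confirmed by the abstract.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def equal_weight_vote(predictions: Sequence[Hashable]) -> Hashable:
    """Return the label with the most votes; every classifier counts once.

    Ties go to the tied label that appears first in `predictions`
    (an assumed rule; the abstract does not specify tie-breaking).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    # Scan in input order so tie-breaking is deterministic.
    for label in predictions:
        if counts[label] == best:
            return label


def pseudo_label(peer_predictions: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the peers' shared label if all peers agree, else None.

    Assumed scheme: an unlabeled sample is added to one classifier's
    training set only when the other classifiers agree on its label.
    """
    if not peer_predictions:
        return None
    first = peer_predictions[0]
    return first if all(p == first for p in peer_predictions) else None


# Hypothetical predictions from three base classifiers for one text.
votes = ["sports", "finance", "sports"]
print(equal_weight_vote(votes))               # sports
print(pseudo_label(["finance", "finance"]))   # finance
print(pseudo_label(["finance", "sports"]))    # None
```

With equal weights, the ensemble decision reduces to a plain majority vote, so no per-classifier weight needs to be tuned on the small labeled set.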
Received: 19 March 2020
Published: 28 July 2020
Corresponding Author: Ji Haomin
E-mail: 851405185@qq.com