[Objective] This paper combines word vectors from different pre-trained models (Word2Vec, BERT, ERNIE, etc.) to enhance text semantics and improve short-text news classification. [Methods] We used the BERT and ERNIE models to extract contextual semantics, together with prior knowledge of entities and phrases obtained through domain-adaptive pretraining. Combined with the TextCNN model, the proposed method generated high-order text feature vectors and merged these features to achieve semantic enhancement and better short-text classification. [Results] We evaluated the proposed method on public datasets from Toutiao (Today's Headlines) and THUCNews. Compared with the traditional Word2Vec word-vector representation, the accuracy of our model improved by 6.37% and 3.50% respectively; compared with the BERT and ERNIE baselines, it improved by 1.98% and 1.51% respectively. [Limitations] The news corpus used in our study needs further expansion. [Conclusions] The proposed method can effectively classify massive short-text data, which is of great significance to follow-up text mining.
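The fusion described above (token-level features from two pre-trained encoders, concatenated and fed to a TextCNN classifier) can be sketched as follows. This is a minimal illustration, not the paper's exact architecture: the kernel sizes, filter count, and class count are assumed hyperparameters, and random tensors stand in for real BERT/ERNIE hidden states.

```python
import torch
import torch.nn as nn


class FusionTextCNN(nn.Module):
    """TextCNN head over concatenated features from two pre-trained encoders.

    Illustrative sketch: hyperparameters are assumptions, not the paper's values.
    """

    def __init__(self, dim_bert=768, dim_ernie=768, n_classes=10,
                 kernel_sizes=(2, 3, 4), n_filters=64):
        super().__init__()
        fused = dim_bert + dim_ernie  # feature-level fusion by concatenation
        # One 1-D convolution per kernel size, as in Kim's TextCNN
        self.convs = nn.ModuleList(
            nn.Conv1d(fused, n_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, bert_out, ernie_out):
        # bert_out, ernie_out: (batch, seq_len, dim) token-level hidden states
        x = torch.cat([bert_out, ernie_out], dim=-1)   # (B, T, fused)
        x = x.transpose(1, 2)                          # Conv1d expects (B, C, T)
        # Convolve, apply ReLU, then max-pool over time for each kernel size
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))       # (B, n_classes) logits


# Stand-in encoder outputs; in real use these would be BERT/ERNIE hidden states
bert_out = torch.randn(4, 32, 768)
ernie_out = torch.randn(4, 32, 768)
logits = FusionTextCNN()(bert_out, ernie_out)
print(logits.shape)  # torch.Size([4, 10])
```

In practice the two encoders would each process the tokenized news title, and their last-layer hidden states would replace the random tensors before the classifier is trained end to end.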
Chen Jie, Ma Jing, Li Xiaofeng. Short-Text Classification Method with Text Features from Pre-trained Models[J]. Data Analysis and Knowledge Discovery, 2021, 5(9): 21-30.