[Objective] This paper fuses word vectors from different pre-trained models, such as Word2Vec and BERT, to enhance text semantic representation and thereby improve news text classification. [Methods] We used the BERT and ERNIE models to extract contextual semantics together with the prior knowledge of entities and phrases obtained through domain-adaptive pretraining. Combined with the TextCNN model, the proposed method generates high-order text feature vectors and merges these features to achieve semantic enhancement and better short-text classification. [Results] We evaluated the proposed method on public datasets from Today's Headlines (Toutiao) news and THUCNews. Compared with the traditional Word2Vec word-vector representation, the accuracy of the new model improved by 6.37% and 3.50% on the two datasets; compared with the BERT and ERNIE methods, it improved by 1.98% and 1.51% respectively. [Limitations] The news corpus used in this study needs to be further expanded. [Conclusions] The proposed method can effectively classify massive short-text data, which is of great significance for follow-up text mining.
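To make the architecture described above concrete, the following is a minimal PyTorch sketch (not the authors' released code) of feeding BERT and ERNIE token representations through separate TextCNN heads and concatenating the pooled features before classification. The checkpoint names "bert-base-chinese" and "nghuyong/ernie-1.0", the concatenation-based fusion, and the kernel sizes and filter counts are illustrative assumptions, not details confirmed by the paper.

```python
# Minimal sketch: fuse BERT- and ERNIE-derived token features via TextCNN heads.
# Checkpoint names, fusion by concatenation, and hyperparameters are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel


class TextCNNHead(nn.Module):
    """Convolutions with several kernel sizes over token embeddings, max-pooled."""

    def __init__(self, hidden_size, num_filters=128, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(hidden_size, num_filters, k) for k in kernel_sizes]
        )

    def forward(self, token_states):             # (batch, seq_len, hidden)
        x = token_states.transpose(1, 2)          # (batch, hidden, seq_len)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)           # (batch, num_filters * num_kernels)


class FusedClassifier(nn.Module):
    """Encode the text with BERT and ERNIE, pool each with a TextCNN head, concatenate, classify."""

    def __init__(self, num_classes, bert_name="bert-base-chinese",
                 ernie_name="nghuyong/ernie-1.0"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)
        self.ernie = AutoModel.from_pretrained(ernie_name)
        self.bert_cnn = TextCNNHead(self.bert.config.hidden_size)
        self.ernie_cnn = TextCNNHead(self.ernie.config.hidden_size)
        feature_dim = 128 * 3 * 2                 # two encoders, three kernel sizes each
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, bert_inputs, ernie_inputs):
        bert_states = self.bert(**bert_inputs).last_hidden_state
        ernie_states = self.ernie(**ernie_inputs).last_hidden_state
        fused = torch.cat(
            [self.bert_cnn(bert_states), self.ernie_cnn(ernie_states)], dim=1
        )
        return self.classifier(fused)              # (batch, num_classes)
```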
Chen Jie, Ma Jing, Li Xiaofeng. Short-Text Classification Method with Text Features from Pre-trained Models[J]. Data Analysis and Knowledge Discovery, 2021, 5(9): 21-30.
Yang Bin. Intelligent Judicial Research Based on BERT Word Vector and Attention-CNN[D]. Dalian: Dalian University of Technology, 2019.
Mathew J, Radhakrishnan D. An FIR Digital Filter Using One-Hot Coded Residue Representation [C]//Proceedings of the 10th European Signal Processing Conference. IEEE, 2000.
Mikolov T, Chen K, Corrado G, et al. Efficient Estimation of Word Representations in Vector Space[OL]. arXiv Preprint, arXiv: 1301.3781.
Pennington J, Socher R, Manning C. GloVe: Global Vectors for Word Representation [C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 2014: 1532-1543.
Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[OL]. arXiv Preprint, arXiv: 1810.04805.
Sun Y, Wang S H, Li Y K, et al. ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(5):8968-8975.
THUCTC: An Efficient Chinese Text Classification Toolkit [OL]. [2020-11-11]. http://thuctc.thunlp.org/.
Zhou Zhihua. Machine Learning[M]. Beijing: Tsinghua University Press, 2016.
Li Hang. Statistical Learning Methods[M]. Beijing: Tsinghua University Press, 2012.
Li S, Zhao Z, Hu R F, et al. Analogical Reasoning on Chinese Morphological and Semantic Relations [C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers). 2018: 138-143.
Clark K, Khandelwal U, Levy O, et al. What Does BERT Look At? An Analysis of BERT's Attention [C]//Proceedings of the 2nd BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. 2019: 276-286.
Paszke A, Gross S, Massa F, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library [C]//Proceedings of the 33rd Conference on Neural Information Processing Systems. 2019: 8024-8035.