Short-Text Classification Method with Text Features from Pre-trained Models
Chen Jie, Ma Jing, Li Xiaofeng
College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
|
|
Abstract [Objective] This paper enhances short-text semantics with word vectors from different pre-trained models, such as Word2Vec and BERT, in order to improve news classification. [Methods] We used the BERT and ERNIE models to extract contextual semantics, along with prior knowledge of entities and phrases obtained through domain-adaptive pre-training. Combined with the TextCNN model, the proposed method generated high-order text feature vectors and merged them to achieve semantic enhancement and better short-text classification. [Results] We evaluated the proposed method on public data sets from Today's Headlines (Toutiao) and THUCNews. Compared with the traditional Word2Vec word-vector representation, the accuracy of our model improved by 6.37% and 3.50% on the two data sets; compared with the BERT and ERNIE methods, accuracy improved by 1.98% and 1.51% respectively. [Limitations] The news corpus in our study needs to be further expanded. [Conclusions] The proposed method can effectively classify massive short-text data, which is of great significance to follow-up text mining.
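The fusion step described in the abstract can be illustrated with a minimal PyTorch sketch: token-level features from two pre-trained encoders are concatenated and fed to a TextCNN classifier. The class name `FusionTextCNN`, the hidden size of 768, and the filter settings are illustrative assumptions, not the paper's exact configuration; random tensors stand in for the BERT and ERNIE hidden states.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionTextCNN(nn.Module):
    """TextCNN over the concatenation of two pre-trained feature streams (a sketch)."""
    def __init__(self, hidden=768, n_classes=10, n_filters=100, kernel_sizes=(2, 3, 4)):
        super().__init__()
        # Each token vector is the concatenation of BERT-style and ERNIE-style features,
        # so the convolution input has 2 * hidden channels.
        self.convs = nn.ModuleList(
            nn.Conv1d(2 * hidden, n_filters, k) for k in kernel_sizes
        )
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, bert_feats, ernie_feats):
        # bert_feats, ernie_feats: (batch, seq_len, hidden)
        x = torch.cat([bert_feats, ernie_feats], dim=-1)  # semantic fusion
        x = x.transpose(1, 2)                             # (batch, 2*hidden, seq_len)
        # Convolve, then max-pool each feature map over the sequence dimension.
        pooled = [
            F.max_pool1d(F.relu(conv(x)), x.size(2) - conv.kernel_size[0] + 1).squeeze(2)
            for conv in self.convs
        ]
        return self.fc(torch.cat(pooled, dim=1))          # (batch, n_classes)

# Random tensors stand in for the encoders' last hidden states.
bert_out = torch.randn(4, 32, 768)
ernie_out = torch.randn(4, 32, 768)
logits = FusionTextCNN()(bert_out, ernie_out)
print(logits.shape)
```

In practice the two streams would come from the last hidden states of fine-tuned BERT and ERNIE encoders rather than random tensors.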
|
Received: 22 March 2021
Published: 29 June 2021
|
|
Fund: National Social Science Fund of China (20ZDA092); Fundamental Research Fund for the Central Universities (NW2020001); Fund for Graduate Innovation Base (Laboratory) (kfjj20200905)
Corresponding Author: Ma Jing
E-mail: majing5525@126.com
|