Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (9): 21-30    DOI: 10.11925/infotech.2096-3467.2021.0282
Short-Text Classification Method with Text Features from Pre-trained Models
Chen Jie, Ma Jing, Li Xiaofeng
College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
Abstract  

[Objective] This paper combines word vectors from different pre-trained models (Word2Vec, BERT, ERNIE, and others) to enrich text semantics and improve news classification. [Methods] We used BERT and ERNIE to extract contextual semantics and prior knowledge of entities and phrases, applying Domain-Adaptive Pretraining. Combined with the TextCNN model, the proposed method generates high-order text feature vectors and fuses them to achieve semantic enhancement and better short-text classification. [Results] We evaluated the proposed method on the public Today's Headlines (Toutiao) and THUCNews datasets. Compared with the traditional Word2Vec word-vector representation, the accuracy of our model improved by 6.37% and 3.50% respectively; compared with the BERT and ERNIE methods, it improved by 1.98% and 1.51% respectively. [Limitations] The news corpus used in this study needs to be further expanded. [Conclusions] The proposed method can effectively classify massive short-text data, which is of great significance for follow-up text mining.
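The fusion step described in the abstract can be sketched as follows. This is a minimal illustration only, assuming each pre-trained encoder yields one fixed-size feature vector per text; the function name and dimensions are hypothetical, not the authors' code:

```python
import numpy as np

def fuse_features(bert_vec: np.ndarray, ernie_vec: np.ndarray) -> np.ndarray:
    """Concatenate per-text feature vectors from two pre-trained encoders.

    Concatenation is one common fusion strategy; the fused vector can
    then feed a TextCNN or a linear classification head.
    """
    return np.concatenate([bert_vec, ernie_vec], axis=-1)

# Toy example: two 768-dimensional feature vectors fuse into one 1536-dim vector.
bert_vec = np.zeros(768)
ernie_vec = np.ones(768)
fused = fuse_features(bert_vec, ernie_vec)
print(fused.shape)  # (1536,)
```

Other fusion strategies (element-wise sum, weighted combination) are possible; concatenation preserves both representations at the cost of a wider downstream layer.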

Key words: BERT; ERNIE; Short Text Classification; Text Feature Fusion; Domain-Adaptive Pretraining
Received: 22 March 2021      Published: 29 June 2021
CLC number: TP393
Funds: National Social Science Fund of China (20ZDA092); Fundamental Research Fund for the Central Universities (NW2020001); Fund for Graduate Innovation Base (Laboratory) (kfjj20200905)
Corresponding Authors: Ma Jing     E-mail: majing5525@126.com

Cite this article:

Chen Jie, Ma Jing, Li Xiaofeng. Short-Text Classification Method with Text Features from Pre-trained Models. Data Analysis and Knowledge Discovery, 2021, 5(9): 21-30.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0282     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I9/21

Structure Diagram of BERT Model
ERNIE's Knowledge Masking Strategies
Difference of Random Masking Strategies Between BERT and ERNIE
Research Framework
Method of Extracting and Fusing Text Feature Vector
                    Predicted Positive      Predicted Negative
Actual Positive     True Positive (TP)      False Negative (FN)
Actual Negative     False Positive (FP)     True Negative (TN)
Confusion Matrix of the Binary Classification Problem
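The standard evaluation metrics follow directly from these four counts. A small self-contained sketch (our illustration, not the paper's code):

```python
def metrics(tp: int, fp: int, fn: int, tn: int):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)          # fraction of predicted positives that are correct
    recall = tp / (tp + fn)             # fraction of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Toy counts: 80 TP, 10 FP, 20 FN, 90 TN.
acc, p, r, f1 = metrics(tp=80, fp=10, fn=20, tn=90)
print(round(acc, 3), round(p, 3), round(r, 3), round(f1, 3))
```

For multi-class news classification, per-class F1 is computed one-vs-rest and then averaged, which is how the per-category F1 tables below are typically produced.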
Parameter                         BERT      ERNIE
Number of Encoder Layers          12        12
Hidden Size                       768       768
Attention Heads                   12        12
Vocabulary Size                   21,128    18,000
Hidden Activation (hidden_act)    GELU      ReLU
Padding Size                      32        32
Parameters of BERT and ERNIE
Parameter                     Value
Filter Sizes                  (2, 3, 4)
Number of Filters             256
Batch Size                    128
Dropout Rate                  0.4
Learning Rate                 5e-4
Optimizer                     Adam
Parameters of the TextCNN Network
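Given these hyperparameters, the size of TextCNN's convolutional part can be estimated. The sketch below assumes 768-dimensional input embeddings (matching the BERT/ERNIE hidden size above, an assumption on our part) and counts weights plus biases for each filter height:

```python
def conv_param_count(filter_sizes, num_filters, emb_dim):
    """Per filter height h: weights (h * emb_dim * num_filters) plus biases (num_filters)."""
    return sum(h * emb_dim * num_filters + num_filters for h in filter_sizes)

total = conv_param_count(filter_sizes=(2, 3, 4), num_filters=256, emb_dim=768)
print(total)  # 1770240 parameters in the convolutional layers
```

Each filter height produces 256 feature maps; max-pooling over each map yields a 3 x 256 = 768-dimensional representation before the dropout and output layers.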
Method      Today's Headlines Test Accuracy    THUCNews Test Accuracy
Method 1    81.73%                             87.93%
Method 2    86.55%                             89.92%
Method 3    86.12%                             89.99%
Method 4    88.06%                             91.43%
Method 5    88.10%                             91.13%
Test-Set Accuracy of the Five Methods on the Two Datasets
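The accuracy gains quoted in the abstract appear to be simple differences between rows of this table, with Method 1 as the Word2Vec baseline and the best-performing method per dataset as the proposed model (our reading of the excerpt, not stated explicitly here). The arithmetic can be checked directly:

```python
# Test-set accuracies (%) from the table above, keyed by method number.
toutiao  = {1: 81.73, 2: 86.55, 3: 86.12, 4: 88.06, 5: 88.10}
thucnews = {1: 87.93, 2: 89.92, 3: 89.99, 4: 91.43, 5: 91.13}

# Gain of the best method over Method 1 (the Word2Vec baseline) on each dataset.
gain_toutiao = round(max(toutiao.values()) - toutiao[1], 2)
gain_thucnews = round(max(thucnews.values()) - thucnews[1], 2)
print(gain_toutiao, gain_thucnews)  # 6.37 3.5 -- matching the 6.37% / 3.50% in the abstract
```

The 1.98% and 1.51% gains over BERT and ERNIE similarly correspond to differences between the best row and intermediate rows on each dataset.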
F1 of Each Category and Average F1 on the Today's Headlines Dataset
F1 of Each Category and Average F1 on the THUCNews Dataset