Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (6): 84-94    DOI: 10.11925/infotech.2096-3467.2021.1216
Classification Model for Long Texts with Attention Mechanism and Sentence Vector Compression
Ye Han, Sun Haichun, Li Xin, Jiao Kainan
School of Information and Cyber Security, People’s Public Security University of China, Beijing 102627, China
Abstract  

[Objective] This paper addresses the input length limit of pre-trained language models in order to improve the accuracy of long text classification. [Methods] We designed an algorithm that uses the punctuation in natural text to split it into sentences and feeds the segments into a pre-trained language model in order. We then compressed and encoded the resulting classification feature vectors with average pooling and an attention-weighted mechanism. Finally, we evaluated the algorithm with multiple pre-trained language models. [Results] Compared with directly truncating the text, the proposed method improved classification accuracy by up to 3.74%. With the attention mechanism, the classification F1-score on the two datasets increased by 1.61% and 0.83% on average, respectively. [Limitations] The improvements are not significant for some pre-trained language models. [Conclusions] The proposed model can effectively classify long texts without changing the architecture of the pre-trained language model.
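
The segmentation step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the punctuation set, the greedy packing of sentences into segments, and the names `split_sentences` and `pack_segments` are all assumptions. The 510-character limit mirrors the threshold used in the dataset statistics on this page.

```python
import re

# Assumed set of sentence-ending marks (Chinese and Western); the exact set
# used in the paper is not given on this page.
SENTENCE = re.compile(r"[^。！？!?；;]+[。！？!?；;]*")


def split_sentences(text):
    """Split text into sentences, keeping the trailing punctuation."""
    return SENTENCE.findall(text)


def pack_segments(sentences, max_len=510):
    """Greedily pack consecutive sentences into segments of at most max_len
    characters, preserving their order (an assumed strategy)."""
    segments, current = [], ""
    for sent in sentences:
        # A single sentence longer than max_len is hard-split as a fallback.
        while len(sent) > max_len:
            if current:
                segments.append(current)
                current = ""
            segments.append(sent[:max_len])
            sent = sent[max_len:]
        if len(current) + len(sent) > max_len:
            segments.append(current)
            current = sent
        else:
            current += sent
    if current:
        segments.append(current)
    return segments


if __name__ == "__main__":
    demo = "今天天气很好。我们去公园吧！好的"
    for segment in pack_segments(split_sentences(demo), max_len=10):
        print(segment)
```

Each segment would then be encoded separately by the pre-trained language model, giving one classification vector per segment.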

Keywords: Text Classification; Pre-trained Language Model; Feature Vector; Attention Mechanism; Text Segmentation
Received: 24 October 2021      Published: 25 January 2022
CLC Number: TP391
Fund: Ministry of Public Security Technology Research Program (2020JSYJC22); People’s Public Security University of China Basic Research Fund (2021JKF215)
Corresponding Authors: Li Xin     E-mail: lixin@ppsuc.edu.cn

Cite this article:

Ye Han, Sun Haichun, Li Xin, Jiao Kainan. Classification Model for Long Texts with Attention Mechanism and Sentence Vector Compression. Data Analysis and Knowledge Discovery, 2022, 6(6): 84-94.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.1216     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I6/84

Text Classification Model with Sentence Vector Compression
Sentence Vector Attention Weighted Average Mechanism
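
The two compression strategies named in the abstract, average pooling and attention-weighted averaging of the per-segment classification vectors, might look like the sketch below. It assumes that CLS-BOE in the result tables corresponds to plain averaging and CLS-ATT to attention weighting. The dot-product scoring against a `query` vector is likewise an assumption: in a trained model the query would be a learned parameter, and the paper's exact scoring function is not given on this page.

```python
import math


def average_pool(vectors):
    """Element-wise mean of the segment vectors (assumed CLS-BOE behaviour)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def attention_pool(vectors, query):
    """Softmax-weighted average of the segment vectors.

    Scores are dot products with `query`, a hypothetical stand-in for a
    learned attention parameter.
    """
    scores = [sum(v_d * q_d for v_d, q_d in zip(vec, query)) for vec in vectors]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, vectors))
        for d in range(len(query))
    ]
    return pooled, weights


if __name__ == "__main__":
    segment_vectors = [[1.0, 0.0], [0.0, 1.0]]  # toy 2-dimensional vectors
    print(average_pool(segment_vectors))
    print(attention_pool(segment_vectors, query=[2.0, 0.0]))
```

Either function reduces a variable number of segment vectors to one fixed-size vector, so the classifier on top does not need to change with document length.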
| Item | Configuration |
| --- | --- |
| Operating system | Ubuntu 18.04 |
| CPU | Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz × 2 |
| GPU | Nvidia Quadro RTX 6000 |
| RAM | 64 GB |
| AllenNLP version | 2.4.0 |
| PyTorch version | 1.7.1 |

Experiment Environment
| Split | Samples | Average length | Samples longer than 510 |
| --- | --- | --- | --- |
| Training set | 12,133 | 289.04 | 1,487 |
| Test set | 2,599 | 289.83 | 313 |

Statistics of the IFLYTEK' Dataset
| Split | Samples | Average length | Samples longer than 510 |
| --- | --- | --- | --- |
| Training set | 10,000 | 969.13 | 6,839 |
| Test set | 5,000 | 882.12 | 3,098 |

Statistics of the THUCNews Dataset
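| Model | Base architecture | Hidden layers | Hidden size | Attention heads |
| --- | --- | --- | --- | --- |
| BERT-base | BERT | 12 | 768 | 12 |
| ELECTRA-180g-small | ELECTRA | 12 | 256 | 4 |
| ERNIE-1.0 | ERNIE | 12 | 768 | 12 |
| ALBERT-tiny | ALBERT | 4 | 312 | 12 |
| RoBERTa-small-clue | RoBERTa | 4 | 512 | 8 |
| RBT3 | RoBERTa | 3 | 768 | 12 |
| RBT4 | RoBERTa | 4 | 768 | 12 |

Parameters of the Pre-trained Language Models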
| Model | Baseline Accuracy | Baseline F1 | CLS-BOE Accuracy | CLS-BOE F1 | CLS-ATT Accuracy | CLS-ATT F1 |
| --- | --- | --- | --- | --- | --- | --- |
| BERT-base | 0.5860 | 0.5721 | 0.5864 | 0.5701 | 0.5948 | 0.5815 |
| ERNIE-1.0 | 0.6044 | 0.5834 | 0.6006 | 0.5824 | 0.5964 | 0.5819 |
| ELECTRA-180g-small | 0.5625 | 0.5315 | 0.5686 | 0.5364 | 0.5652 | 0.5360 |
| ALBERT-tiny | 0.5541 | 0.5180 | 0.5641 | 0.5265 | 0.5748 | 0.5446 |
| RoBERTa-small | 0.5895 | 0.5578 | 0.5948 | 0.5699 | 0.5829 | 0.5583 |
| RBT3 | 0.5786 | 0.5467 | 0.5787 | 0.5606 | 0.5833 | 0.5641 |
| RBT4 | 0.5817 | 0.5615 | 0.5787 | 0.5491 | 0.5763 | 0.5650 |

Classification Results on the IFLYTEK' Dataset
| Model | Baseline Accuracy | Baseline F1 | CLS-BOE Accuracy | CLS-BOE F1 | CLS-ATT Accuracy | CLS-ATT F1 |
| --- | --- | --- | --- | --- | --- | --- |
| BERT-base | 0.9688 | 0.9687 | 0.9704 | 0.9701 | 0.9856 | 0.9855 |
| ERNIE-1.0 | 0.9788 | 0.9787 | 0.9766 | 0.9764 | 0.9858 | 0.9857 |
| ELECTRA-180g-small | 0.9570 | 0.9562 | 0.9732 | 0.9731 | 0.9754 | 0.9752 |
| ALBERT-tiny | 0.9652 | 0.9648 | 0.9476 | 0.9480 | 0.9646 | 0.9645 |
| RoBERTa-small | 0.9768 | 0.9767 | 0.9758 | 0.9757 | 0.9688 | 0.9687 |
| RBT3 (h:768) | 0.9710 | 0.9706 | 0.9746 | 0.9746 | 0.9848 | 0.9847 |
| RBT4 (h:768) | 0.9698 | 0.9697 | 0.9746 | 0.9744 | 0.9774 | 0.9773 |

Classification Results on the THUCNews Dataset
Accuracy and F1-score Comparison Between the Baseline and Improved Models
The Accuracy on IFLYTEK' Dataset
The Accuracy on THUCNews Dataset
| Pre-trained language model | CLS-BOE vs. Baseline | CLS-ATT vs. Baseline |
| --- | --- | --- |
| BERT | 0.07% | 1.50% |
| ERNIE-1.0 | -0.63% | -1.32% |
| ELECTRA-180g-small | 1.08% | 0.48% |
| ALBERT | 1.80% | 3.74% |
| RoBERTa | 0.90% | -1.12% |
| RBT3 | 0.02% | 0.81% |
| RBT4 | -0.52% | -0.93% |

Accuracy Improvement on the IFLYTEK' Dataset
| Pre-trained language model | CLS-BOE vs. Baseline | CLS-ATT vs. Baseline |
| --- | --- | --- |
| BERT | 0.17% | 1.73% |
| ERNIE-1.0 | -0.22% | 0.72% |
| ELECTRA-180g-small | 1.69% | 1.92% |
| ALBERT | -1.82% | -0.06% |
| RoBERTa | -0.10% | -0.82% |
| RBT3 | 0.37% | 1.42% |
| RBT4 | 0.49% | 0.78% |

Accuracy Improvement on the THUCNews Dataset
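| Pre-trained language model | F1-score improvement on IFLYTEK' | F1-score improvement on THUCNews |
| --- | --- | --- |
| BERT | 1.64% | 1.73% |
| ERNIE-1.0 | -0.26% | 0.72% |
| ELECTRA-180g-small | 0.85% | 1.99% |
| ALBERT | 5.14% | -0.03% |
| RoBERTa | 0.09% | -0.82% |
| RBT3 | 3.18% | 1.45% |
| RBT4 | 0.62% | 0.78% |
| (Average) | 1.61% | 0.83% |

Relative F1-score Improvement of CLS-ATT over the Baseline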
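
The percentages in the improvement tables appear to be relative gains, that is, (improved − baseline) / baseline. This reading is inferred from the numbers rather than stated on this page. The short check below recomputes the IFLYTEK' F1 column for CLS-ATT from the classification results listed earlier; the helper name `relative_gain` is illustrative only.

```python
# (baseline F1, CLS-ATT F1) on the IFLYTEK' dataset, copied from the
# classification results table on this page.
F1_IFLYTEK = {
    "BERT-base": (0.5721, 0.5815),
    "ERNIE-1.0": (0.5834, 0.5819),
    "ELECTRA-180g-small": (0.5315, 0.5360),
    "ALBERT-tiny": (0.5180, 0.5446),
    "RoBERTa-small": (0.5578, 0.5583),
    "RBT3": (0.5467, 0.5641),
    "RBT4": (0.5615, 0.5650),
}


def relative_gain(baseline, improved):
    """Relative improvement over the baseline, in percent."""
    return (improved - baseline) / baseline * 100.0


gains = {
    model: relative_gain(base, att) for model, (base, att) in F1_IFLYTEK.items()
}
mean_gain = sum(gains.values()) / len(gains)

if __name__ == "__main__":
    for model, gain in gains.items():
        print(f"{model}: {gain:+.2f}%")
    print(f"mean: {mean_gain:+.2f}%")
```

The same function applied to the accuracy columns reproduces the accuracy improvement tables, including the 3.74% maximum for ALBERT on IFLYTEK'.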