Data Analysis and Knowledge Discovery (数据分析与知识发现), 2022, Vol. 6, Issue 6: 84-94
Research Paper
Classification Model for Long Texts with Attention Mechanism and Sentence Vector Compression
Ye Han, Sun Haichun, Li Xin, Jiao Kainan
School of Information and Cyber Security, People’s Public Security University of China, Beijing 102627, China

[Objective] This paper addresses the input length limit of pre-trained language models in order to improve the accuracy of long text classification. [Methods] We designed a classification model that splits a text into sentences at the punctuation marks found in natural text and feeds the sentences into a pre-trained language model in order. We then compressed the classification feature vectors with two methods, average pooling of the sentence vectors and attention-based weighting, and ran experiments on multiple pre-trained language models. [Results] Compared with directly truncating the text, the models using sentence vector compression improved accuracy by up to 3.74% relative to the truncation baseline. On the two datasets, the F1-score of the model with the attention mechanism improved over the baseline by 1.61% and 0.83% on average, respectively. [Limitations] The improvement is not significant on some pre-trained language models. [Conclusions] Without changing the architecture of the pre-trained language model, a text classification model that combines the content of the individual sentences can effectively improve classification performance across different pre-trained language models.
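
The preprocessing step described in the abstract (splitting at punctuation and keeping the sentences in order, as opposed to truncating the text) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the punctuation set, the function names, and the 510-character cap (borrowed from the length threshold used in the dataset statistics) are all hypothetical choices.

```python
import re

# Assumption: sentence-ending marks; the paper's exact punctuation set is not
# given on this page. The ASCII period is left out to avoid splitting decimals.
SENTENCE_PATTERN = re.compile(r"[^。！？；!?;]+[。！？；!?;]*")


def truncate(text, max_len=510):
    """Baseline behaviour: keep only the first max_len characters."""
    return text[:max_len]


def split_sentences(text, max_len=510):
    """Split after sentence-ending punctuation, preserving order.

    A sentence longer than max_len is hard-wrapped into max_len-sized pieces
    so that every piece fits the model input on its own.
    """
    sentences = []
    for piece in SENTENCE_PATTERN.findall(text):
        piece = piece.strip()
        for start in range(0, len(piece), max_len):
            sentences.append(piece[start:start + max_len])
    return sentences


if __name__ == "__main__":
    sample = "今天天气很好。我们去公园吧！好吗"
    print(split_sentences(sample))
    print(len(truncate("a" * 1200)))
```

Each returned sentence would then be encoded separately, so no part of the text is discarded the way it is under truncation.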

Keywords: Text Classification; Pre-trained Language Model; Feature Vector; Attention Mechanism; Text Segmentation
Received: 2021-10-24      Published: 2022-01-25
Chinese Library Classification (ZTFLH):  TP391
Corresponding author: Li Xin (李欣)
Cite this article: Ye Han, Sun Haichun, Li Xin, Jiao Kainan. Classification Model for Long Texts with Attention Mechanism and Sentence Vector Compression[J]. Data Analysis and Knowledge Discovery, 2022, 6(6): 84-94.
(Original Chinese citation: 叶瀚, 孙海春, 李欣, 焦凯楠. 融合注意力机制与句向量压缩的长文本分类模型[J]. 数据分析与知识发现, 2022, 6(6): 84-94.)
Fig. 1  Text classification model with sentence vector compression
Fig. 2  Attention-based weighted averaging of sentence vectors
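
The two compression methods named in the abstract, average pooling of sentence vectors and attention-based weighting, can be illustrated with a small dependency-free sketch. The scoring function here (dot product with a query vector, followed by softmax) is an assumption made for illustration; this page does not give the paper's exact formulation, and in a trained model the query would be learned instead of fixed. The function names are hypothetical.

```python
import math


def mean_pool(sentence_vectors):
    """Average pooling: every sentence vector gets the same weight."""
    n = len(sentence_vectors)
    dim = len(sentence_vectors[0])
    return [sum(v[i] for v in sentence_vectors) / n for i in range(dim)]


def attention_pool(sentence_vectors, query):
    """Attention-weighted pooling (assumed dot-product scoring).

    Returns the pooled vector and the softmax weights over sentences.
    """
    scores = [sum(q * x for q, x in zip(query, v)) for v in sentence_vectors]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(sentence_vectors[0])
    pooled = [
        sum(w * v[i] for w, v in zip(weights, sentence_vectors))
        for i in range(dim)
    ]
    return pooled, weights


if __name__ == "__main__":
    vectors = [[1.0, 2.0], [3.0, 4.0]]  # toy stand-ins for sentence vectors
    print(mean_pool(vectors))
    print(attention_pool(vectors, query=[1.0, 1.0]))
```

With an all-zero query the softmax weights are uniform, so attention pooling reduces to average pooling; a non-zero query shifts weight toward the sentences it scores highest.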
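| Item | Configuration |
|---|---|
| Operating system | Ubuntu 18.04 |
| CPU | Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz × 2 |
| GPU | Nvidia Quadro RTX 6000 |
| RAM | 64 GB |
| AllenNLP version | 2.4.0 |
| PyTorch version | 1.7.1 |

Table 1  Experimental environment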
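| Split | Samples | Average length | Samples longer than 510 |
|---|---|---|---|
| Training set | 12,133 | 289.04 | 1,487 |
| Test set | 2,599 | 289.83 | 313 |

Table 2  Statistics of the IFLYTEK' long text classification dataset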
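| Split | Samples | Average length | Samples longer than 510 |
|---|---|---|---|
| Training set | 10,000 | 969.13 | 6,839 |
| Test set | 5,000 | 882.12 | 3,098 |

Table 3  Statistics of the THUCNews long text classification dataset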
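
To see how much text a truncation baseline can lose on each dataset, the share of samples longer than 510 can be computed directly from the counts in Tables 2 and 3. The snippet below is a small worked calculation; only the variable and function names are introduced here.

```python
# (total samples, samples longer than 510), copied from Tables 2 and 3
STATS = {
    ("IFLYTEK'", "train"): (12133, 1487),
    ("IFLYTEK'", "test"): (2599, 313),
    ("THUCNews", "train"): (10000, 6839),
    ("THUCNews", "test"): (5000, 3098),
}


def long_share(total, long_count):
    """Percentage of samples whose length exceeds the 510 threshold."""
    return 100.0 * long_count / total


if __name__ == "__main__":
    for (dataset, split), (total, long_count) in STATS.items():
        share = long_share(total, long_count)
        print(f"{dataset:9s} {split:5s} {share:6.2f}% longer than 510")
```

By these counts roughly one in eight IFLYTEK' samples exceeds the threshold, against well over half of the THUCNews samples.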
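| Model | Base architecture | Hidden layers | Hidden size | Attention heads |
|---|---|---|---|---|
| BERT-base | BERT | 12 | 768 | 12 |
| ELECTRA-180g-small | ELECTRA | 12 | 256 | 4 |
| ERNIE-1.0 | ERNIE | 12 | 768 | 12 |
| ALBERT-tiny | ALBERT | 4 | 312 | 12 |
| RoBERTa-small-clue | RoBERTa | 4 | 512 | 8 |
| RBT3 | RoBERTa | 3 | 768 | 12 |
| RBT4 | RoBERTa | 4 | 768 | 12 |

Table 4  Parameters of the pre-trained language models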
| Model | Baseline Accuracy | Baseline F1 | CLS-BOE Accuracy | CLS-BOE F1 | CLS-ATT Accuracy | CLS-ATT F1 |
|---|---|---|---|---|---|---|
| BERT-base | 0.5860 | 0.5721 | 0.5864 | 0.5701 | 0.5948 | 0.5815 |
| ERNIE-1.0 | 0.6044 | 0.5834 | 0.6006 | 0.5824 | 0.5964 | 0.5819 |
| ELECTRA-180g-small | 0.5625 | 0.5315 | 0.5686 | 0.5364 | 0.5652 | 0.5360 |
| ALBERT-tiny | 0.5541 | 0.5180 | 0.5641 | 0.5265 | 0.5748 | 0.5446 |
| RoBERTa-small | 0.5895 | 0.5578 | 0.5948 | 0.5699 | 0.5829 | 0.5583 |
| RBT3 | 0.5786 | 0.5467 | 0.5787 | 0.5606 | 0.5833 | 0.5641 |
| RBT4 | 0.5817 | 0.5615 | 0.5787 | 0.5491 | 0.5763 | 0.5650 |

Table 5  Experimental results on the IFLYTEK' long text classification dataset
| Model | Baseline Accuracy | Baseline F1 | CLS-BOE Accuracy | CLS-BOE F1 | CLS-ATT Accuracy | CLS-ATT F1 |
|---|---|---|---|---|---|---|
| BERT-base | 0.9688 | 0.9687 | 0.9704 | 0.9701 | 0.9856 | 0.9855 |
| ERNIE-1.0 | 0.9788 | 0.9787 | 0.9766 | 0.9764 | 0.9858 | 0.9857 |
| ELECTRA-180g-small | 0.9570 | 0.9562 | 0.9732 | 0.9731 | 0.9754 | 0.9752 |
| ALBERT-tiny | 0.9652 | 0.9648 | 0.9476 | 0.9480 | 0.9646 | 0.9645 |
| RoBERTa-small | 0.9768 | 0.9767 | 0.9758 | 0.9757 | 0.9688 | 0.9687 |
| RBT3 (h:768) | 0.9710 | 0.9706 | 0.9746 | 0.9746 | 0.9848 | 0.9847 |
| RBT4 (h:768) | 0.9698 | 0.9697 | 0.9746 | 0.9744 | 0.9774 | 0.9773 |

Table 6  Experimental results on the THUCNews long text classification dataset
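Fig. 3  Accuracy and F1-score gains of the improved models
Fig. 4  Classification accuracy on the IFLYTEK' long text classification dataset
Fig. 5  Classification accuracy on the THUCNews text classification dataset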
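| Pre-trained language model | CLS-BOE vs. Baseline (relative) | CLS-ATT vs. Baseline (relative) |
|---|---|---|
| BERT | 0.07% | 1.50% |
| ERNIE-1.0 | -0.63% | -1.32% |
| ELECTRA-180g-small | 1.08% | 0.48% |
| ALBERT | 1.80% | 3.74% |
| RoBERTa | 0.90% | -1.12% |
| RBT3 | 0.02% | 0.81% |
| RBT4 | -0.52% | -0.93% |

Table 7  Accuracy improvement on the IFLYTEK' text classification dataset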
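| Pre-trained language model | CLS-BOE vs. Baseline (relative) | CLS-ATT vs. Baseline (relative) |
|---|---|---|
| BERT | 0.17% | 1.73% |
| ERNIE-1.0 | -0.22% | 0.72% |
| ELECTRA-180g-small | 1.69% | 1.92% |
| ALBERT | -1.82% | -0.06% |
| RoBERTa | -0.10% | -0.82% |
| RBT3 | 0.37% | 1.42% |
| RBT4 | 0.49% | 0.78% |

Table 8  Accuracy improvement on the THUCNews long text classification dataset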
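
The percentages in Tables 7 and 8 match the relative change of each accuracy over its Baseline value, not the difference in percentage points. The short check below recomputes two rows from the accuracies in Tables 5 and 6; the function names are introduced here for illustration.

```python
def relative_gain(baseline, improved):
    """Relative change over the baseline, in percent (two decimals)."""
    return round(100.0 * (improved - baseline) / baseline, 2)


def absolute_gain_points(baseline, improved):
    """Plain difference in percentage points (two decimals)."""
    return round(100.0 * (improved - baseline), 2)


# Accuracy as (Baseline, CLS-BOE, CLS-ATT), copied from Tables 5 and 6
ROWS = {
    "ALBERT on IFLYTEK'": (0.5541, 0.5641, 0.5748),
    "BERT on THUCNews": (0.9688, 0.9704, 0.9856),
}

if __name__ == "__main__":
    for name, (base, boe, att) in ROWS.items():
        print(name)
        print("  CLS-BOE relative gain (%):", relative_gain(base, boe))
        print("  CLS-ATT relative gain (%):", relative_gain(base, att))
        print("  CLS-ATT gain in points:   ", absolute_gain_points(base, att))
```

For ALBERT on IFLYTEK', the largest entry in Table 7, the relative gain of 3.74% corresponds to an absolute gain of 2.07 percentage points.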
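| Pre-trained language model | F1-score improvement on IFLYTEK' (relative) | F1-score improvement on THUCNews (relative) |
|---|---|---|
| BERT | 1.64% | 1.73% |
| ERNIE-1.0 | -0.26% | 0.72% |
| ELECTRA-180g-small | 0.85% | 1.99% |
| ALBERT | 5.14% | -0.03% |
| RoBERTa | 0.09% | -0.82% |
| RBT3 | 3.18% | 1.45% |
| RBT4 | 0.62% | 0.78% |
| (Average) | 1.61% | 0.83% |

Table 9  F1-score improvement of the CLS-ATT model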
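
The averages in the last row of Table 9 are the unweighted means of the seven per-model gains. A short check, using only the values listed in the table (the variable names are introduced here):

```python
from statistics import mean

# Per-model F1-score gains of CLS-ATT in percent, copied from Table 9
# (order: BERT, ERNIE-1.0, ELECTRA-180g-small, ALBERT, RoBERTa, RBT3, RBT4)
IFLYTEK_GAINS = [1.64, -0.26, 0.85, 5.14, 0.09, 3.18, 0.62]
THUCNEWS_GAINS = [1.73, 0.72, 1.99, -0.03, -0.82, 1.45, 0.78]


def average_gain(gains):
    """Unweighted mean over models, rounded to two decimals."""
    return round(mean(gains), 2)


if __name__ == "__main__":
    print("IFLYTEK' average F1 gain (%):", average_gain(IFLYTEK_GAINS))
    print("THUCNews average F1 gain (%):", average_gain(THUCNEWS_GAINS))
```

On IFLYTEK' the average is pulled up by two models (ALBERT and RBT3), while ERNIE-1.0 shows a small decrease, which is consistent with the limitation stated in the abstract.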