Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (10): 1-14     https://doi.org/10.11925/infotech.2096-3467.2021.0228
Research Paper
Chinese Text Classification with Feature Fusion* (基于多特征融合的中文文本分类研究)
Wang Yan¹ (王艳), Wang Huyan² (王胡燕), Yu Bengong²,³ (余本功)
¹ Economic and Technical College, Anhui Agricultural University, Hefei 231200, China
² School of Management, Hefei University of Technology, Hefei 230009, China
³ Key Laboratory of Process Optimization & Intelligent Decision-Making, Ministry of Education, Hefei University of Technology, Hefei 230009, China
Abstract

[Objective] Chinese texts are often weakly structured and contain spelling errors and many homophones. This paper combines Pinyin character features, Chinese character features, word-level semantic features, and part-of-speech (POS) features to mitigate these problems, enrich the semantic representation, and improve classification performance. [Methods] The proposed multi-feature fusion method starts from word-level features and adds POS features, Chinese character features, and Pinyin character features to build a multi-feature semantic representation. A BiGRU then extracts contextual semantic features and a multi-channel CNN extracts local semantic features; these features are fused and passed to a softmax layer, which predicts the category label. [Results] On two datasets, the multi-feature fusion model reached accuracies of 83.3% and 91.1%, at least 7 percentage points higher than the other classification models compared. [Limitations] The experimental data are limited in size, and the model has not been validated on more datasets. [Conclusions] The proposed method improves the model's semantic representation ability and is an effective text classification model that can support efficient text classification in enterprises.

Keywords: Part of Speech Tag; Word Level Characteristics; Text Classification; Pinyin Character Features; Chinese Character Features
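
The four feature views named in the abstract can be illustrated with a minimal sketch. It is not the authors' preprocessing code: the lookup tables `TOY_POS` and `TOY_PINYIN`, the class `MultiFeatureSample`, and the function `build_features` are hypothetical names introduced here, and a real pipeline would replace the hand-written tables with a word segmenter, a POS tagger, and a Pinyin converter.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical toy resources (assumptions for illustration only).
TOY_POS: Dict[str, str] = {"文本": "n", "分类": "v", "模型": "n"}
TOY_PINYIN: Dict[str, str] = {
    "文": "wen", "本": "ben", "分": "fen", "类": "lei",
    "模": "mo", "型": "xing", "形": "xing",
}


@dataclass
class MultiFeatureSample:
    words: List[str]           # word-level view
    pos_tags: List[str]        # part-of-speech view, one tag per word
    hanzi: List[str]           # Chinese character view
    pinyin_letters: List[str]  # Pinyin character (letter) view


def build_features(words: List[str]) -> MultiFeatureSample:
    """Derive the four parallel views from an already segmented sentence."""
    pos_tags = [TOY_POS.get(word, "x") for word in words]  # "x" = unknown tag
    hanzi = [char for word in words for char in word]
    pinyin_letters = [
        letter for char in hanzi for letter in TOY_PINYIN.get(char, "?")
    ]
    return MultiFeatureSample(words, pos_tags, hanzi, pinyin_letters)


sample = build_features(["文本", "分类", "模型"])
# "模形" is a homophone misspelling of "模型"; its word is unknown to the toy
# tagger, but its Pinyin letters are unchanged.
misspelled = build_features(["文本", "分类", "模形"])

print(sample.pos_tags)
print("".join(sample.pinyin_letters))
print(sample.pinyin_letters == misspelled.pinyin_letters)
```

In this toy case the misspelled input yields a different word and character sequence but the same Pinyin letter sequence, which is one plausible reading of why the Pinyin view helps with spelling errors and homophones.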
Received: 2021-03-08      Published: 2021-07-01
CLC number (ZTFLH):  G350
Funding: *National Natural Science Foundation of China (71671057)
Corresponding author: Wang Huyan, ORCID: 0000-0001-8267-6183     E-mail: 1115419302@qq.com
Cite this article:
Wang Yan, Wang Huyan, Yu Bengong. Chinese Text Classification with Feature Fusion. Data Analysis and Knowledge Discovery, 2021, 5(10): 1-14.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0228      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I10/1
Fig.1  Structure of the multi-feature fusion text classification model
Fig.2  CBOW model structure
Fig.3  GRU node structure
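
For orientation alongside the GRU node caption, one common formulation of the GRU update is given here. The symbols $W$, $U$, and $b$ are generic parameter names and are assumptions, not the paper's notation; some references swap the roles of $z_t$ and $1 - z_t$ in the last line.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) && \text{(update gate)} \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) && \text{(candidate state)} \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(new hidden state)}
\end{aligned}
```

A bidirectional GRU runs one such recurrence left to right and another right to left, and typically concatenates the two hidden states at each position.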
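Table 1  Experimental environment configuration

| Environment | Configuration |
|---|---|
| Processor | Intel(R) Core(TM) I5-4200U CPU @1.6GHz |
| Graphics card | NVIDIA GeForce GT 740M |
| Memory | 12GB |
| IDE, language | PyCharm, Python 3.7 |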
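Table 2  Dataset information

| Item | Computer patents | Sogou News |
|---|---|---|
| Source | SooPAR patents | Sogou Labs (open source) |
| Number of classes | 5 | 5 |
| Number of samples | 10 000 | 10 000 |
| Average length (characters) | 210 | 843 |
| Minimum length (characters) | 150 | 30 |
| Maximum length (characters) | 300 | 400 |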
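Table 3  Model parameter settings

| Parameter | Value |
|---|---|
| Convolution kernel widths | {1,3,5} |
| Number of convolution kernels | 64 |
| GRU units | 100 |
| Batch size | 32 |
| Epochs | 20 |
| Optimizer | Adam |
| Dropout rate | 0.25 |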
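
To make the kernel-width setting in Table 3 concrete, the following pure-Python sketch applies convolution kernels of widths 1, 3, and 5 to a short sequence and keeps the maximum response of each. It is a sketch under stated assumptions, not the authors' implementation: the ReLU activation, the absence of padding, max-over-time pooling, and all function names are choices made here for illustration, and a single all-ones kernel per width stands in for the 64 kernels listed in the table.

```python
from typing import Dict, List, Sequence

Vector = List[float]

KERNEL_WIDTHS = (1, 3, 5)  # from Table 3
NUM_KERNELS = 64           # from Table 3 (assumed here to mean 64 per width)


def conv1d_valid(seq: Sequence[Vector], kernel: Sequence[Vector]) -> List[float]:
    """Slide one kernel (width x embedding_dim) over the sequence, no padding."""
    width = len(kernel)
    outputs = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        activation = sum(
            x * w
            for vec, kernel_vec in zip(window, kernel)
            for x, w in zip(vec, kernel_vec)
        )
        outputs.append(max(0.0, activation))  # ReLU, an assumed activation
    return outputs


def multi_width_features(
    seq: Sequence[Vector], kernels_by_width: Dict[int, List[List[Vector]]]
) -> List[float]:
    """Max-over-time pooled response of every kernel, ordered by width."""
    features = []
    for width in sorted(kernels_by_width):
        for kernel in kernels_by_width[width]:
            features.append(max(conv1d_valid(seq, kernel)))
    return features


# Toy input: 6 tokens with 2-dimensional embeddings.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
# One all-ones kernel per width keeps the arithmetic easy to follow.
toy_kernels = {w: [[[1.0, 1.0] for _ in range(w)]] for w in KERNEL_WIDTHS}

features = multi_width_features(sequence, toy_kernels)
print(features)
```

If each width has 64 kernels, pooling in this way would give 3 × 64 = 192 local features per text before fusion with the BiGRU output; how the paper combines them is not specified on this page.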
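Table 4  Comparison of multi-feature fusion models (computer patents)

| Dataset | Model | Acc | P | R | F1 |
|---|---|---|---|---|---|
| Computer patents | POS-BiGRUCNN | 57.3% | 59.4% | 58.1% | 60.0% |
| Computer patents | PY-BiGRUCNN | 64.1% | 65.3% | 63.5% | 64.1% |
| Computer patents | HZ-BiGRUCNN | 65.2% | 62.7% | 62.6% | 62.2% |
| Computer patents | Word-BiGRUCNN | 76.1% | 76.9% | 76.2% | 76.5% |
| Computer patents | POS-Word-BiGRUCNN | 78.1% | 77.5% | 77.1% | 77.3% |
| Computer patents | PY-Word-BiGRUCNN | 79.1% | 78.8% | 78.5% | 77.5% |
| Computer patents | HZ-Word-BiGRUCNN | 80.2% | 79.5% | 79.3% | 79.5% |
| Computer patents | PY-POS-Word-BiGRUCNN | 81.3% | 80.3% | 78.6% | 79.4% |
| Computer patents | HZ-POS-Word-BiGRUCNN | 81.2% | 81.3% | 80.5% | 80.9% |
| Computer patents | PY-HZ-Word-BiGRUCNN | 82.2% | 82.3% | 81.9% | 80.9% |
| Computer patents | PY-POS-HZ-Word-BiGRUCNN (this paper) | 83.3% | 83.6% | 82.9% | 83.4% |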
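Table 5  Comparison of multi-feature fusion models (Sohu News)

| Dataset | Model | Acc | P | R | F1 |
|---|---|---|---|---|---|
| Sohu News | POS-BiGRUCNN | 64.9% | 61.3% | 62.6% | 63.3% |
| Sohu News | PY-BiGRUCNN | 72.3% | 71.9% | 69.8% | 71.2% |
| Sohu News | HZ-BiGRUCNN | 75.2% | 72.7% | 72.6% | 72.2% |
| Sohu News | Word-BiGRUCNN | 83.6% | 81.6% | 80.1% | 79.8% |
| Sohu News | POS-Word-BiGRUCNN | 85.1% | 84.2% | 81.9% | 82.1% |
| Sohu News | PY-Word-BiGRUCNN | 86.0% | 84.8% | 82.2% | 83.2% |
| Sohu News | HZ-Word-BiGRUCNN | 87.2% | 85.7% | 84.3% | 84.2% |
| Sohu News | PY-POS-Word-BiGRUCNN | 89.1% | 87.6% | 89.3% | 87.5% |
| Sohu News | HZ-POS-Word-BiGRUCNN | 89.4% | 89.6% | 89.3% | 89.5% |
| Sohu News | PY-HZ-Word-BiGRUCNN | 90.2% | 89.6% | 89.9% | 89.1% |
| Sohu News | PY-POS-HZ-Word-BiGRUCNN (this paper) | 91.1% | 91.3% | 90.8% | 89.7% |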
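
The Acc, P, R, and F1 columns can be computed from predictions as sketched below. This page does not say how the per-class scores were averaged over the five classes, so macro-averaging is an assumption here, and `macro_metrics` is a hypothetical helper name rather than anything taken from the paper.

```python
from typing import Dict, Hashable, Sequence


def macro_metrics(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall, and F1."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    pairs = list(zip(y_true, y_pred))
    labels = set(y_true) | set(y_pred)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels for three classes (not data from the paper).
scores = macro_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
print({name: round(value, 3) for name, value in scores.items()})
```

With these toy labels the macro F1 is about 0.656, while the harmonic mean of the macro precision and macro recall is about 0.693, so under macro-averaging the F1 column is not simply derived from the P and R columns.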
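Table 6  Comparison with baseline models

| Model | Acc | P | R | F1 |
|---|---|---|---|---|
| LSTM | 74.4% | 74.7% | 74.7% | 74.9% |
| GRU | 75.0% | 74.4% | 75.7% | 75.1% |
| BiGRU | 75.2% | 75.5% | 75.5% | 75.6% |
| CNN | 73.7% | 72.1% | 71.5% | 71.6% |
| PY-POS-HZ-Word-BGRU (this paper) | 83.3% | 83.6% | 82.9% | 83.4% |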
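
The gain over the baselines is easiest to read in percentage points. The short calculation below uses only the accuracy values from Table 6, whose last row matches the computer-patent result in Table 4; the variable names are illustrative.

```python
# Accuracy values (in percent) copied from Table 6.
baseline_acc = {"LSTM": 74.4, "GRU": 75.0, "BiGRU": 75.2, "CNN": 73.7}
proposed_acc = 83.3

# Strongest baseline by accuracy.
best_name, best_acc = max(baseline_acc.items(), key=lambda item: item[1])

# Absolute difference in percentage points versus relative change in percent.
point_gain = proposed_acc - best_acc
relative_gain = point_gain / best_acc * 100

print(f"best baseline: {best_name} ({best_acc}%)")
print(f"gain: {point_gain:.1f} percentage points, {relative_gain:.1f}% relative")
```

The two figures are different quantities: the gain over BiGRU is 8.1 percentage points in absolute terms, which corresponds to a relative improvement of roughly 10.8%.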
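Fig.4  Experimental results with a single filter
Fig.5  Experimental results with two filters
Fig.6  Experimental results with three filters combined
Fig.7  Effect of word vector dimension on the model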