Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (10): 1-14    DOI: 10.11925/infotech.2096-3467.2021.0228
Chinese Text Classification with Feature Fusion
Wang Yan1, Wang Huyan2, Yu Bengong2,3
1Economic and Technical College, Anhui Agricultural University, Hefei 231200, China
2School of Management, Hefei University of Technology, Hefei 230009, China
3Key Laboratory of Process Optimization & Intelligent Decision-Making, Ministry of Education, Hefei University of Technology, Hefei 230009, China
Abstract  

[Objective] This paper proposes a new classification model for Chinese texts, aiming to address weak structure, spelling errors, and homonyms in the texts. [Methods] First, we constructed a multi-feature fusion method that builds on the traditional feature fusion model for text classification. Second, we combined word-level features, part-of-speech extension features, Chinese character features, and Pinyin character features to create a multi-feature semantic representation. Third, we fed this representation into a BiGRU to obtain contextual semantics, which a multi-channel CNN then processed to extract the main features. Finally, we merged these features and passed them to a softmax layer to predict the required category labels. [Results] The accuracy of our multi-feature fusion model reached 83.3% and 91.1% on the two datasets, about 7% higher than the existing model. [Limitations] More research is needed to examine the model with larger datasets. [Conclusions] The proposed model can effectively perform Chinese text classification tasks.
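
The multi-feature representation described in the abstract can be illustrated with a minimal sketch. The abstract does not state the fusion operator, so per-token concatenation is an assumption here, as are the toy embedding tables, the POS tags, the syllable-level Pinyin units, and the helper names `mean_vec` and `fuse_token`. In practice the vectors would be trained rather than hard-coded.

```python
# Hypothetical toy embedding tables (assumption: real vectors would be learned).
WORD_VECS = {"文本": [0.1, 0.2], "分类": [0.3, 0.4]}
POS_VECS = {"n": [1.0], "v": [0.0]}
CHAR_VECS = {"文": [0.5], "本": [0.6], "分": [0.7], "类": [0.8]}
PINYIN_VECS = {"wen": [0.9], "ben": [0.8], "fen": [0.7], "lei": [0.6]}


def mean_vec(vectors):
    """Average a non-empty list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fuse_token(word, pos, pinyin):
    """Concatenate word, POS, averaged character, and averaged Pinyin vectors.

    Concatenation is an assumed fusion operator, chosen only for illustration.
    """
    char_part = mean_vec([CHAR_VECS[ch] for ch in word])
    pinyin_part = mean_vec([PINYIN_VECS[syl] for syl in pinyin])
    return WORD_VECS[word] + POS_VECS[pos] + char_part + pinyin_part


# Each token carries its word, a POS tag, and its Pinyin syllables.
tokens = [("文本", "n", ["wen", "ben"]), ("分类", "v", ["fen", "lei"])]
fused = [fuse_token(word, pos, pinyin) for word, pos, pinyin in tokens]

for (word, _, _), vector in zip(tokens, fused):
    print(word, len(vector), vector)
```

Each fused vector has one slice per feature type, so a downstream sequence model (the BiGRU in the abstract) sees all four views of a token at every time step.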

Key words: Part of Speech Tag; Word Level Characteristics; Text Classification; Pinyin Character Features; Chinese Character Features
Received: 08 March 2021      Published: 01 July 2021
CLC Number: G350
Fund: National Natural Science Foundation of China (71671057)
Corresponding Author: Wang Huyan, ORCID: 0000-0001-8267-6183; E-mail: 1115419302@qq.com

Cite this article:

Wang Yan, Wang Huyan, Yu Bengong. Chinese Text Classification with Feature Fusion. Data Analysis and Knowledge Discovery, 2021, 5(10): 1-14.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0228     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I10/1

Figure: Structure of Text Classification Model Based on Multi-Feature Fusion
Figure: CBOW Model Structure
Figure: GRU Node Structure
Table: Configuration Parameters of Experiment
Environment      Configuration
Processor        Intel(R) Core(TM) i5-4200U CPU @1.6GHz
Graphics card    NVIDIA GeForce GT 740M
Memory           12 GB
IDE, language    PyCharm, Python 3.7
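Table: Data Set
Item                           Computer Patents    Sogou News
Source                         SooPAR patents      Sogou Labs (open source)
Number of categories           5                   5
Number of samples              10,000              10,000
Average length (characters)    210                 843
Minimum length (characters)    150                 30
Maximum length (characters)    300                 400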
Table: Model Parameter Setting
Parameter                        Value
Convolution kernel widths        {1, 3, 5}
Number of convolution kernels    64
GRU units                        100
Batch size                       32
Epochs                           20
Optimizer                        Adam
Dropout rate                     0.25
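
To show how the kernel settings in the parameter table translate into a feature vector, here is a minimal pure-Python sketch of a multi-channel convolution. Only the kernel widths {1, 3, 5} and the 64 kernels per width come from the table. The ReLU activation, the unpadded ("valid") convolution, max-over-time pooling, the random weights, and the toy input size are all assumptions, and the function names are hypothetical.

```python
import random

KERNEL_WIDTHS = (1, 3, 5)  # from the parameter table
NUM_FILTERS = 64           # kernels per width, from the parameter table


def conv1d_max(seq, kernel, bias=0.0):
    """Slide one kernel (width x dim) over seq (length x dim).

    Applies ReLU and max-over-time pooling (both assumed, not from the source).
    Starting from 0.0 is equivalent to taking the max of ReLU outputs.
    """
    width = len(kernel)
    best = 0.0
    for start in range(len(seq) - width + 1):
        activation = bias
        for offset in range(width):
            activation += sum(
                x * w for x, w in zip(seq[start + offset], kernel[offset])
            )
        best = max(best, activation)
    return best


def multi_channel_features(seq, kernels_by_width):
    """Concatenate the pooled outputs of every kernel in every channel."""
    features = []
    for width in sorted(kernels_by_width):
        features.extend(conv1d_max(seq, k) for k in kernels_by_width[width])
    return features


rng = random.Random(0)
dim, length = 8, 12  # toy sizes; the real input would be the BiGRU outputs
seq = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(length)]
kernels = {
    width: [
        [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(width)]
        for _ in range(NUM_FILTERS)
    ]
    for width in KERNEL_WIDTHS
}

features = multi_channel_features(seq, kernels)
print(len(features))  # 3 widths x 64 kernels
```

Under these assumptions the three channels yield 3 x 64 = 192 pooled features per text, which would then feed the softmax classifier.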
Table: Comparison of Multi-Feature Fusion Models (Computer Data)
Dataset: Computer Patents
Model                                   Acc      P        R        F1
POS-BiGRUCNN                            57.3%    59.4%    58.1%    60.0%
PY-BiGRUCNN                             64.1%    65.3%    63.5%    64.1%
HZ-BiGRUCNN                             65.2%    62.7%    62.6%    62.2%
Word-BiGRUCNN                           76.1%    76.9%    76.2%    76.5%
POS-Word-BiGRUCNN                       78.1%    77.5%    77.1%    77.3%
PY-Word-BiGRUCNN                        79.1%    78.8%    78.5%    77.5%
HZ-Word-BiGRUCNN                        80.2%    79.5%    79.3%    79.5%
PY-POS-Word-BiGRUCNN                    81.3%    80.3%    78.6%    79.4%
HZ-POS-Word-BiGRUCNN                    81.2%    81.3%    80.5%    80.9%
PY-HZ-Word-BiGRUCNN                     82.2%    82.3%    81.9%    80.9%
PY-POS-HZ-Word-BiGRUCNN (this paper)    83.3%    83.6%    82.9%    83.4%
Table: Comparison of Multi-Feature Fusion Models (Sohu News)
Dataset: Sohu News
Model                                   Acc      P        R        F1
POS-BiGRUCNN                            64.9%    61.3%    62.6%    63.3%
PY-BiGRUCNN                             72.3%    71.9%    69.8%    71.2%
HZ-BiGRUCNN                             75.2%    72.7%    72.6%    72.2%
Word-BiGRUCNN                           83.6%    81.6%    80.1%    79.8%
POS-Word-BiGRUCNN                       85.1%    84.2%    81.9%    82.1%
PY-Word-BiGRUCNN                        86.0%    84.8%    82.2%    83.2%
HZ-Word-BiGRUCNN                        87.2%    85.7%    84.3%    84.2%
PY-POS-Word-BiGRUCNN                    89.1%    87.6%    89.3%    87.5%
HZ-POS-Word-BiGRUCNN                    89.4%    89.6%    89.3%    89.5%
PY-HZ-Word-BiGRUCNN                     90.2%    89.6%    89.9%    89.1%
PY-POS-HZ-Word-BiGRUCNN (this paper)    91.1%    91.3%    90.8%    89.7%
Table: Comparison Results of the Benchmark Model
Model                                   Acc      P        R        F1
LSTM                                    74.4%    74.7%    74.7%    74.9%
GRU                                     75.0%    74.4%    75.7%    75.1%
BiGRU                                   75.2%    75.5%    75.5%    75.6%
CNN                                     73.7%    72.1%    71.5%    71.6%
PY-POS-HZ-Word-BiGRUCNN (this paper)    83.3%    83.6%    82.9%    83.4%
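
The four columns in the result tables (Acc, P, R, F1) can be computed as in the sketch below. The page does not say how the per-class scores were averaged, so macro averaging is an assumption, and the labels are made-up toy data rather than the paper's predictions. The function name `classification_scores` is hypothetical.

```python
def classification_scores(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall, and F1.

    Macro averaging is an assumption; the source does not state the scheme.
    """
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "acc": sum(1 for t, p in pairs if t == p) / len(pairs),
        "p": sum(precisions) / n,
        "r": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels for three classes (not from the paper).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
scores = classification_scores(y_true, y_pred)
print({name: round(value, 3) for name, value in scores.items()})
```

Note that with macro averaging, F1 is the mean of per-class F1 values, so it need not lie between the averaged P and R.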
Figure: Single Filter Experimental Results
Figure: Double Filters Experimental Results
Figure: Experimental Results with Three Filters Combined
Figure: The Effects of Word Vector Dimensions on Models
Related articles:
[1] Chen Jie, Ma Jing, Li Xiaofeng. Short-Text Classification Method with Text Features from Pre-trained Models[J]. Data Analysis and Knowledge Discovery, 2021, 5(9): 21-30.
[2] Zhou Zeyu, Wang Hao, Zhao Zibo, Li Yueyan, Zhang Xiaoqin. Construction and Application of GCN Model for Text Classification with Associated Information[J]. Data Analysis and Knowledge Discovery, 2021, 5(9): 31-41.
[3] Yu Bengong, Zhu Xiaojie, Zhang Ziwei. A Capsule Network Model for Text Classification with Multi-level Feature Extraction[J]. Data Analysis and Knowledge Discovery, 2021, 5(6): 93-102.
[4] Wang Sidi, Hu Guangwei, Yang Siyu, Shi Yun. Automatic Transferring Government Website E-Mails Based on Text Classification[J]. Data Analysis and Knowledge Discovery, 2020, 4(6): 51-59.
[5] Xu Yuemei, Liu Yunwen, Cai Lianqiao. Predicting Retweets of Government Microblogs with Deep-combined Features[J]. Data Analysis and Knowledge Discovery, 2020, 4(2/3): 18-28.
[6] Xu Tongtong, Sun Huazhi, Ma Chunmei, Jiang Lifen, Liu Yichen. Classification Model for Few-shot Texts Based on Bi-directional Long-term Attention Features[J]. Data Analysis and Knowledge Discovery, 2020, 4(10): 113-123.
[7] Bengong Yu, Yumeng Cao, Yangnan Chen, Ying Yang. Classification of Short Texts Based on nLD-SVM-RF Model[J]. Data Analysis and Knowledge Discovery, 2020, 4(1): 111-120.
[8] Weimin Nie, Yongzhou Chen, Jing Ma. A Text Vector Representation Model Merging Multi-Granularity Information[J]. Data Analysis and Knowledge Discovery, 2019, 3(9): 45-52.
[9] Yunfei Shao, Dongsu Liu. Classifying Short-texts with Class Feature Extension[J]. Data Analysis and Knowledge Discovery, 2019, 3(9): 60-67.
[10] Heran Qin, Liu Liu, Bin Li, Dongbo Wang. Automatic Classification of Ancient Classics with Entity Features[J]. Data Analysis and Knowledge Discovery, 2019, 3(9): 68-76.
[11] Guo Chen, Tianxiang Xu. Sentence Function Recognition Based on Active Learning[J]. Data Analysis and Knowledge Discovery, 2019, 3(8): 53-61.
[12] Bengong Yu, Yangnan Chen, Ying Yang. Classifying Short Text Complaints with nBD-SVM Model[J]. Data Analysis and Knowledge Discovery, 2019, 3(5): 77-85.
[13] Zhiyong Tao, Xiaobing Li, Ying Liu, Xiaofang Liu. Classifying Short Texts with Improved-Attention Based Bidirectional Long Memory Network[J]. Data Analysis and Knowledge Discovery, 2019, 3(12): 21-29.
[14] Yuman Li, Zhibo Chen, Fu Xu. Classifying Texts with KACC Model[J]. Data Analysis and Knowledge Discovery, 2019, 3(10): 89-97.
[15] Zixuan Zhang, Hao Wang, Liping Zhu, Sanhong eng. Identifying Risks of HS Codes by China Customs[J]. Data Analysis and Knowledge Discovery, 2019, 3(1): 72-84.