Chinese Text Classification with Feature Fusion |
Wang Yan1, Wang Huyan2, Yu Bengong2,3 |
1Economic and Technical College, Anhui Agricultural University, Hefei 231200, China
2School of Management, Hefei University of Technology, Hefei 230009, China
3Key Laboratory of Process Optimization & Intelligent Decision-Making, Ministry of Education, Hefei University of Technology, Hefei 230009, China |
|
|
Abstract [Objective] This paper proposes a classification model for Chinese texts that addresses weak structure, spelling errors, and homonyms in the texts. [Methods] First, we constructed a multi-feature fusion method that builds on the traditional feature fusion model for text classification. Second, we combined word-level features, part-of-speech feature extension, Chinese character features, and Pinyin letter features to create a multi-feature semantic representation. Third, we fed this representation into a BiGRU to obtain contextual semantics, which were then processed by a multi-channel CNN to extract the main features. Finally, we merged these features and passed them to a softmax layer to predict the required category labels. [Results] The accuracy of the multi-feature fusion model reached 83.3% and 91.1% on two datasets, which was 7% higher than the existing model. [Limitations] The model still needs to be examined on larger datasets. [Conclusions] The proposed model can effectively perform Chinese text classification tasks.
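The fusion step described in [Methods] can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the four feature types are fused by simple concatenation per token, that character and Pinyin vectors are averaged within a word, and that Pinyin is handled at syllable level. The toy sample, the `Embedding` helper, the `fuse` function, and the dimension `DIM` are all hypothetical, and the BiGRU, multi-channel CNN, and softmax stages are omitted.

```python
import random

# Hypothetical toy sample: (word, POS tag, characters, pinyin) per token.
# A real pipeline would obtain these from a word segmenter, a POS tagger
# and a pinyin converter; here they are hand-written for illustration.
SAMPLE = [
    ("手机", "n", ["手", "机"], ["shou", "ji"]),
    ("很", "d", ["很"], ["hen"]),
    ("好用", "a", ["好", "用"], ["hao", "yong"]),
]

DIM = 4  # assumed embedding size for each feature type


class Embedding:
    """Lazily assigns a fixed random vector to each symbol (stand-in for
    a trained embedding table)."""

    def __init__(self, dim, seed):
        self.dim = dim
        self.rng = random.Random(seed)
        self.table = {}

    def __call__(self, symbol):
        if symbol not in self.table:
            self.table[symbol] = [self.rng.uniform(-1, 1) for _ in range(self.dim)]
        return self.table[symbol]


def mean_vectors(vectors):
    """Average a non-empty list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fuse(sample, word_emb, pos_emb, char_emb, pinyin_emb):
    """Return one fused vector per token: [word | POS | chars | pinyin].

    Concatenation is an assumption; the abstract does not say how the
    four feature types are combined.
    """
    fused = []
    for word, pos, chars, pinyins in sample:
        vec = (
            word_emb(word)
            + pos_emb(pos)
            + mean_vectors([char_emb(c) for c in chars])
            + mean_vectors([pinyin_emb(p) for p in pinyins])
        )
        fused.append(vec)
    return fused


word_emb, pos_emb = Embedding(DIM, 0), Embedding(DIM, 1)
char_emb, pinyin_emb = Embedding(DIM, 2), Embedding(DIM, 3)
fused = fuse(SAMPLE, word_emb, pos_emb, char_emb, pinyin_emb)
```

Each token ends up with a vector of length `4 * DIM`, and the resulting sequence is what a BiGRU would consume in the next stage. Because the Pinyin table is keyed by pronunciation, characters that are written differently but pronounced alike share the same Pinyin vector, which is one plausible reason such a feature could help with misspelled text.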
|
Received: 08 March 2021
Published: 01 July 2021
|
|
Fund: National Natural Science Foundation of China (71671057) |
Corresponding Author: Wang Huyan, ORCID: 0000-0001-8267-6183
E-mail: 1115419302@qq.com
|