Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (9): 45-52    DOI: 10.11925/infotech.2096-3467.2018.1161
A Text Vector Representation Model Merging Multi-Granularity Information
Weimin Nie, Yongzhou Chen, Jing Ma
College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
[Objective] This paper proposes a model that extracts semantic features from text more comprehensively and improves how well text vectors represent semantics. [Methods] We used convolutional neural networks to obtain feature vectors at the word, topic and character granularities. The three feature vectors were then combined by a “merging gate” mechanism to generate the final text vector. Finally, we evaluated the model in a text classification experiment. [Results] The model achieved an accuracy of 92.56%, a precision of 92.33%, a recall of 92.07% and an F-score of 92.20%, which are 2.40, 2.05, 1.77 and 1.91 percentage points higher, respectively, than the results of Text-CNN. [Limitations] Long-distance dependency features have yet to be incorporated, and the corpus needs to be expanded. [Conclusions] The proposed model represents text semantics better than Text-CNN in this experiment.
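
The abstract does not spell out the form of the “merging gate”, so the following is only a minimal sketch of one plausible reading: each granularity's feature vector gets a scalar score, the scores are normalised with a softmax, and the final text vector is the gate-weighted sum. The function names (`softmax`, `merge_gate`), the gate parameters and the toy vectors are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def merge_gate(
    features: Sequence[Sequence[float]],
    gate_weights: Sequence[Sequence[float]],
    gate_biases: Sequence[float],
) -> List[float]:
    """Blend equally sized feature vectors with softmax-normalised gates.

    Hypothetical gate (an assumption, not taken from the paper):
    score_i = gate_weights[i] . features[i] + gate_biases[i]
    """
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("all feature vectors must have the same length")
    scores = [
        sum(w * x for w, x in zip(ws, f)) + b
        for ws, f, b in zip(gate_weights, features, gate_biases)
    ]
    gates = softmax(scores)
    # Weighted sum of the feature vectors, one output value per dimension.
    return [sum(g * f[k] for g, f in zip(gates, features)) for k in range(dim)]


# Toy inputs standing in for the word, topic and character CNN outputs.
word_vec = [0.2, 0.8, 0.1]
topic_vec = [0.5, 0.1, 0.4]
char_vec = [0.3, 0.3, 0.3]
zero_weights = [[0.0, 0.0, 0.0]] * 3  # untrained gate: every score is 0
merged = merge_gate([word_vec, topic_vec, char_vec], zero_weights, [0.0, 0.0, 0.0])
print(merged)  # with equal scores this is the element-wise mean
```

In a trained model the gate parameters would be learned jointly with the CNNs, so the blend could favour whichever granularity is most informative for a given text.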

Key words: Text Classification; Word Vector; Convolutional Neural Network; Topic Model
Received: 19 October 2018      Published: 23 October 2019
ZTFLH:  TP393 G35  

Weimin Nie, Yongzhou Chen, Jing Ma. A Text Vector Representation Model Merging Multi-Granularity Information. Data Analysis and Knowledge Discovery, 2019, 3(9): 45-52.

Actual class    Predicted positive     Predicted negative
Positive        True Positive (TP)     False Negative (FN)
Negative        False Positive (FP)    True Negative (TN)
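
The four confusion-matrix cells above are what accuracy, precision, recall and F-score are normally computed from. The sketch below uses the standard binary-class definitions. The function name `classification_metrics` and the example counts are made up for illustration, and the page does not say how the paper averages these metrics across classes.

```python
from typing import Dict


def classification_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Standard binary metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F-score here is the harmonic mean of precision and recall (F1).
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Illustrative counts only; they are not taken from the paper.
m = classification_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```
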
Parameter  Topic
Window size  (3,4,5)  (3,4,5)  (12,13,14)
Filters per window  20  20  20
Batch size  50  50  50
Dropout rate  0.5  0.5  0.5
L2 regularization coefficient  0  0  0.01
Objective function  Cross-entropy loss with L2 regularization
Optimizer  Adam
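
As a rough illustration of what these settings imply, the sketch below records the three value columns as configuration objects and derives two quantities from them. It assumes a Text-CNN style layout in which each filter contributes one value after max-over-time pooling, and it writes the L2 penalty as the coefficient times the sum of squared weights. Both are assumptions, as are the names `BranchConfig` and `l2_cross_entropy`. The columns are kept in their original left-to-right order because two of the column labels are missing from the table.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class BranchConfig:
    """Hyperparameters for one CNN branch (one value column of the table)."""
    window_sizes: Tuple[int, ...]
    filters_per_window: int = 20
    batch_size: int = 50
    dropout: float = 0.5
    l2_coefficient: float = 0.0

    def pooled_feature_size(self) -> int:
        # Assumption: max-over-time pooling yields one value per filter.
        return len(self.window_sizes) * self.filters_per_window


def l2_cross_entropy(
    probs: Sequence[float],
    true_index: int,
    weights: Sequence[float],
    l2_coefficient: float,
) -> float:
    """Cross-entropy for one example plus an L2 penalty on the weights."""
    cross_entropy = -math.log(probs[true_index])
    penalty = l2_coefficient * sum(w * w for w in weights)
    return cross_entropy + penalty


# The three value columns of the table, in their original order.
branches = [
    BranchConfig(window_sizes=(3, 4, 5)),
    BranchConfig(window_sizes=(3, 4, 5)),
    BranchConfig(window_sizes=(12, 13, 14), l2_coefficient=0.01),
]
sizes = [b.pooled_feature_size() for b in branches]
print(sizes)

# Toy probabilities and weights, purely illustrative.
loss = l2_cross_entropy([0.7, 0.2, 0.1], 0, [0.5, -0.5], branches[2].l2_coefficient)
print(round(loss, 4))
```

Under these assumptions every branch produces a feature vector of the same length, which would let the three vectors be combined element-wise.
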
Model                       Accuracy  Precision  Recall  F-score
Word-vector CNN (baseline)  0.9016    0.9027     0.9030  0.9029
Character-vector CNN        0.8848    0.8896     0.8855  0.8875
LDA-CNN                     0.9172    0.9212     0.9182  0.9197
Model            Accuracy  Precision  Recall  F-score
Word-Topic       0.9113    0.9124     0.9127  0.9125
Character-Topic  0.9027    0.9041     0.9034  0.9038
Word-Character   0.8917    0.8974     0.8926  0.8950
Model            Accuracy  Precision  Recall  F-score
Word-Topic       0.9183    0.9205     0.9197  0.9200
Character-Topic  0.9043    0.9068     0.9061  0.9064
Word-Character   0.9010    0.9050     0.9020  0.9035
Model                     Accuracy  Precision  Recall  F-score
[Text-CNN]                0.9016    0.9028     0.9030  0.9029
Model with concatenation  0.9160    0.9176     0.9186  0.9181
[Proposed model]          0.9256    0.9233     0.9207  0.9220
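
The improvements quoted in the abstract can be checked against the first and last rows of the comparison above. The short sketch below does the subtraction with `decimal.Decimal` so that no floating-point rounding creeps in. The variable names are illustrative; the figures are copied from the table.

```python
from decimal import Decimal

METRICS = ("Accuracy", "Precision", "Recall", "F-score")
first_row = ("0.9016", "0.9028", "0.9030", "0.9029")
last_row = ("0.9256", "0.9233", "0.9207", "0.9220")

# Difference between the two rows, expressed in percentage points.
gains = {
    name: (Decimal(high) - Decimal(low)) * 100
    for name, low, high in zip(METRICS, first_row, last_row)
}
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} percentage points")
```

The differences come out as 2.40, 2.05, 1.77 and 1.91, matching the figures in the abstract.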
