Data Analysis and Knowledge Discovery, 2019, Vol. 3, Issue 9: 45-52. https://doi.org/10.11925/infotech.2096-3467.2018.1161
Research Paper
A Text Vector Representation Model Merging Multi-Granularity Information *
Weimin Nie, Yongzhou Chen, Jing Ma
College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
Abstract

[Objective] This paper proposes a model that extracts the semantic features of text more comprehensively and improves how well text vectors represent text semantics. [Methods] Convolutional neural networks extract word-granularity, topic-granularity and character-granularity feature vectors. A "merging gate" mechanism then fuses the three feature vectors into the final text vector, which is evaluated in a text classification experiment. [Results] In the text classification experiment on the Sogou corpus, the model reaches an accuracy of 92.56%, a precision of 92.33%, a recall of 92.07% and an F1 score of 92.20%. These are 2.40, 2.05, 1.77 and 1.91 percentage points higher, respectively, than the baseline Text-CNN model. [Limitations] The model captures word-order relations only over a short range, so long-distance dependency features are not yet included, and the corpus is relatively small. [Conclusions] The proposed model extracts the semantic features of text more comprehensively, and the resulting text vectors represent text semantics better.

Keywords: Text Classification; Word Vector; Convolutional Neural Network; Topic Model
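Received: 2018-10-19      Published: 2019-10-23
Chinese Library Classification (ZTFLH): TP393 G35
Funding: * This work was supported by the Fundamental Research Funds for the Central Universities, Forward-Looking Development Strategy Research Program, "Research on Government Management Paradigms for Cross-Border E-Commerce Based on Big Data Technology" (Grant No. NW2018004), and by the General Program of the National Natural Science Foundation of China, "Research on Adaptive Topic Tracking Methods for Online Public Opinion Based on Evolving Ontology" (Grant No. 71373123).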
Cite this article:
Weimin Nie, Yongzhou Chen, Jing Ma. A Text Vector Representation Model Merging Multi-Granularity Information [J]. Data Analysis and Knowledge Discovery, 2019, 3(9): 45-52.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.1161      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2019/V3/I9/45
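Figure: Text representation model based on word vectors and convolutional neural networks
Figure: Text vector representation model merging multi-granularity information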
Actual class      Predicted positive        Predicted negative
Positive          True Positive (TP)        False Negative (FN)
Negative          False Positive (FP)       True Negative (TN)
Table: Confusion matrix
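The four cells of the confusion matrix are what the evaluation metrics reported on this page (Accuracy, Precision, Recall, F-score) are built from. The following is a minimal sketch using the standard binary definitions. The class name, field names and the example counts are hypothetical, and this page does not state how the authors aggregate the metrics across classes, so the sketch covers a single class only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts for one binary confusion matrix (illustrative names)."""
    tp: int  # true positives: actual positive, predicted positive
    fn: int  # false negatives: actual positive, predicted negative
    fp: int  # false positives: actual negative, predicted positive
    tn: int  # true negatives: actual negative, predicted negative

    def accuracy(self) -> float:
        total = self.tp + self.fn + self.fp + self.tn
        return (self.tp + self.tn) / total if total else 0.0

    def precision(self) -> float:
        predicted_positive = self.tp + self.fp
        return self.tp / predicted_positive if predicted_positive else 0.0

    def recall(self) -> float:
        actual_positive = self.tp + self.fn
        return self.tp / actual_positive if actual_positive else 0.0

    def f1(self) -> float:
        # Harmonic mean of precision and recall.
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts, chosen only to exercise the formulas.
cm = ConfusionMatrix(tp=90, fn=10, fp=5, tn=95)
print(cm.accuracy(), cm.precision(), cm.recall(), cm.f1())
```

Each method returns 0.0 when its denominator is zero, which is one common convention; other conventions (for example raising an error) are equally defensible.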
Figure: Classification F1 score of LDA-CNN for different numbers of topics
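Parameter                        Topic
Window sizes                     (3,4,5)    (3,4,5)    (12,13,14)
Filters per window size          20         20         20
Batch size                       50         50         50
Dropout rate                     0.5        0.5        0.5
L2 regularization coefficient    0          0          0.01
Objective function               Cross-entropy loss with L2 regularization
Optimizer                        Adam
Table: Convolutional neural network parameters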
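To make the parameter table concrete, the sketch below stores the listed hyperparameters in a small configuration object and derives two quantities from them: the number of positions a convolution window produces, and the length of the pooled feature vector. It assumes a Text-CNN style layout (stride-1, unpadded convolutions followed by max-over-time pooling and concatenation) and an L2 penalty of the form lambda times the sum of squared weights. Neither detail is stated on this page, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class TextCnnConfig:
    """Hyperparameters from the parameter table; field names are illustrative."""
    window_sizes: Tuple[int, ...] = (3, 4, 5)
    filters_per_window: int = 20
    batch_size: int = 50
    dropout: float = 0.5
    l2_lambda: float = 0.0

    def conv_output_length(self, seq_len: int, window: int) -> int:
        """Positions produced by a stride-1, unpadded convolution (assumed)."""
        return max(seq_len - window + 1, 0)

    def feature_dim(self) -> int:
        """Vector length after max-over-time pooling and concatenation (assumed)."""
        return len(self.window_sizes) * self.filters_per_window


def regularized_loss(cross_entropy: float,
                     weights: Sequence[float],
                     l2_lambda: float) -> float:
    """Cross-entropy plus an L2 penalty; the exact penalty form is an assumption."""
    return cross_entropy + l2_lambda * sum(w * w for w in weights)


# The (3,4,5) setting, and the (12,13,14) setting that uses an L2 coefficient of 0.01.
base = TextCnnConfig()
wide = TextCnnConfig(window_sizes=(12, 13, 14), l2_lambda=0.01)
print(base.feature_dim(), wide.conv_output_length(seq_len=100, window=12))
```

Under these assumptions every setting in the table yields a 60-dimensional feature vector (3 window sizes times 20 filters), so the three granularities would produce vectors of equal length.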
Model                         Accuracy   Precision   Recall   F-score
Word-vector CNN (baseline)    0.9016     0.9027      0.9030   0.9029
Character-vector CNN          0.8848     0.8896      0.8855   0.8875
LDA-CNN                       0.9172     0.9212      0.9182   0.9197
Table: Experimental results with single-granularity information
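Model              Accuracy   Precision   Recall   F-score
Word-Topic         0.9113     0.9124      0.9127   0.9125
Character-Topic    0.9027     0.9041      0.9034   0.9038
Word-Character     0.8917     0.8974      0.8926   0.8950
Table: Experimental results for pairwise combinations of the three granularities with simple concatenation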
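Model              Accuracy   Precision   Recall   F-score
Word-Topic         0.9183     0.9205      0.9197   0.9200
Character-Topic    0.9043     0.9068      0.9061   0.9064
Word-Character     0.9010     0.9050      0.9020   0.9035
Table: Experimental results for pairwise combinations of the three granularities with the merging gate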
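The two pairwise tables can be compared row by row to see how much the merging gate adds over simple concatenation. The short sketch below does that arithmetic on the reported F-scores. The dictionary keys and the function name are illustrative, and the values are copied from the tables rather than recomputed.

```python
from typing import Dict

# F-scores copied from the simple-concatenation table.
concat_f1: Dict[str, float] = {
    "word-topic": 0.9125,
    "character-topic": 0.9038,
    "word-character": 0.8950,
}
# F-scores copied from the merging-gate table.
gate_f1: Dict[str, float] = {
    "word-topic": 0.9200,
    "character-topic": 0.9064,
    "word-character": 0.9035,
}


def gains_in_points(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Return after - before in percentage points, rounded to 2 decimals."""
    return {key: round((after[key] - before[key]) * 100, 2) for key in before}


f1_gain = gains_in_points(concat_f1, gate_f1)
for pair, gain in f1_gain.items():
    print(f"{pair}: +{gain:.2f} points")
```

On the reported figures the gate improves the F-score for every pair, by 0.75 points (word-topic), 0.26 points (character-topic) and 0.85 points (word-character).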
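Model                                          Accuracy   Precision   Recall   F-score
Word-vector CNN (baseline)                     0.9016     0.9028      0.9030   0.9029
Model with concatenation                       0.9160     0.9176      0.9186   0.9181
Model with merging gate (this paper's model)   0.9256     0.9233      0.9207   0.9220
Table: Experimental results for different feature-vector fusion methods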
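The difference between the two fusion methods compared here can be illustrated in code. This page does not give the equations of the merging gate, so the sketch below is NOT the authors' formulation. It assumes one common gated-fusion design: each feature vector gets a scalar score from a learned scoring vector, a softmax turns the scores into gate values, and the output is the gate-weighted sum of the inputs. All names are hypothetical, and random numbers stand in for trained parameters.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def concatenate(vectors: Sequence[Vector]) -> Vector:
    """Simple concatenation: every feature passes through unweighted."""
    return [x for v in vectors for x in v]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class MergingGate:
    """ASSUMED gate design: scalar score per input vector, softmax over scores."""

    def __init__(self, num_inputs: int, dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Random values stand in for parameters that training would learn.
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                        for _ in range(num_inputs)]
        self.biases = [0.0] * num_inputs

    def gate_values(self, vectors: Sequence[Vector]) -> List[float]:
        scores = [sum(w * x for w, x in zip(ws, v)) + b
                  for ws, v, b in zip(self.weights, vectors, self.biases)]
        return softmax(scores)

    def fuse(self, vectors: Sequence[Vector]) -> Vector:
        """Gate-weighted sum of the inputs; output keeps the input dimension."""
        gates = self.gate_values(vectors)
        dim = len(vectors[0])
        return [sum(g * v[i] for g, v in zip(gates, vectors)) for i in range(dim)]


# Hypothetical 4-dimensional feature vectors for the three granularities.
word_vec = [0.2, -0.1, 0.4, 0.0]
topic_vec = [0.5, 0.3, -0.2, 0.1]
char_vec = [-0.3, 0.2, 0.1, 0.6]
inputs = [word_vec, topic_vec, char_vec]

gate = MergingGate(num_inputs=3, dim=4)
gates = gate.gate_values(inputs)
fused = gate.fuse(inputs)
concatenated = concatenate(inputs)
print(len(concatenated), len(fused), gates)
```

In this sketch concatenation triples the vector length and treats all features equally, while the gate keeps the original length and lets the model weight each granularity per input text. Whether the authors' gate is scalar or element-wise is not recoverable from this page.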