Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (8): 28-40    DOI: 10.11925/infotech.2096-3467.2019.1222
A Comparative Study of Word Representation Models Based on Deep Learning
Yu Chuanming1, Wang Manyi2, Lin Hongjun1, Zhu Xingyu1, Huang Tingting2, An Lu3
1School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan 430073, China
2School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan 430073, China
3School of Information Management, Wuhan University, Wuhan 430072, China
Abstract  

[Objective] This study systematically examines the principles of traditional deep representation models and the latest pre-training models, and compares their performance in text mining tasks. [Methods] We compared the models from two angles: their design (the model side) and their measured results (the experimental side). All experiments were run on six datasets: CR, MR, MPQA, Subj, SST-2 and TREC. [Results] XLNet achieved the best average F1 value (0.9186), higher than BERT (0.8983), ELMo (0.8090), Word2Vec (0.7692), GloVe (0.7576) and FastText (0.7506). [Limitations] The study covers only the classification tasks of text mining; it does not compare word representation methods on machine translation, question answering or other tasks. [Conclusions] Traditional deep representation learning models and the latest pre-training models perform differently in text mining tasks: on average, the pre-training models (XLNet, BERT, ELMo) reached higher F1 values than Word2Vec, GloVe and FastText.

Key words: Word Representation Learning; Knowledge Representation; Deep Learning; Text Mining
Received: 08 November 2019      Published: 14 September 2020
ZTFLH:  TP391  
Corresponding Authors: Yu Chuanming     E-mail: yucm@zuel.edu.cn

Cite this article:

Yu Chuanming, Wang Manyi, Lin Hongjun, Zhu Xingyu, Huang Tingting, An Lu. A Comparative Study of Word Representation Models Based on Deep Learning. Data Analysis and Knowledge Discovery, 2020, 4(8): 28-40.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.1222     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I8/28

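Symbol      Description
W           Weight matrix in the model
x_i         Word-vector representation of the i-th word in the input sentence
pos         Position of a word in the sentence (Transformer)
d_model     Dimension of the model input and output set in the Transformer
Trans.      Abbreviation for the Transformer structure in the BERT model
E_i, T_i    Input and output at the i-th position of the input sentence in the BERT model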
Description of Related Symbols
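The symbols pos and d_model come from the positional encoding of the Transformer [12]. This page does not reproduce the formula, so the following is only a minimal sketch of the standard sinusoidal encoding, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The function name `positional_encoding` is hypothetical, and the sketch assumes an even d_model.

```python
import math


def positional_encoding(pos, d_model):
    """Return the sinusoidal encoding vector for a single position.

    Assumption: d_model is even, so sine/cosine values come in pairs.
    """
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    encoding = []
    for k in range(0, d_model, 2):
        # k plays the role of 2i in the formula above.
        angle = pos / (10000 ** (k / d_model))
        encoding.append(math.sin(angle))  # dimension k (even index)
        encoding.append(math.cos(angle))  # dimension k + 1 (odd index)
    return encoding


if __name__ == "__main__":
    # Position 0 always maps to alternating 0.0 / 1.0 values.
    print(positional_encoding(0, 4))
```

Each position receives a distinct vector of length d_model, which is how a Transformer layer can tell word order apart even though attention itself is order-agnostic.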
Research Framework of Word Representation Learning
Dataset      CR     MR      MPQA    Subj    SST-2  TREC
Sample size  3,774  10,662  10,605  10,000  9,613  5,952
Statistical Table of Text Mining Datasets
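Item                                      Details
Target task                               Text classification
Traditional deep representation models    Word2Vec, GloVe, FastText
Latest pre-training models                ELMo, BERT, XLNet
Feature extraction modules                TextCNN, Transformer
Selected variables                        Word representation learning algorithm
                                          Feature extraction method
                                          Word embedding dimension
                                          Word embedding training mode
                                          Whether position information is added
                                          Number of heads in multi-head attention
Training-to-test ratio                    8:2
Evaluation metrics                        Accuracy, precision, recall, F1 value, AUC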
Experiment Related Information
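The 8:2 split and four of the listed metrics can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the page does not say whether samples were shuffled before splitting, the helper names `split_8_2` and `binary_metrics` are hypothetical, and the metric formulas shown apply to binary labels (1 = positive, 0 = negative). AUC is omitted because it needs prediction scores rather than hard labels.

```python
import random


def split_8_2(samples, seed=0):
    """Shuffle a copy of the samples and split it 8:2 into train and test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 8 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # 3,774 is the CR sample size from the dataset table.
    train, test = split_8_2(range(3774))
    print(len(train), len(test))
    print(binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```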
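Algorithm        Parameter                          Value
Word2Vec[3]      Iterations                         15
                 Window size                        8
                 Training method                    CBOW, negative sampling
                 Embedding dimension                50, 100, 200, 300
                 Word embedding training mode       Pre-trained vectors (updated / not updated), randomly initialized vectors
GloVe[4]         Maximum iterations                 10
                 Window size                        5
                 Embedding dimension                50, 100, 200, 300
                 Word embedding training mode       Pre-trained vectors (updated / not updated), randomly initialized vectors
FastText[5,6]    Iterations                         10
                 Window size                        5
                 Training method                    CBOW, negative sampling
                 Embedding dimension                50, 100, 200, 300
                 Word embedding training mode       Pre-trained vectors (updated / not updated), randomly initialized vectors
ELMo[7]          Iterations                         10
                 Learning rate                      0.001
                 Word vector size                   256
                 dropout_prob                       0.5
                 LSTM hidden layer size             128
                 batch size                         128
BERT[10]         dropout_prob                       0.1
                 Hidden layer size                  768
                 Maximum position embedding size    512
                 Number of attention heads          2, 4, 6, 8, 12 (standard), 16
                 Position embedding                 With (standard), without
XLNet[11]        Iterations                         3
                 Learning rate                      0.00002
                 Maximum sequence length            128
                 batch size                         8
TextCNN[13]      Iterations                         10
                 Learning rate                      0.001
                 Word vector size                   50, 100, 200, 300
                 dropout_prob                       0.5 (fully connected layer)
                 Convolution kernel window sizes    [2,3,4,5]
                 batch size                         128
Transformer[12]  Iterations                         10
                 Learning rate                      0.001
                 Word vector size                   50, 100, 200, 300
                 dropout_prob                       0.5 (fully connected layer), 0.9 (attention layer)
                 Number of blocks                   1
                 Number of attention heads          8
                 batch size                         128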
Parameter Configuration for Algorithms
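One consequence of the BERT settings is worth spelling out. Under the usual multi-head attention convention, the hidden size is divided evenly among the heads, so each tested head count implies a per-head dimension. The sketch below assumes that convention; the page does not state how the hidden size was partitioned, and the names `HIDDEN_SIZE`, `HEAD_COUNTS` and `head_dimension` are hypothetical.

```python
HIDDEN_SIZE = 768                    # BERT hidden layer size from the table
HEAD_COUNTS = [2, 4, 6, 8, 12, 16]   # head counts tested in the experiments


def head_dimension(hidden_size, num_heads):
    """Per-head dimension, assuming the hidden size is split evenly."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden size must be divisible by the head count")
    return hidden_size // num_heads


head_dims = {h: head_dimension(HIDDEN_SIZE, h) for h in HEAD_COUNTS}

if __name__ == "__main__":
    for heads, dim in head_dims.items():
        print(f"{heads:>2} heads -> {dim} dimensions per head")
```

All six tested head counts divide 768 exactly, so with the hidden size fixed, changing the number of heads changes only how that width is partitioned.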
Method            CR      MR      MPQA    Subj    SST-2   TREC    Average
Word2Vec+TextCNN  0.7689  0.7454  0.6515  0.8629  0.7777  0.6875  0.7490
Word2Vec+Trans.   0.8411  0.7574  0.6408  0.8749  0.7960  0.7049  0.7692
GloVe+TextCNN     0.7917  0.7343  0.6524  0.8582  0.7839  0.6875  0.7513
GloVe+Trans.      0.7676  0.7598  0.6404  0.8723  0.7946  0.7109  0.7576
FastText+TextCNN  0.7743  0.7425  0.6484  0.8634  0.7838  0.6768  0.7482
FastText+Trans.   0.7560  0.7512  0.6376  0.8664  0.8005  0.6918  0.7506
ELMo              0.8232  0.7201  0.7959  0.8996  0.7966  0.8186  0.8090
BERT              0.9052  0.9336  0.8325  0.9698  0.8599  0.9431  0.8983
XLNet             0.9424  0.8721  0.8562  0.9681  0.9410  0.9315  0.9186
Results of Different Word Representation Learning Methods (F1 Value)
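The Average column appears to be the unweighted mean of the six per-dataset F1 values. The page does not say so explicitly, so this is an inference from the figures. The sketch below recomputes it for three rows copied from the table; the names `F1_SCORES` and `macro_average` are hypothetical.

```python
# Per-dataset F1 values in the order CR, MR, MPQA, Subj, SST-2, TREC.
F1_SCORES = {
    "Word2Vec+Trans.": [0.8411, 0.7574, 0.6408, 0.8749, 0.7960, 0.7049],
    "ELMo": [0.8232, 0.7201, 0.7959, 0.8996, 0.7966, 0.8186],
    "XLNet": [0.9424, 0.8721, 0.8562, 0.9681, 0.9410, 0.9315],
}


def macro_average(scores):
    """Unweighted mean: every dataset counts equally, whatever its size."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for method, scores in F1_SCORES.items():
        print(f"{method}: {macro_average(scores):.4f}")
```

Because the mean is unweighted, the 3,774-sample CR dataset influences the average as much as the 10,662-sample MR dataset.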
Influence of Different Dimensions on the Word2Vec Model
Influence of Different Dimensions on the GloVe Model
Influence of Different Dimensions on the FastText Model
Influence of Different Training Methods (Word2Vec) on Classification Results
Influence of Different Training Methods (GloVe) on Classification Results
Influence of Different Training Methods (FastText) on Classification Results
Position information  Accuracy  AUC     Loss    Precision  Recall  F1
Yes                   0.8820    0.8775  0.3893  0.9159     0.8947  0.9052
No                    0.8369    0.8269  0.5035  0.8745     0.8653  0.8699
Impact of Position Information on the Performance
Number of heads  Accuracy  AUC     Loss    Precision  Recall  F1
2                0.6300    0.5000  0.6595  0.6300     1.0000  0.7730
4                0.8329    0.8215  0.4417  0.8689     0.8653  0.8671
6                0.8501    0.8367  0.4204  0.8755     0.8884  0.8819
8                0.8302    0.7906  0.4457  0.8160     0.9432  0.8750
12               0.8820    0.8775  0.3893  0.9159     0.8947  0.9052
16               0.8780    0.8625  0.4227  0.8884     0.9221  0.9049
Impact of the Number of Attention Heads on the Performance
Line Chart of Classification Results of Multi-attention Heads
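The F1 column of the head-count table can be checked against its Precision and Recall columns, since F1 is their harmonic mean. The 2-head row (accuracy 0.6300, precision 0.6300, recall 1.0000, AUC 0.5000) is also what a classifier would score if it labeled every sample positive on a test set with about 63% positives. That reading is an inference from the numbers, not something the page states, and the helper names below are hypothetical.

```python
def f1_from(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def always_positive_metrics(positive_rate):
    """Metrics of a classifier that labels every sample as positive.

    positive_rate is the share of truly positive samples in the test set
    (an assumed input here, not a figure reported on the page).
    """
    precision = positive_rate  # all predictions are positive
    recall = 1.0               # no positive sample is ever missed
    accuracy = positive_rate   # correct exactly on the positive samples
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1_from(precision, recall)}


if __name__ == "__main__":
    print(round(f1_from(0.9159, 0.8947), 4))  # 12-head row
    print(round(f1_from(0.8689, 0.8653), 4))  # 4-head row
    print(always_positive_metrics(0.63))      # pattern of the 2-head row
```

An F1 value of 0.7730 can therefore look respectable even when the model has learned nothing, which is why the table also reports AUC and accuracy.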
[1] Yuan Shuhan, Xiang Yang. A Review of Lexical Semantic Representation[J]. Journal of Chinese Information Processing, 2016,30(5):1-8.
[2] Turian J P, Ratinov L A, Bengio Y. Word Representations: A Simple and General Method for Semi-supervised Learning[C]// Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. 2010: 384-394.
[3] Mikolov T, Chen K, Corrado G, et al. Efficient Estimation of Word Representations in Vector Space[OL]. arXiv Preprint, arXiv:1301.3781.
[4] Pennington J, Socher R, Manning C. GloVe: Global Vectors for Word Representation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 2014: 1532-1543.
[5] Bojanowski P, Grave E, Joulin A, et al. Enriching Word Vectors with Subword Information[J]. Transactions of the Association for Computational Linguistics, 2017(5):135-146.
[6] Joulin A, Grave E, Bojanowski P, et al. Bag of Tricks for Efficient Text Classification[C]// Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. 2017: 427-431.
[7] Peters M E, Neumann M, Iyyer M, et al. Deep Contextualized Word Representations[C]// Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018: 2227-2237.
[8] Radford A, Narasimhan K, Salimans T, et al. Improving Language Understanding by Generative Pre-Training[EB/OL]. [2019-10-13]. https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf.
[9] Radford A, Wu J, Child R, et al. Language Models are Unsupervised Multitask Learners[EB/OL]. [2019-10-01]. https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
[10] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[OL]. arXiv Preprint, arXiv:1810.04805.
[11] Yang Z L, Dai Z H, Yang Y M, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding[OL]. arXiv Preprint, arXiv:1906.08237.
[12] Vaswani A, Shazeer N, Parmar N, et al. Attention is All You Need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017: 6000-6010.
[13] Kim Y. Convolutional Neural Networks for Sentence Classification[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 2014: 1746-1751.
[14] Zhou Lian. Exploration of the Working Principle and Application of Word2vec[J]. Sci-Tech Information Development & Economy, 2015,25(2):145-148.
[15] Bellman R E. Dynamic Programming[M]. New York: Dover Publications, Inc., 2003.
[16] Bordag S. A Comparison of Co-occurrence and Similarity Measures as Simulations of Context[C]// Proceedings of the 9th International Conference on Computational Linguistics and Intelligent Text Processing. 2008: 52-63.
[17] Aharon M, Elad M, Bruckstein A. K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation[J]. IEEE Transactions on Signal Processing, 2006,54(11):4311-4322.
[18] Bengio Y, Ducharme R, Vincent P, et al. A Neural Probabilistic Language Model[J]. Journal of Machine Learning Research, 2003,3:1137-1155.
[19] Dai Z H, Yang Z L, Yang Y M, et al. Transformer-XL: Attentive Language Models Beyond a Fixed-length Context[OL]. arXiv Preprint, arXiv:1901.02860.
[20] Yu Chuanming. A Cross-domain Text Sentiment Analysis Based on Deep Recurrent Neural Network[J]. Library and Information Service, 2018,62(11):23-34.
[21] Zhao Yaou, Zhang Jiachong, Li Yibin, et al. Sentiment Analysis Using ELMo and Multi-scale Convolutional Neural Networks[J/OL]. Journal of Computer Applications. http://kns.cnki.net/kcms/detail/51.1307.TP.20190927.0949.004.html.
[22] Li Lin, Li Hui. Computing Text Similarity Based on Concept Vector Space[J]. Data Analysis and Knowledge Discovery, 2018,2(5):48-58.
[23] Zhao Hong, Wang Fang, Wang Xiaoyu, et al. Research on Construction and Application of a Knowledge Discovery System Based on Intelligent Processing of Large-scale Governmental Documents[J]. Journal of the China Society for Scientific and Technical Information, 2018,37(8):805-812.
[24] Zhang Xiaojuan. Personalized Query Reformulations with Embeddings[J]. Journal of the China Society for Scientific and Technical Information, 2018,37(6):621-630.
[25] Yang Piao, Dong Wenyong. Chinese NER Based on BERT Embedding[J/OL]. Computer Engineering, https://doi.org/10.19678/j.issn.1000-3428.0054272.
[26] Hu M Q, Liu B. Mining and Summarizing Customer Reviews[C]// Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2004: 168-177.
[27] Pang B, Lee L. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales[C]// Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. 2005: 115-124.
[28] Wiebe J, Wilson T, Cardie C. Annotating Expressions of Opinions and Emotions in Language[J]. Language Resources and Evaluation, 2005,39(2-3):165-210.
[29] Pang B, Lee L. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts[C]// Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics. 2004: 271-278.
[30] Socher R, Perelygin A, Wu J Y, et al. Recursive Deep Models for Semantic Compositionality over a Sentiment Treebank[C]// Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 2013: 1631-1642.
[31] Li X, Roth D. Learning Question Classifiers[C]// Proceedings of the 19th International Conference on Computational Linguistics-Volume 1. 2002: 1-7.