Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (7): 73-81    DOI: 10.11925/infotech.2096-3467.2017.0506
Original Article
Sentiment Analysis in Cross-Domain Environment with Deep Representative Learning
Yu Chuanming1, Feng Bolin1, An Lu2
1 School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan 430073, China
2 School of Information Management, Wuhan University, Wuhan 430072, China
Abstract  

[Objective] This study trains a model on a source domain with abundant labeled data and projects source-domain and target-domain documents into the same feature space, in order to address the poor performance caused by the lack of labeled data in the target domain. [Methods] First, we collected Chinese, English and Japanese reviews of books, DVDs and music from Amazon. Then, we proposed a Cross Domain Deep Representation Model (CDDRM) based on the Convolutional Neural Network (CNN) and Structural Correspondence Learning (SCL) techniques. Finally, we conducted cross-domain knowledge transfer and sentiment analysis. [Results] The best F1 value of CDDRM was 0.7368, which indicates the effectiveness of the proposed model. [Limitations] The F1 value of our model on long articles needs to be improved. [Conclusions] Transfer learning can help supervised learning obtain good classification results with small training sets. Compared with traditional methods, CDDRM does not require the training and test sets to have the same or a similar data structure.

Key words: Cross Domain; Transfer Learning; Deep Representation Learning; Sentiment Analysis
Received: 31 May 2017      Published: 26 July 2017
ZTFLH:  TP391  

Cite this article:

Yu Chuanming,Feng Bolin,An Lu. Sentiment Analysis in Cross-Domain Environment with Deep Representative Learning. Data Analysis and Knowledge Discovery, 2017, 1(7): 73-81.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.0506     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2017/V1/I7/73

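Symbol | Description
Subscripts s, t | Subscript s denotes the source domain (or language); subscript t denotes the target domain (or language).
D | The set of domains: books, DVDs and music, denoted B, D and M respectively.
L | The language: English, Chinese and Japanese, denoted E, C and J respectively.
LD | A given language and domain. For example, EB denotes English books, CM denotes Chinese music, and JD denotes Japanese DVDs.
${{L}_{S}}{{D}_{S}}-{{L}_{t}}{{D}_{t}}$ | Language and domain transfer learning from the source language and source domain to the target language and target domain (in this paper the source language is the same as the target language; only the domain differs).
V | The vocabulary.
X | The document set. Taking the source domain as an example, ${{X}_{S}}=\{x_{S}^{1},x_{S}^{2},\cdots ,x_{S}^{n}\}$ denotes the source-language document set, which contains n documents.
Y | The label set. Taking the source domain as an example, ${{Y}_{S}}=\{y_{S}^{1},y_{S}^{2},\cdots ,y_{S}^{n}\}$ gives the class of each document in the source-language document set.
TrainTesttU | The training set, the test set and the unlabeled documents, respectively.
F | The feature set used to represent the documents.
P | The probability distribution of the features.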
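The two-letter LD codes and the source-target task names used in the tables that follow can be decoded mechanically. The sketch below is only an illustration of that naming scheme; the names `Setting`, `parse_setting` and `parse_task` are hypothetical and do not come from the paper.

```python
from typing import NamedTuple, Tuple

# Letter codes as defined in the notation table.
LANGUAGES = {"E": "English", "C": "Chinese", "J": "Japanese"}
DOMAINS = {"B": "books", "D": "DVD", "M": "music"}


class Setting(NamedTuple):
    """One language/domain combination, e.g. English books (hypothetical helper type)."""
    language: str
    domain: str


def parse_setting(code: str) -> Setting:
    """Decode a two-letter LD code such as 'EB' (language letter, then domain letter)."""
    if len(code) != 2:
        raise ValueError(f"expected a two-letter code, got {code!r}")
    return Setting(LANGUAGES[code[0]], DOMAINS[code[1]])


def parse_task(task: str) -> Tuple[Setting, Setting]:
    """Decode a transfer task such as 'EB-ED' into (source, target)."""
    source, target = task.split("-")
    return parse_setting(source), parse_setting(target)


source, target = parse_task("EB-ED")
print(source, "->", target)
```

In every task listed on this page the language letter is the same on both sides of the hyphen, which matches the note that only the domain changes.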
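Method | Parameters
Upper | Documents in the training and test sets are represented using the 5,000 most frequent terms in the unlabeled documents; normalized TF-IDF is used as the term weight.
DCI | Documents in the training and test sets are represented using the 5,000 most frequent terms in the unlabeled documents; the term-frequency threshold is 30 for English and Japanese reviews and 10 for Chinese reviews; the number of pivot word pairs is 100, and cosine similarity is used as the DCF; the English-Japanese dictionary is the one-to-one translation dictionary provided by Prettenhofer et al.[20]; the English-Chinese dictionary is built from one-to-one translations obtained through the Baidu Translate API.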
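The document representation shared by both baselines (a vocabulary of the most frequent terms in the unlabeled documents, weighted by normalized TF-IDF) can be sketched as follows. This is a minimal illustration, not the authors' code: the exact TF-IDF variant is not given on this page, so the smoothed IDF and the L2 normalization below are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def build_vocabulary(unlabeled_docs, k=5000):
    """Return the k most frequent terms in the unlabeled (tokenized) documents."""
    counts = Counter(tok for doc in unlabeled_docs for tok in doc)
    return [term for term, _ in counts.most_common(k)]


def tfidf_vectors(docs, vocabulary):
    """Represent tokenized docs as L2-normalized TF-IDF vectors over the vocabulary.

    Assumption: raw term frequency times a smoothed IDF,
    idf = ln((1 + N) / (1 + df)) + 1, followed by L2 normalization.
    """
    index = {term: i for i, term in enumerate(vocabulary)}
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(t for t in set(doc) if t in index)

    vectors = []
    for doc in docs:
        tf = Counter(tok for tok in doc if tok in index)
        vec = [0.0] * len(vocabulary)
        for term, freq in tf.items():
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            vec[index[term]] = freq * idf
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0:  # documents with no in-vocabulary terms stay all-zero
            vec = [x / norm for x in vec]
        vectors.append(vec)
    return vectors


# Toy data for illustration only.
unlabeled = [["good", "book", "good"], ["bad", "book"], ["good", "plot"]]
vocab = build_vocabulary(unlabeled, k=2)
vecs = tfidf_vectors([["good", "good", "book"], ["bad", "plot"]], vocab)
print(vocab, vecs)
```

With the table's setting the call would use `k=5000`; the toy example uses `k=2` only to keep the output readable.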
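Parameter | Value
Filter size | Because English and Japanese reviews are relatively long, the kernel sizes of the two convolutional layers are set to (7, 1) and (5, 1) for English and Japanese; because Chinese reviews are shorter, the kernel sizes of the two convolutional layers are set to (5, 1) and (3, 1).
Number of filters | 6 and 14, respectively
Word vector dimension | d = 100
Dropout | 0.3
Document representation dimension | 256
Learning rate | Initial value 0.1; the Adagrad optimizer assigns different learning rates to the parameters during training[22]
Initialization method | Xavier[23]
Activation function | tanh
Epochs | 50 epochs in the pre-training stage and 100 epochs in the supervised fine-tuning stage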
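Two of the settings above, Xavier initialization and Adagrad with an initial learning rate of 0.1, can be illustrated in a few lines. The sketch below uses the standard textbook forms of both techniques and is not taken from the authors' implementation; the helper names, the `eps` constant and the toy gradients are assumptions made for the example.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Xavier (Glorot) uniform init: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


class Adagrad:
    """Plain Adagrad: each parameter gets its own effective learning rate."""

    def __init__(self, n_params, lr=0.1, eps=1e-8):
        self.lr = lr
        self.eps = eps  # assumed small constant to avoid division by zero
        self.accum = [0.0] * n_params  # running sum of squared gradients

    def step(self, params, grads):
        for i, g in enumerate(grads):
            self.accum[i] += g * g
            params[i] -= self.lr * g / (math.sqrt(self.accum[i]) + self.eps)
        return params


rng = random.Random(0)
# Shapes chosen to echo the table: 100-d word vectors, 256-d document vectors.
weights = xavier_uniform(100, 256, rng)

params = [1.0, 1.0]
optimizer = Adagrad(n_params=2, lr=0.1)
optimizer.step(params, grads=[2.0, 0.5])  # toy gradients, illustration only
print(params)
```

On the first step every parameter moves by roughly the learning rate regardless of gradient magnitude; on later steps, parameters with a larger accumulated squared gradient take smaller steps, which is the per-parameter behavior the table refers to.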
${{L}_{S}}{{D}_{S}}-{{L}_{t}}{{D}_{t}}$ Upper DCI CDDRM
EB-ED 0.7755 0.7563 0.6114
EB-EM 0.7787 0.6340 0.5709
ED-EB 0.7868 0.7888 0.6269
ED-EM 0.7787 0.7858 0.6142
EM-EB 0.7868 0.7665 0.5774
EM-ED 0.7755 0.7813 0.6292
JB-JD 0.7653 0.7775 0.5745
JB-JM 0.7857 0.7575 0.6585
JD-JB 0.7401 0.7037 0.5912
JD-JM 0.7857 0.7757 0.6546
JM-JB 0.7401 0.6349 0.5904
JM-JD 0.7653 0.7790 0.6466
CB-CD 0.7330 0.7192 0.6979
CB-CM 0.7705 0.7825 0.6426
CD-CB 0.7900 0.7449 0.5965
CD-CM 0.7705 0.7590 0.6663
CM-CB 0.7900 0.7935 0.6456
CM-CD 0.7330 0.7578 0.7368
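The table can be summarized programmatically, for instance to find the best-scoring task for each method or to average scores by language. The sketch below copies the values from the table verbatim; the helper names are hypothetical and the grouping by language letter is just one convenient way to read the numbers.

```python
# Values copied from the table above: task, Upper, DCI, CDDRM.
TABLE = """
EB-ED 0.7755 0.7563 0.6114
EB-EM 0.7787 0.6340 0.5709
ED-EB 0.7868 0.7888 0.6269
ED-EM 0.7787 0.7858 0.6142
EM-EB 0.7868 0.7665 0.5774
EM-ED 0.7755 0.7813 0.6292
JB-JD 0.7653 0.7775 0.5745
JB-JM 0.7857 0.7575 0.6585
JD-JB 0.7401 0.7037 0.5912
JD-JM 0.7857 0.7757 0.6546
JM-JB 0.7401 0.6349 0.5904
JM-JD 0.7653 0.7790 0.6466
CB-CD 0.7330 0.7192 0.6979
CB-CM 0.7705 0.7825 0.6426
CD-CB 0.7900 0.7449 0.5965
CD-CM 0.7705 0.7590 0.6663
CM-CB 0.7900 0.7935 0.6456
CM-CD 0.7330 0.7578 0.7368
"""

METHODS = ("Upper", "DCI", "CDDRM")


def load(table):
    """Parse whitespace-separated rows into {task: {method: score}}."""
    rows = {}
    for line in table.strip().splitlines():
        task, *scores = line.split()
        rows[task] = dict(zip(METHODS, map(float, scores)))
    return rows


def best_task(rows, method):
    """Task with the highest score for the given method (first one on ties)."""
    return max(rows, key=lambda task: rows[task][method])


def mean_by_language(rows, method):
    """Average score per language, using the first letter of the task code."""
    groups = {}
    for task, scores in rows.items():
        groups.setdefault(task[0], []).append(scores[method])
    return {lang: sum(vals) / len(vals) for lang, vals in groups.items()}


rows = load(TABLE)
for method in METHODS:
    task = best_task(rows, method)
    print(method, task, rows[task][method], mean_by_language(rows, method))
```

For CDDRM the best task is CM-CD at 0.7368, the figure quoted in the abstract.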
${{L}_{S}}{{D}_{S}}-{{L}_{t}}{{D}_{t}}$ F1 ${{L}_{S}}{{D}_{S}}-{{L}_{t}}{{D}_{t}}$ F1
EB-ED 0.5787(3.27%) JB-JD 0.5703(0.42%)
EB-EM 0.5417(2.92%) JB-JM 0.6184(4.01%)
ED-EB 0.5937(3.37%) JD-JB 0.5714(1.98%)
ED-EM 0.6072(0.7%) JD-JM 0.6345(2.01%)
EM-EB 0.5673(1.01%) JM-JB 0.5473(4.31%)
EM-ED 0.5832(4.6%) JM-JD 0.6225(2.41%)
CB-CD 0.6327(6.52%) CD-CM 0.6360(3.03%)
CB-CM 0.6346(0.8%) CM-CB 0.6165(2.91%)
CD-CB 0.5873(0.92%) CM-CD 0.7141(2.27%)
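This page does not say what the parenthesized percentages in the table above measure. Arithmetically, they appear to equal the difference, in percentage points, between the CDDRM score in the preceding three-method table and the F1 listed here. That reading is an inference from the numbers only, and the sketch below simply checks it row by row; the variable names are hypothetical.

```python
# CDDRM column of the three-method table (copied verbatim).
CDDRM = {
    "EB-ED": 0.6114, "EB-EM": 0.5709, "ED-EB": 0.6269, "ED-EM": 0.6142,
    "EM-EB": 0.5774, "EM-ED": 0.6292, "JB-JD": 0.5745, "JB-JM": 0.6585,
    "JD-JB": 0.5912, "JD-JM": 0.6546, "JM-JB": 0.5904, "JM-JD": 0.6466,
    "CB-CD": 0.6979, "CB-CM": 0.6426, "CD-CB": 0.5965, "CD-CM": 0.6663,
    "CM-CB": 0.6456, "CM-CD": 0.7368,
}

# This table: task -> (F1, printed percentage), copied verbatim.
SECOND = {
    "EB-ED": (0.5787, 3.27), "EB-EM": (0.5417, 2.92), "ED-EB": (0.5937, 3.37),
    "ED-EM": (0.6072, 0.70), "EM-EB": (0.5673, 1.01), "EM-ED": (0.5832, 4.60),
    "JB-JD": (0.5703, 0.42), "JB-JM": (0.6184, 4.01), "JD-JB": (0.5714, 1.98),
    "JD-JM": (0.6345, 2.01), "JM-JB": (0.5473, 4.31), "JM-JD": (0.6225, 2.41),
    "CB-CD": (0.6327, 6.52), "CB-CM": (0.6346, 0.80), "CD-CB": (0.5873, 0.92),
    "CD-CM": (0.6360, 3.03), "CM-CB": (0.6165, 2.91), "CM-CD": (0.7141, 2.27),
}


def drop_in_points(task):
    """Difference between the two F1 values, in percentage points."""
    return (CDDRM[task] - SECOND[task][0]) * 100


# Rows where the printed percentage differs from the computed difference.
mismatches = [task for task, (_, printed) in SECOND.items()
              if abs(drop_in_points(task) - printed) > 0.01]
print(mismatches)
```

Under this reading, 17 of the 18 rows agree to two decimals. The exception is ED-EB, where 0.6269 minus 0.5937 gives 3.32 points against the printed 3.37%.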
Term | Top 5 related terms (relatedness)
unforgettable | hatred(0.583995), loyalty(0.570119), Gerard(0.565999), hostage(0.555899), streets(0.544472)
honesty | imaginative(0.493847), hours(0.470681), Their(0.458407), enjoyment(0.456892), fiction(0.446363)
famous | rising(0.499682), tv(0.48933), Douglas(0.470298), sister(0.468961), insane(0.465427)
concert | band(0.954735), interviews(0.869103), pink(0.831658), footage(0.773775), standup(0.762183)
worried | ric(0.581101), towns(0.544049), astonishing(0.541486), Harvey(0.54041), terrific(0.538101)
清澈 (limpid) | 动人(moving)(0.922085), 炙热(fervent)(0.900552), 沉淀(settling)(0.888065), 2009年(the year 2009)(0.868131), 目光(gaze)(0.858073)
终极 (ultimate) | 物超所值(great value for money)(0.778811), 蜕变(transformation)(0.773224), 李宇春(Li Yuchun)(0.763001), mini(0.756182), 性别(gender)(0.755634)
演义 (historical romance) | 封神(Fengshen)(0.952194), 毕竟(after all)(0.886214), 钢琴(piano)(0.882066), 外观(appearance)(0.876688), 栩栩如生(lifelike)(0.872374)
平凡 (ordinary) | 伦理(ethics)(0.96488), 烂(lousy)(0.765675), one(0.756881), 抄袭(plagiarism)(0.755563), 平淡(bland)(0.754948)
排版 (typesetting) | 伦理(ethics)(0.967311), one(0.943011), 文笔(writing style)(0.939356), 主角(protagonist)(0.934818), 好看(enjoyable)(0.931655)
ベストセラー (bestseller) | 作家(writer)(0.47599), 採用(adopt)(0.464169), ニューヨーク(New York)(0.440792), ワイヤー(wire)(0.42608), 西部(the West)(0.404709)
ジャンプ (Jump, the manga magazine) | 本誌(this magazine)(0.796235), 一切(entirely)(0.558918), かけ(core)(0.55353), 読者(reader)(0.540428), 脇(armpit)(0.498789)
BGM (light music) | 作曲(composition)(0.744679), 繰り広げる(unfold)(0.538222), ムービー(movie)(0.528672), 倒す(defeat)(0.504091), 一生(lifetime)(0.497166)
挿絵 (illustration) | 植物(plant)(0.710477), 繊細(delicate)(0.671644), 大好き(love)(0.613376), 描写(depiction)(0.598009), 冴え(vivid)(0.522175)
家族 (family) | 開始(start)(0.383957), 権威(authority)(0.376475), 兄(elder brother)(0.364545), 際(at the time of)(0.362299), 進展(progress)(0.356651)
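A top-5 list like the one above can be produced by ranking every other vocabulary term by its similarity to the query term's vector. The page does not state which relatedness measure was used; the sketch below assumes cosine similarity between learned word vectors, and the tiny two-dimensional vectors are made-up values for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query, embeddings, k=5):
    """Return the k terms most similar to `query`, excluding the query itself."""
    q = embeddings[query]
    scored = [(term, cosine(q, vec))
              for term, vec in embeddings.items() if term != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings, not values from the paper.
embeddings = {
    "concert": [1.0, 0.0],
    "band": [0.9, 0.1],
    "footage": [0.5, 0.5],
    "honesty": [0.0, 1.0],
}
neighbours = top_k("concert", embeddings, k=2)
print(neighbours)
```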
[1] Blitzer J, Dredze M, Pereira F. Domain Adaptation for Sentiment Classification[C]// Proceedings of Association for Computational Linguistics - ACL 2007. 2007.
[2] Denecke K. Are SentiWordNet Scores Suited for Multi-domain Sentiment Classification?[C]// Proceedings of International Conference on Digital Information Management. IEEE, 2009: 1-6.
[3] Li F, Pan S J, Jin O, et al. Cross-domain Co-extraction of Sentiment and Topic Lexicons[C]// Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers. 2012: 410-419.
[4] Bollegala D, Weir D, Carroll J. Using Multiple Sources to Construct a Sentiment Sensitive Thesaurus for Cross-domain Sentiment Classification[C]// Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2011: 132-141.
[5] Glorot X, Bordes A, Bengio Y. Domain Adaptation for Large-scale Sentiment Classification: A Deep Learning Approach[C]// Proceedings of the 28th International Conference on Machine Learning. 2011: 513-520.
[6] Ando R K, Zhang T. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data[J]. Journal of Machine Learning Research, 2005, 6(3): 1817-1853.
doi: 10.1002/cem.976
[7] Blitzer J, McDonald R, Pereira F. Domain Adaptation with Structural Correspondence Learning[C]// Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2006: 120-128.
[8] Pan S J, Yang Q. A Survey on Transfer Learning[J]. IEEE Transactions on Knowledge & Data Engineering, 2010, 22(10): 1345-1359.
doi: 10.1109/TKDE.2009.191
[9] Fernández A M, Esuli A, Sebastiani F. Distributional Correspondence Indexing for Cross-lingual and Cross-domain Sentiment Classification[J]. Journal of Artificial Intelligence Research, 2016, 55: 131-163.
doi: 10.1613/jair.4762
[10] Kim Y. Convolutional Neural Networks for Sentence Classification[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 2014: 1746-1751.
[11] Kalchbrenner N, Grefenstette E, Blunsom P. A Convolutional Neural Network for Modelling Sentences[OL]. arXiv Preprint, arXiv: 1404.2188.
[12] Collobert R, Weston J. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning[C]// Proceedings of the 25th International Conference on Machine Learning. 2008: 160-167.
[13] Gao J, Pantel P, Gamon M, et al. Modeling Interestingness with Deep Neural Networks[C]// Proceedings of Conference on Empirical Methods in Natural Language Processing. 2014: 2-13.
[14] Yan C, Zhang B, Coenen F. Driving Posture Recognition by Convolutional Neural Networks[C]// Proceedings of International Conference on Natural Computation. 2015: 680-685.
[15] Ngiam J, Koh P, Chen Z, et al. Sparse Filtering[C]// Proceedings of the Neural Information Processing Systems Conference. 2011: 1125-1133.
[16] Dahl G E, Ranzato M, Mohamed A R, et al. Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine[C]// Proceedings of the Neural Information Processing Systems Conference, British Columbia, Canada. DBLP, 2010: 469-477.
[17] Krizhevsky A, Sutskever I, Hinton G E. ImageNet Classification with Deep Convolutional Neural Networks[C]// Proceedings of International Conference on Neural Information Processing Systems. Curran Associates Inc, 2012: 1097-1105.
[18] Boser B E, Guyon I M, Vapnik V N. A Training Algorithm for Optimal Margin Classifiers[C]// Proceedings of the 5th Annual Workshop on Computational Learning Theory. 1996: 144-152.
[19] Tang Y. Deep Learning Using Linear Support Vector Machines[OL]. arXiv Preprint, arXiv: 1306.0239.
[20] Prettenhofer P, Stein B. Cross-Lingual Adaptation Using Structural Correspondence Learning[J]. ACM Transactions on Intelligent Systems & Technology, 2011, 3(1): Article No. 13.
doi: 10.1145/2036264.2036277
[21] Prettenhofer P, Stein B. Webis-cls-10 Dataset[OL].
[22] Duchi J, Hazan E, Singer Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization[J]. Journal of Machine Learning Research, 2011, 12(7): 2121-2159.
doi: 10.1109/TNN.2011.2146788
[23] Glorot X, Bengio Y. Understanding the Difficulty of Training Deep Feedforward Neural Networks[J]. Journal of Machine Learning Research, 2010, 9: 249-256.