Data Analysis and Knowledge Discovery, 2020, Vol. 4, Issue 5: 54-65. https://doi.org/10.11925/infotech.2096-3467.2019.1006
Research Paper
Extracting Product Properties with Dependency Relationship Embedding and Conditional Random Field*
Li Chengliang, Zhao Zhongying, Li Chao, Qi Liang, Wen Yan
College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
Abstract

[Objective] This paper designs multiple word representations based on dependency relationship embedding to capture the latent semantic features of words and improve the ability of a conditional random field (CRF) to extract product properties from reviews. [Methods] We propose a product property extraction method based on dependency relationship embedding and CRF. Using word attributes, dependency relationships between words, and their embeddings, we construct three types of word semantic information: basic, structural, and category semantic information. We then combine these three types with a CRF model to extract product properties. [Results] Compared with a CRF that uses no semantic information, the method that fuses all three types improves precision by 3.97 percentage points. Compared with existing representative models, it improves F1 by up to 7.65 percentage points. [Limitations] Sentiment words and properties are closely related, but the relationship between them in reviews is not explored in depth. [Conclusions] The proposed method effectively extracts properties from product reviews and lays a good foundation for property-based fine-grained sentiment analysis.

Keywords: Aspect Extraction; Dependency Relationship; Conditional Random Field; Comments Analysis; Relationship Embedding
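Received: 2019-09-05      Published: 2020-06-15
Chinese Library Classification (ZTFLH): TP393 G35
Funding: *This work is supported by a sub-project of the Key Program of the National Natural Science Foundation of China, "Complex Network Behavior Analysis in Big Data Environments" (61433012); the Natural Science Foundation of Shandong Province project "Multi-Scale Analysis of User Group Behavior in Dynamic Social Networks and Its Co-Evolution Mechanism with Network Topology" (ZR2018BF013); and the Humanities and Social Sciences Youth Foundation of the Ministry of Education project "Personalized User Modeling Based on Learner Behavior Mining in Big Data Environments" (17YJCZH262).
Corresponding author: Zhao Zhongying     E-mail: zzysuin@163.com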
Cite this article:
Li Chengliang, Zhao Zhongying, Li Chao, Qi Liang, Wen Yan. Extracting Product Properties with Dependency Relationship Embedding and Conditional Random Field[J]. Data Analysis and Knowledge Discovery, 2020, 4(5): 54-65.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.1006      or      http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I5/54
Fig.1  Framework of the DepREm-CRF model
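| Symbol | Meaning |
| --- | --- |
| $D$ | Dataset |
| $S_s$ | The $s$-th review in $D$ |
| $w_{sn}$ | The $n$-th word in the $s$-th review |
| $w_{sn}^{pos}$ | Part of speech of word $w_{sn}$ |
| $w_{sn}^{lemm}$ | Lemma of word $w_{sn}$ |
| $(w_{sm}, w_{sn})$ | A dependency relationship exists between words $w_{sm}$ and $w_{sn}$ |
| $w_{sn\_w}$ | Dependency relationship weight of word $w_{sn}$ |
| $G_s$ | Dependency graph of the $s$-th review, built from its dependency relationships |
| $SubSent_{si}$ | The $i$-th dependency sub-sentence of the $s$-th review, derived from $G_s$ |
| $e_{w_{sn}}$ | Dependency relationship word vector of word $w_{sn}$ |
| $C_{w_{sn}}^{1}$ | Cluster category of the dependency relationship word vector of word $w_{sn}$ |
| $b$ | Dependency category vector |
| $C_{w_{sn}}^{2}$ | Polysemy cluster category of word $w_{sn}$ |

Table 1  Main symbols and their meanings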
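
As a rough illustration of the notation in Table 1, the sketch below builds a dependency graph G_s for one review from (head, dependent) word pairs. The edge list is a hand-written, illustrative parse of the sample sentence that appears in Tables 3 and 4, not output from the authors' parser, and the helper name `build_dependency_graph` is hypothetical.

```python
from collections import defaultdict

# Hand-written (head, dependent) index pairs for the sample review
# "I love the operating system and the preloaded software".
# Illustrative only: a real parser may attach some words differently.
TOKENS = ["I", "love", "the", "operating", "system",
          "and", "the", "preloaded", "software"]
EDGES = [(1, 0), (1, 4), (4, 2), (4, 3), (4, 5), (4, 8), (8, 6), (8, 7)]


def build_dependency_graph(edges):
    """Return an undirected adjacency map {token index: set of neighbours}."""
    graph = defaultdict(set)
    for head, dependent in edges:
        graph[head].add(dependent)
        graph[dependent].add(head)
    return dict(graph)


graph = build_dependency_graph(EDGES)
for index in sorted(graph):
    neighbours = [TOKENS[j] for j in sorted(graph[index])]
    print(f"{TOKENS[index]:>10} -> {neighbours}")
```

How the dependency relationship weight and the dependency sub-sentences are derived from G_s is not given in this section, so the sketch stops at the graph itself.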
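| Semantic information type | Content |
| --- | --- |
| Basic semantic information | POS tags, lemmas, dependency relationship weights |
| Structural semantic information | Clusters of dependency relationship word vectors |
| Category semantic information | Semantic categories of words |

Table 2  Representation of word semantic information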
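
One plausible way to feed the three kinds of semantic information in Table 2 into a CRF is to turn each token into a feature dictionary, as is common in CRF toolkits. The sketch below assumes this encoding. The function name `token_features`, the feature keys, and all numeric values (dependency weight, cluster IDs) are hypothetical placeholders, not values from the paper.

```python
def token_features(token, pos, lemma, dep_weight, dep_cluster, sense_cluster):
    """Build one token's feature dict from the three information types."""
    return {
        # Basic semantic information
        "word.lower": token.lower(),
        "pos": pos,
        "lemma": lemma,
        "dep_weight": dep_weight,
        # Structural semantic information: cluster of the dependency embedding
        "dep_cluster": f"c{dep_cluster}",
        # Category semantic information: semantic category of the word
        "sense_cluster": f"s{sense_cluster}",
    }


# Placeholder inputs for two tokens of the sample review in Table 3.
# POS tags and lemmas follow Tables 3 and 4; the numbers are made up.
sentence = [
    ("operating", "VBG", "operate", 1, 3, 0),
    ("system", "NN", "system", 5, 3, 2),
]
features = [token_features(*row) for row in sentence]
for feats in features:
    print(feats)
```

A list of such dictionaries per sentence, paired with a label sequence of the same length, is the input shape that dictionary-based CRF libraries typically expect.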
| Type | Text |
| --- | --- |
| Original data | I love the operating system and the preloaded software |
| POS tags | PRP VBP DT VBG NN CC DT JJ NN |

Table 3  Example of POS tagging
| Type | Text |
| --- | --- |
| Original data | I love the operating system and the preloaded software |
| Lemmatized | I love the operate system and the load software |

Table 4  Example of lemmatization
Fig.2  BIO tagging of a product review sentence
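
For readers unfamiliar with the BIO scheme named in Fig.2: B marks the first word of a property, I marks a word inside it, and O marks everything else. The sketch below tags the sample review under the assumption that "operating system" and "preloaded software" are the annotated properties (the gold annotation of this sentence is not shown in this section). `to_bio` and `from_bio` are hypothetical helpers.

```python
def to_bio(tokens, spans):
    """Convert (start, end) token spans, end exclusive, to BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def from_bio(tokens, tags):
    """Recover the property phrases from a BIO tag sequence."""
    phrases, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


tokens = "I love the operating system and the preloaded software".split()
tags = to_bio(tokens, [(3, 5), (7, 9)])  # assumed property spans
print(tags)
print(from_bio(tokens, tags))
```

A sequence labeler such as a CRF predicts one of these tags per token, and the decoding step turns the predicted tags back into property phrases.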
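| Dataset | Training set size (reviews) | Test set size (reviews) | Number of properties |
| --- | --- | --- | --- |
| L-14 | 3,045 | 800 | 3,012 |
| R-15 | 1,315 | 685 | 2,499 |
| R-16 | 2,000 | 676 | 3,367 |
| Yelp | 800,000 | 200,000 | 5,867,511 |

Table 5  Datasets used to train and test the DepREm-CRF model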
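| Model | P (%) | R (%) | F1 (%) |
| --- | --- | --- | --- |
| CRF | 83.89 | 69.42 | 75.97 |
| CRF+basic | 87.02 | 76.73 | 81.55 |
| CRF+structural | 87.48 | 76.48 | 81.61 |
| CRF+category | 86.66 | 76.19 | 81.09 |
| DepREm-CRF | 87.86 | 78.31 | 82.81 |

Table 6  Effect of different semantic information on the DepREm-CRF model (L-14 dataset)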
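
The F1 column in Table 6 can be checked against the P and R columns, since F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes it for each row. The dictionary layout and the function name are just for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) on L-14, copied from Table 6.
TABLE_6 = {
    "CRF": (83.89, 69.42, 75.97),
    "CRF+basic": (87.02, 76.73, 81.55),
    "CRF+structural": (87.48, 76.48, 81.61),
    "CRF+category": (86.66, 76.19, 81.09),
    "DepREm-CRF": (87.86, 78.31, 82.81),
}

for model, (p, r, reported) in TABLE_6.items():
    print(f"{model:15s} computed F1 = {f1_score(p, r):.2f}, reported = {reported}")

# Precision gain of the full model over the plain CRF, in percentage points.
precision_gain = round(TABLE_6["DepREm-CRF"][0] - TABLE_6["CRF"][0], 2)
print("precision gain:", precision_gain)
```

All five recomputed values agree with the reported F1 to two decimals, and the precision gain over the plain CRF matches the 3.97 figure quoted in the abstract.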
| Property category | Model | Extracted property terms |
| --- | --- | --- |
| High-frequency | CRF+basic | price/features/performance/OS/screen/operating system/USB ports/hard drive/speed |
| High-frequency | CRF+structural | price/features/performance/OS/screen/operating system/USB ports/hard drive/speed/battery life/works |
| High-frequency | CRF+category | price/features/performance/OS/screen/operating system/USB ports/hard drive |
| High-frequency | DepREm-CRF | price/features/performance/OS/screen/operating system/USB ports/hard drive/speed/battery life/works/retina display |
| Low-frequency | CRF+basic | battery/Keyboard/itune/screen display/configure/components |
| Low-frequency | CRF+structural | battery/Keyboard/itune/screen display/configure/components/Microsoft Windows/Microsoft Office |
| Low-frequency | CRF+category | battery/Keyboard/itune/screen display |
| Low-frequency | DepREm-CRF | battery/Keyboard/itune/screen display/configure/components/Microsoft Windows/Microsoft Office/aluminum casing/Screen resolution |

Table 7  Examples of properties extracted with different semantic information
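| Model | L-14 | R-15 | R-16 | Yelp |
| --- | --- | --- | --- | --- |
| BiLSTM+CRF | 80.57 | 70.83 | 74.49 | 80.45 |
| Unsupervised-CRF | 75.16 | 69.73 | - | - |
| DE-CNN | 81.59 | - | 74.37 | - |
| MFE-CRF | 76.53 | 70.31 | 73.81 | 79.38 |
| DepREm-CRF | 82.81 | 71.96 | 74.67 | 84.29 |

Table 8  Comparison of the DepREm-CRF model with other models (F1, %)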
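
To make the comparison in Table 8 easier to read, the sketch below computes, for each dataset, the gap between DepREm-CRF and every baseline that reports a score. Missing entries ("-" in the table) are stored as `None`. The variable and function names are illustrative.

```python
# F1 (%) copied from Table 8; None marks a score the table leaves blank.
DATASETS = ["L-14", "R-15", "R-16", "Yelp"]
BASELINES = {
    "BiLSTM+CRF": [80.57, 70.83, 74.49, 80.45],
    "Unsupervised-CRF": [75.16, 69.73, None, None],
    "DE-CNN": [81.59, None, 74.37, None],
    "MFE-CRF": [76.53, 70.31, 73.81, 79.38],
}
DEPREM_CRF = [82.81, 71.96, 74.67, 84.29]


def f1_gains(baselines, ours):
    """Return {dataset: {baseline: gain in percentage points}}."""
    gains = {}
    for col, dataset in enumerate(DATASETS):
        gains[dataset] = {
            name: round(ours[col] - scores[col], 2)
            for name, scores in baselines.items()
            if scores[col] is not None
        }
    return gains


gains = f1_gains(BASELINES, DEPREM_CRF)
for dataset, per_model in gains.items():
    best = max(per_model, key=per_model.get)
    print(f"{dataset}: largest gain {per_model[best]} over {best}")
```

The largest gap is on L-14 against Unsupervised-CRF, which corresponds to the 7.65 figure quoted in the abstract. Every reported baseline score is below the DepREm-CRF score on the same dataset.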
Fig.3  Effect of different types and sizes of additional corpora on property extraction accuracy