Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (7): 23-33    DOI: 10.11925/infotech.2096-3467.2018.0898
Matching Book Reviews and Essential Sentiment Lexicons with Chinese Word Segmenters
Zhongxi You1,2, Weina Hua1, Xuelian Pan1
1(School of Information Management, Nanjing University, Nanjing 210023, China)
2(School of Education Science, Nantong University, Nantong 226007, China)
Abstract  

[Objective] This paper compares how different Chinese word segmenters affect the degree of matching between a corpus and sentiment lexicons. [Methods] We processed a self-built corpus of book reviews with six Chinese word segmenters and filtered the results with four sentiment lexicons. We then calculated the coverage and matching of the corpus against each sentiment lexicon, the negation word list, and the degree word list. Finally, we computed the proportion of neutral text in the corpus and the proportion of low-frequency words in the lexicons. [Results] Across the sentiment lexicons, the segmenters produced different results for corpus-lexicon matching, the proportion of low-frequency words in the lexicons, and the proportion of neutral text in the corpus. [Limitations] The corpus needs to be enlarged, and sentence-level and rule-based tests need to be added. [Conclusions] The choice of word segmenter has a significant impact on the matching between the corpus and sentiment lexicons.
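
This page does not preserve the paper's exact formulas, so the following is only a minimal sketch of how such corpus-lexicon statistics might be computed. It assumes that coverage is the share of lexicon entries that occur as tokens in the segmented corpus, that a review counts as neutral when none of its tokens is in the lexicon, and that a low-frequency word is a matched lexicon entry occurring at most once. The function names, the toy reviews, and the toy lexicon are all hypothetical.

```python
from collections import Counter


def lexicon_coverage(reviews, lexicon):
    """Share of lexicon entries that appear as a token in the corpus (assumed definition)."""
    vocabulary = {token for review in reviews for token in review}
    return len(lexicon & vocabulary) / len(lexicon)


def neutral_ratio(reviews, lexicon):
    """Share of reviews that contain no lexicon token (assumed definition)."""
    neutral = sum(1 for review in reviews if not lexicon.intersection(review))
    return neutral / len(reviews)


def low_frequency_ratio(reviews, lexicon, max_count=1):
    """Share of matched lexicon entries occurring at most max_count times (assumed definition)."""
    counts = Counter(
        token for review in reviews for token in review if token in lexicon
    )
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c <= max_count) / len(counts)


# Toy, already-segmented reviews (hypothetical; a real run would use a segmenter's output).
reviews = [
    ["这", "本", "书", "很", "精彩"],
    ["内容", "枯燥", "，", "不", "推荐"],
    ["昨天", "收到", "书"],
    ["故事", "精彩", "，", "值得", "推荐"],
]
lexicon = {"精彩", "枯燥", "推荐", "失望"}

coverage = lexicon_coverage(reviews, lexicon)
neutral = neutral_ratio(reviews, lexicon)
low_freq = low_frequency_ratio(reviews, lexicon)
print(f"coverage={coverage:.2f} neutral={neutral:.2f} low_freq={low_freq:.2f}")
```

Running the same functions on the output of each segmenter in turn would give one row of figures per segmenter, which is the kind of comparison the abstract describes.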

Key words: Chinese Word Segmenter; Sentiment Lexicon; Sentiment Analysis
Received: 13 August 2018      Published: 06 September 2019
CLC Number: TP391 G35
Corresponding Authors: Zhongxi You     E-mail: dafuh@163.com

Cite this article:

Zhongxi You, Weina Hua, Xuelian Pan. Matching Book Reviews and Essential Sentiment Lexicons with Chinese Word Segmenters. Data Analysis and Knowledge Discovery, 2019, 3(7): 23-33.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.0898     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I7/23

Segmenter  Full name                                                    Version  URL
NLPIR      ICTCLAS/NLPIR Chinese Word Segmentation System               2015     http://ictclas.nlpir.org/
Jieba      "Jieba" Chinese Word Segmentation                            0.39     https://github.com/fxsjy/jieba
HITLTP*    Harbin Institute of Technology Language Technology Platform  3.4.0    https://www.ltp-cloud.com/
THULAC     THU Lexical Analyzer for Chinese                             2017     http://thulac.thunlp.org/
HanLP      Han Language Processing package                              1.6.4    https://github.com/hankcs/HanLP
SFNLP*     Stanford NLP Chinese Word Segmenter                          3.9.1    https://nlp.stanford.edu/software/segmenter.shtml

Lexicon  URL                                     Positive words  Negative words  Total   Overlapping words
HowNet   http://www.keenage.com/                 4,528           4,320           8,746   102
NTUSD    http://academiasinicanlplab.github.io/  2,647           7,741           10,339  49
DLUTEO   http://ir.dlut.edu.cn/                  13,505          13,866          27,351  20
THULJ    http://nlp.csai.tsinghua.edu.cn/site2/  5,567           4,468           10,034  1

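
The total for each lexicon appears to count distinct words: in every row it equals the positive count plus the negative count minus the overlapping words. This relation is inferred from the figures rather than stated on this page, so the check below is only a sketch, and the helper name `unique_entries` is hypothetical.

```python
# Counts transcribed from the lexicon table: (positive, negative, total, overlapping).
lexicons = {
    "HowNet": (4528, 4320, 8746, 102),
    "NTUSD": (2647, 7741, 10339, 49),
    "DLUTEO": (13505, 13866, 27351, 20),
    "THULJ": (5567, 4468, 10034, 1),
}


def unique_entries(positive, negative, overlap):
    """Distinct words when `overlap` words occur in both polarity lists."""
    return positive + negative - overlap


# For each lexicon, check whether the reported total matches the inferred relation.
consistent = {
    name: unique_entries(pos, neg, overlap) == total
    for name, (pos, neg, total, overlap) in lexicons.items()
}
print(consistent)
```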
Lexicon   NLPIR   Jieba   HITLTP  THULAC  HanLP   SFNLP
HowNet    62.76   80.18   73.63   75.22   72.74   77.87
NTUSD     40.35   51.52   48.37   46.82   45.99   54.44
DLUTEO    39.24   53.46   46.74   51.10   45.67   48.96
THULJ     71.80   84.02   82.28   81.73   78.81   83.21
Overall   38.20   52.44   46.73   49.35   45.00   49.96

Lexicon   NLPIR   Jieba   HITLTP  THULAC  HanLP   SFNLP
HowNet    12.35   3.77    6.85    6.52    7.12    2.24
NTUSD     12.10   4.91    6.15    7.86    7.56    5.10
DLUTEO    3.23    0.35    1.58    1.52    0.48    0.14
THULJ     1.89    -1.40   0.44    0.26    -1.02   -0.52
Overall   8.26    5.26    6.04    6.98    5.65    3.82

Lexicon   NLPIR   Jieba   HITLTP  THULAC  HanLP   SFNLP
HowNet    9.30    4.48    7.04    6.36    4.91    6.80
NTUSD     1.14    0.28    0.65    1.30    0.41    0.59
DLUTEO    7.87    4.27    6.33    5.37    4.65    6.20
THULJ     5.99    3.48    4.46    4.67    3.88    4.37
Overall   6.46    3.52    5.17    4.64    3.81    5.07

Lexicon   NLPIR   Jieba   HITLTP  THULAC  HanLP   SFNLP
HowNet    -3.62   -0.11   -1.78   -1.37   -0.31   0.64
NTUSD     -0.73   -0.21   -0.51   -0.99   -0.19   -0.20
DLUTEO    -2.65   -0.80   -1.72   -1.49   -1.01   -0.75
THULJ     -4.50   -1.70   -2.78   -2.44   -2.07   -2.22
Overall   -1.03   0.04    -0.41   -0.48   -0.03   0.44

Lexicon   NLPIR   Jieba   HITLTP  THULAC  HanLP   SFNLP
HowNet    9.95    18.34   19.67   19.84   13.91   20.79
NTUSD     3.79    9.69    12.64   12.23   5.78    13.01
DLUTEO    14.98   26.27   24.67   28.41   19.08   24.95
THULJ     11.52   17.95   19.50   20.64   14.31   17.66
Overall   13.50   24.40   24.41   26.97   17.57   24.48

Lexicon   NLPIR   Jieba   HITLTP  THULAC  HanLP   SFNLP
HowNet    -0.47   -4.59   -4.39   -4.25   -2.78   -4.81
NTUSD     -2.13   -7.15   -9.51   -8.59   -4.05   -6.32
DLUTEO    -0.70   -1.71   -2.67   -2.44   -2.49   -2.67
THULJ     0.45    0.47    -0.39   0.19    -0.35   -0.07
Overall   0.67    -0.91   0.56    -1.81   -0.80   -1.88

Word list       NLPIR   Jieba   HITLTP  THULAC  HanLP   SFNLP
Negation words  89.74   97.44   89.74   61.54   94.87   100.00
Degree adverbs  78.97   90.65   85.05   78.50   84.58   93.93

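
How a segmenter draws token boundaries can decide whether a negation word or degree adverb is matched at all. The sketch below is illustrative only: the two segmentations are hand-written stand-ins rather than output from any of the segmenters above, the word lists are tiny hypothetical samples, and the match rate is assumed to be the share of list entries that occur as standalone tokens.

```python
def match_rate(tokens, word_list):
    """Share of word-list entries that occur as standalone tokens (assumed definition)."""
    return len(word_list & set(tokens)) / len(word_list)


# Tiny hypothetical samples of a negation word list and a degree adverb list.
negation_words = {"不", "没有"}
degree_words = {"很", "非常"}

# Two hypothetical segmentations of the same text: "不喜欢这本书，没有新意，很失望".
fine_grained = ["不", "喜欢", "这", "本", "书", "，", "没有", "新意", "，", "很", "失望"]
coarse_grained = ["不喜欢", "这本书", "，", "没有", "新意", "，", "很失望"]

fine_neg = match_rate(fine_grained, negation_words)
coarse_neg = match_rate(coarse_grained, negation_words)
fine_deg = match_rate(fine_grained, degree_words)
coarse_deg = match_rate(coarse_grained, degree_words)
print(fine_neg, coarse_neg, fine_deg, coarse_deg)
```

In this toy case the coarser segmentation merges "不" and "很" into longer tokens, so both lists match fewer entries even though the underlying text is identical.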
Lexicon   NLPIR   Jieba   HITLTP  THULAC  HanLP   SFNLP
HowNet    31.64   35.39   33.77   32.45   33.69   35.13
NTUSD     36.16   35.46   36.22   38.32   34.85   35.95
DLUTEO    34.49   34.29   34.46   37.74   34.21   34.70
THULJ     33.64   34.39   33.56   36.73   33.98   33.98
Overall   16.86   18.17   17.84   18.74   16.91   18.43

Lexicon   NLPIR   Jieba   HITLTP  THULAC  HanLP   SFNLP
HowNet    -25.54  -26.15  -27.62  -20.80  -26.16  -26.47
NTUSD     -16.05  -14.02  -16.09  -17.50  -14.02  -14.60
DLUTEO    -24.35  -23.14  -24.33  -22.99  -23.59  -24.22
THULJ     -26.95  -26.34  -26.83  -26.06  -26.36  -26.64
Overall   -21.68  -21.24  -22.99  -19.99  -21.04  -21.49
