Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (9): 77-85    DOI: 10.11925/infotech.2096-3467.2021.1369
CNN-SM: Identifying Words on Defective Products with Sememe and Multi-features
You Xindong, Yuan Menglong, Zhang Le, Lv Xueqiang
Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China
Abstract  

[Objective] This paper proposes a CNN model based on sememes and multiple features to improve the accuracy of recognizing domain words that describe defective consumer products. [Methods] First, we built the model's input from distributed word vectors fused with sememe information. Then, we added part-of-speech features and randomly initialized word position vectors to the input. Finally, we removed the max pooling layer so that the deep vectors output by the convolution kernels retain more information for word classification. [Results] Compared with the CNN model that only adds word position vectors, the proposed method improved precision, recall and F1 by 0.021, 0.002 and 0.012, respectively. [Limitations] The model still needs to better recognize the polarity of the same expression in different scenarios. [Conclusions] Sememes, part-of-speech features and the removal of the pooling layer can improve the model's performance in domain word recognition.

Keywords: Consumer Product; Domain Words; Sememe; Word Vector; CNN
Received: 02 December 2021      Published: 26 October 2022
ZTFLH:  TP391  
Fund: Natural Science Foundation of Beijing (4212020); National Natural Science Foundation of China (62171043); President Foundation of China National Institute of Standardization (282020Y-7511)
Corresponding Author: Zhang Le, ORCID: 0000-0002-9620-511X, E-mail: zhangle@bistu.edu.cn

Cite this article:

You Xindong, Yuan Menglong, Zhang Le, Lv Xueqiang. CNN-SM: Identifying Words on Defective Products with Sememe and Multi-features. Data Analysis and Knowledge Discovery, 2022, 6(9): 77-85.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.1369     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I9/77

Example of Sememes, Senses and Words
SAT Word Vector
Model Structure
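Data sample | Dataset | Defect entity words | Defect description words
为了连接手机充电了,但是刚拔下来的时候有点烫手,明天开始用了。 (Charged it to connect to the phone, but it was a bit hot to the touch right after unplugging; will start using it tomorrow.) | A | - | 烫手 (hot to the touch)
充电器插头给充电宝充电发热发烫,不知道咋回事。 (The charger plug heats up and gets scalding hot when charging the power bank; no idea why.) | A | 充电宝 (power bank) | 发热, 发烫 (heats up, scalding hot)
冷柜冷冻室无法冷冻,产品不结冰,食物无法冷冻,经销商态度蛮横。 (The freezer compartment of the freezer cabinet cannot freeze; the product does not form ice, food cannot be frozen, and the dealer was rude.) | B | 冷柜, 冷冻室 (freezer cabinet, freezer compartment) | 无法冷冻 (cannot freeze)
买来两周时间,中间因为出差,前后就没用几次,今天突然发现壶底已出锈迹,这质量害我喝了这么久的锈水。 (Bought two weeks ago and used only a few times because of a business trip; today I suddenly found rust on the bottom of the kettle. This quality had me drinking rusty water all this time.) | B | 壶底 (kettle bottom) | 锈迹 (rust stains)
吹风机到啦!没有吵人的声音,吹出来的风好舒服,头发干的也快,质量也很好。 (The hair dryer has arrived! No annoying noise, the airflow is comfortable, hair dries quickly, and the quality is good.) | B | - | -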
Corpus Samples and Manual Domain Words
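Parameter | Value
Convolution kernel sizes | [2,3,4]
Input channels | 1
Output channels | 32
Word vector dimension | 300
Position embedding dimension | 30
Epochs | 8
Learning rate | 0.001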
Hyperparameter
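The abstract states that max pooling was removed so that the convolution output carries more information. As a rough illustration only, the sketch below counts how many features the convolution layer would emit with and without max pooling, using the kernel sizes and output channels from the table above. The stride-1, no-padding convolution, the example sequence length and the function name `feature_counts` are assumptions for illustration and are not taken from the paper.

```python
# Hypothetical shape arithmetic; only KERNEL_SIZES and OUT_CHANNELS
# come from the hyperparameter table.
KERNEL_SIZES = [2, 3, 4]
OUT_CHANNELS = 32


def feature_counts(seq_len, kernel_sizes=KERNEL_SIZES, out_channels=OUT_CHANNELS):
    """Return (features_with_max_pooling, features_without_pooling).

    Assumes a stride-1 convolution with no padding, so a kernel of
    height k yields seq_len - k + 1 positions per output channel.
    """
    if seq_len < max(kernel_sizes):
        raise ValueError("sequence is shorter than the largest kernel")
    # Max pooling keeps one value per channel per kernel size.
    pooled = out_channels * len(kernel_sizes)
    # Without pooling, every position of every feature map is kept.
    unpooled = sum(out_channels * (seq_len - k + 1) for k in kernel_sizes)
    return pooled, unpooled


pooled, unpooled = feature_counts(seq_len=10)
print(f"with max pooling: {pooled} features, without: {unpooled} features")
```

Under these assumptions the unpooled output grows with the input length, which is one way to read the claim that removing pooling preserves more information for the classifier.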
Model | Precision | Recall | F1
LSTM | 0.845 | 0.750 | 0.794
LSTM+WPE | 0.831 | 0.769 | 0.799
CNN-PD | 0.852 | 0.831 | 0.841
CNN-SM | 0.873 | 0.833 | 0.853
The Results of Experiment
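The F1 values in the table can be cross-checked against the standard definition of F1 as the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes F1 from the reported precision and recall and derives the CNN-SM gains over CNN-PD quoted in the abstract. The names `f1_score`, `results` and `gain` are illustrative; because the reported precision and recall are already rounded to three decimals, a recomputed F1 may differ from the reported one in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), copied from the results table.
results = {
    "LSTM": (0.845, 0.750, 0.794),
    "LSTM+WPE": (0.831, 0.769, 0.799),
    "CNN-PD": (0.852, 0.831, 0.841),
    "CNN-SM": (0.873, 0.833, 0.853),
}

for name, (p, r, reported) in results.items():
    print(f"{name}: recomputed F1 = {f1_score(p, r):.3f}, reported = {reported:.3f}")

# Gains of CNN-SM over CNN-PD in precision, recall and F1.
gain = tuple(round(a - b, 3) for a, b in zip(results["CNN-SM"], results["CNN-PD"]))
print("CNN-SM minus CNN-PD:", gain)
```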
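Item | Data sample
Original sentence | 充电口断裂 (The charging port is broken)
Segmentation | 充电 口 断裂
CNN-SM | B-ENY I-ENY B-DET
CNN-PD | B-ENY B-ENY B-DET
Original sentence | 质量不好太臭了。 (Poor quality, smells terrible.)
Segmentation | 质量 不好 太 臭
CNN-SM | B-DET I-DET B-DET I-DET
CNN-PD | O O B-DET I-DET
Original sentence | 物流挺快,风力很大,有轻微塑料味 (Delivery was fast, the airflow is strong, there is a slight plastic smell)
Segmentation | 物流 挺快 风力 很大 有 轻微 塑料 味
CNN-SM | O O B-DET I-DET O O B-DET I-DET
CNN-PD | O O B-DET O O O B-DET I-DET
Original sentence | 产品挺漂亮的,唯一不满的就是盖子扣不严,店家说是就是这样的 以前用的老式的还挺严实。不理解 (The product looks quite nice; my only complaint is that the lid does not close tightly. The seller says that is how it is, but the old-style one I used before sealed well. I do not understand.)
Segmentation | 产品 挺 漂亮 唯一 不满 盖子 扣 不严 店家 说 以前 用 <unk> 挺 严实 不 理解
CNN-SM | O O O O O B-ENY B-DET I-DET O O O O O O O O O
CNN-PD | O O O O O B-ENY O I-DET O O O O O O O O O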
The Results of Experiment
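The tag sequences in the comparison above follow a B/I/O scheme, where ENY and DET presumably mark defect entity words and defect description words. To make the difference between the two models concrete, the sketch below groups tagged tokens into labeled spans. The function `extract_spans` and its lenient handling of a stray I- tag (treated as the start of a new span) are assumptions for illustration, not the paper's decoding procedure.

```python
def extract_spans(tokens, tags):
    """Group tokens into (label, text) spans from B-/I-/O tags.

    An I- tag that does not continue a span with the same label is
    treated as the start of a new span (a lenient convention).
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    spans = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == current_label:
            current_tokens.append(token)  # continue the open span
            continue
        if current_label is not None:
            spans.append((current_label, "".join(current_tokens)))
        if prefix in ("B", "I"):
            current_label, current_tokens = label, [token]  # open a new span
        else:
            current_label, current_tokens = None, []  # "O": no open span
    if current_label is not None:
        spans.append((current_label, "".join(current_tokens)))
    return spans


tokens = ["充电", "口", "断裂"]
print("CNN-SM:", extract_spans(tokens, ["B-ENY", "I-ENY", "B-DET"]))
print("CNN-PD:", extract_spans(tokens, ["B-ENY", "B-ENY", "B-DET"]))
```

On the first example, the CNN-SM tags yield the single entity 充电口 (charging port), while the CNN-PD tags split it into two separate entities, 充电 and 口.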
Extra features | Word vector | Pooling | Precision | Recall | F1
WPE | Rand | MAX | 0.852 | 0.831 | 0.841
WPE | Rand | none | 0.862 | 0.830 | 0.846
WPE | SAT | MAX | 0.847 | 0.825 | 0.836
WPE | SAT | none | 0.872 | 0.831 | 0.851
WPE+POS | Rand | MAX | 0.849 | 0.835 | 0.842
WPE+POS | Rand | none | 0.862 | 0.821 | 0.841
WPE+POS | SAT | MAX | 0.843 | 0.830 | 0.836
WPE+POS | SAT | none | 0.873 | 0.833 | 0.853
Ablation Experiment
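One way to read the ablation table is to compare each configuration with and without max pooling. The sketch below computes the F1 change from removing pooling for every combination of extra features and word vector type, using only the F1 values reported above. The dictionary layout and the helper `pooling_removal_gain` are illustrative choices.

```python
# F1 values from the ablation table, keyed by
# (extra features, word vector, pooling); "none" means no pooling layer.
f1 = {
    ("WPE", "Rand", "MAX"): 0.841,
    ("WPE", "Rand", "none"): 0.846,
    ("WPE", "SAT", "MAX"): 0.836,
    ("WPE", "SAT", "none"): 0.851,
    ("WPE+POS", "Rand", "MAX"): 0.842,
    ("WPE+POS", "Rand", "none"): 0.841,
    ("WPE+POS", "SAT", "MAX"): 0.836,
    ("WPE+POS", "SAT", "none"): 0.853,
}


def pooling_removal_gain(features, vector):
    """F1 without pooling minus F1 with max pooling, rounded to 3 decimals."""
    return round(f1[(features, vector, "none")] - f1[(features, vector, "MAX")], 3)


for features in ("WPE", "WPE+POS"):
    for vector in ("Rand", "SAT"):
        gain = pooling_removal_gain(features, vector)
        print(f"{features} + {vector}: F1 change from removing pooling = {gain:+.3f}")
```

In these figures the F1 gain from removing pooling is larger for the SAT word vectors than for the randomly initialized ones.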