CNN-SM: A Defect Word Recognition Model for the Consumer Product Domain Based on Sememes and Multi-feature Fusion<sup>*</sup>

doi:10.11925/infotech.2096-3467.2021.1369

Data Analysis and Knowledge Discovery, 2022, Vol. 6, Issue (9): 77-85 https://doi.org/10.11925/infotech.2096-3467.2021.1369

Research Paper

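You Xindong (游新冬), Yuan Menglong (袁梦龙), Zhang Le (张乐), Lv Xueqiang (吕学强)

Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University (北京信息科技大学网络文化与数字传播北京市重点实验室), Beijing 100101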

CNN-SM: Identifying Words on Defective Products with Sememe and Multi-features

You Xindong, Yuan Menglong, Zhang Le, Lv Xueqiang

Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China

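Abstract

[Objective] To address the insufficient accuracy of defect word recognition in the consumer product domain, this paper proposes a defect word recognition model based on sememes and multi-feature fusion. [Methods] The model takes distributed word vectors fused with sememe information as input and adds part-of-speech features and randomly embedded word position vectors to enrich the information carried by the word vectors. In the convolutional neural network, max pooling is removed, which increases the information contained in the depth vectors output by the convolution kernels and provides fuller information for word classification. [Results] Experiments show that, compared with a convolutional neural network model that only adds word position vectors, the proposed model improves precision, recall, and F1 by 0.021, 0.002, and 0.012, respectively. [Limitations] The model is weak at recognizing the polarity of identical expressions in different scenarios. [Conclusions] Ablation experiments show that sememes, part-of-speech features, and removing the pooling layer all help improve the performance of the domain word recognition model.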
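
The method summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the dimensions, the random stand-in for the sememe-fused word vectors, the use of the absolute token index as the position feature, and the helper names `make_table`, `fuse_features`, and `conv_no_pooling` are all assumptions made for illustration. The sketch shows only two ideas: concatenating word, part-of-speech, and position vectors per token, and keeping every convolution response instead of reducing them with max pooling.

```python
import random


def make_table(rows, dim, rng):
    """Randomly initialised table of `rows` vectors with `dim` values each."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


def fuse_features(word_vecs, pos_ids, pos_table, position_table):
    """Concatenate word vector, POS embedding, and position embedding per token."""
    return [
        wv + pos_table[p] + position_table[i]  # list concatenation
        for i, (wv, p) in enumerate(zip(word_vecs, pos_ids))
    ]


def conv_no_pooling(tokens, kernels, width):
    """Valid 1-D convolution with ReLU over windows of `width` tokens.

    Every window's response to every kernel is kept (flattened), rather than
    being reduced to one value per kernel by max pooling.
    """
    out = []
    for start in range(len(tokens) - width + 1):
        window = [x for tok in tokens[start:start + width] for x in tok]
        for kernel in kernels:
            out.append(max(0.0, sum(w * x for w, x in zip(kernel, window))))
    return out


rng = random.Random(0)
seq_len, word_dim, pos_dim, loc_dim = 5, 8, 3, 2       # hypothetical sizes
word_vecs = make_table(seq_len, word_dim, rng)         # stand-in for sememe-fused vectors
pos_table = make_table(4, pos_dim, rng)                # 4 hypothetical POS tags
position_table = make_table(seq_len, loc_dim, rng)     # randomly embedded positions
pos_ids = [0, 1, 2, 1, 3]                              # hypothetical POS tag per token

tokens = fuse_features(word_vecs, pos_ids, pos_table, position_table)
width, n_kernels = 3, 4
kernels = make_table(n_kernels, width * len(tokens[0]), rng)
features = conv_no_pooling(tokens, kernels, width)

print(len(tokens[0]), len(features))  # 13 12
```

With max pooling, the four kernels would yield only 4 values; without it, all 3 windows times 4 kernels (12 values) reach the classifier, which is the sense in which removing pooling keeps more information.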

Keywords: consumer products, domain words, sememe, word vector, convolutional neural network

Abstract:

[Objective] This paper proposes a CNN model based on sememes and multi-feature fusion, aiming to improve the recognition accuracy of defect words in the consumer product domain. [Methods] First, we built the model's input from distributed word vectors fused with sememe information. Then, we added part-of-speech features and randomly embedded word position vectors to the input. Finally, we removed max pooling from the CNN so that the depth vectors output by the convolution kernels retain more information for word classification. [Results] Compared with a CNN model that only adds word position vectors, the proposed model improved precision, recall, and F1 by 0.021, 0.002, and 0.012, respectively. [Limitations] The polarity of identical expressions in different scenarios is not yet recognized well. [Conclusions] Sememes, part-of-speech features, and the removal of the pooling layer improve the performance of the model for domain word recognition.
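
For readers unfamiliar with the three reported metrics, the following sketch shows how precision, recall, and F1 are commonly computed for a word identification task. It assumes a set-based comparison of predicted and gold defect words; the paper's exact evaluation protocol is not given on this page, and the example words and the function name `precision_recall_f1` are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 for identified words."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 4 words predicted as defect words, 3 gold defect words.
predicted_words = ["crack", "leak", "rust", "noise"]
gold_words = ["crack", "leak", "overheat"]

precision, recall, f1 = precision_recall_f1(predicted_words, gold_words)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.5 0.667 0.571
```

Because F1 is the harmonic mean of precision and recall, a gain concentrated in precision (0.021) with almost unchanged recall (0.002) produces an intermediate F1 gain, which is consistent with the reported 0.012.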

Key words: Consumer Product, Domain Words, Sememe, Word Vector, CNN

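Received: 2021-12-02 Published: 2022-10-26

CLC number (ZTFLH): TP391

Funding: <sup>*</sup>Beijing Natural Science Foundation (4212020); National Natural Science Foundation of China (62171043); President's Fund of the China National Institute of Standardization (282020Y-7511)

Corresponding author: Zhang Le, ORCID: 0000-0002-9620-511X, E-mail: zhangle@bistu.edu.cn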

Cite this article:

You Xindong, Yuan Menglong, Zhang Le, Lv Xueqiang. CNN-SM: Identifying Words on Defective Products with Sememe and Multi-features. Data Analysis and Knowledge Discovery, 2022, 6(9): 77-85.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.1369 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I9/77

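Fig.1 Examples of sememes, senses, and words

Fig.2 SAT word vector

Fig.3 Model structure

Table 1 Corpus samples and examples of manually defined domain words

Table 2 Hyperparameters

Table 3 Results of comparative experiments

Table 4 Detailed experimental results

Table 5 Ablation experiments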
