Extracting Entities for Enterprise Risks Based on Stroke ELMo and IDCNN-CRF Model<sup>*</sup>

doi:10.11925/infotech.2096-3467.2021.1308

Data Analysis and Knowledge Discovery

2022, Vol. 6, Issue 9: 86-99 https://doi.org/10.11925/infotech.2096-3467.2021.1308

Research Paper


Yang Meifang (杨美芳)¹,², Yang Bo (杨波)¹,²

¹School of Information Management, Jiangxi University of Finance and Economics, Nanchang 330013, China
²Institute of Information Resource Management, Jiangxi University of Finance and Economics, Nanchang 330013, China


Abstract

[Objective] This paper aims to learn the textual features and contextual semantic associations of risk-domain text effectively, and thereby to improve entity extraction in the enterprise risk domain. [Methods] We propose an entity extraction model that feeds stroke ELMo embeddings into an IDCNN-CRF. First, a bidirectional language model is pre-trained on large-scale unstructured enterprise risk text, and the resulting stroke ELMo vectors serve as the input features. These vectors are then passed to an iterated dilated convolutional neural network (IDCNN) for training, and a conditional random field (CRF) processes the IDCNN output layer to obtain the globally optimal sequence labeling of enterprise risk entities. [Results] The model achieves an F value of 91.9% on enterprise risk entity extraction, which is 2.0% higher than the BiLSTM-CRF model, and it runs 2.36 times faster at test time. [Limitations] The generality of the model for entity extraction tasks in other domains was not examined. [Conclusions] The proposed model can serve as a reference for constructing entity corpora for the enterprise risk domain.

Keywords: Stroke ELMo, Iterated Dilated Convolutional Neural Network, Conditional Random Field, Entity Extraction, Risk Domain Entity
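The pipeline described in the abstract (character embeddings, then an IDCNN, then CRF decoding) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the random vectors merely stand in for stroke ELMo embeddings, and the dilation schedule `(1, 1, 2)`, the four iterations, the tag set, and all function names are assumptions chosen for illustration. The weights are untrained, so the predicted tags are meaningless; the sketch only shows how the data flows and how Viterbi decoding selects the globally best tag sequence.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Same-padded 1-D dilated convolution followed by ReLU.
    x: (T, C_in) sequence, w: (K, C_in, C_out) kernel with odd K."""
    T = x.shape[0]
    K = w.shape[0]
    pad = dilation * (K // 2)
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((T, w.shape[2]))
    for k in range(K):
        # Tap k reads the input shifted by (k - K//2) * dilation positions.
        out += xp[k * dilation : k * dilation + T] @ w[k]
    return np.maximum(out, 0.0)

def idcnn(x, block_weights, dilations=(1, 1, 2), n_iter=4):
    """Apply one block of dilated convolutions repeatedly, reusing the
    same weights on every iteration (the 'iterated' part of IDCNN)."""
    h = x
    for _ in range(n_iter):
        for w, d in zip(block_weights, dilations):
            h = dilated_conv1d(h, w, d)
    return h

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag path for a linear-chain CRF.
    emissions: (T, L) per-position tag scores.
    transitions: (L, L), transitions[i, j] scores moving from tag i to tag j."""
    T, L = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions   # cand[i, j]: best path ending i -> j
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emissions[t]
    best = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]

# Hypothetical setup: 8 characters, 16-dim embeddings, 3 BIO tags.
rng = np.random.default_rng(0)
tags = ["O", "B-RISK", "I-RISK"]              # illustrative tag set only
embeddings = rng.normal(size=(8, 16))         # stand-in for stroke ELMo vectors
block = [rng.normal(scale=0.1, size=(3, 16, 16)) for _ in range(3)]
w_out = rng.normal(scale=0.1, size=(16, len(tags)))
transitions = rng.normal(scale=0.1, size=(len(tags), len(tags)))

features = idcnn(embeddings, block)           # (8, 16) contextual features
emissions = features @ w_out                  # (8, 3) per-character tag scores
path = viterbi_decode(emissions, transitions)
print([tags[i] for i in path])
```

Because the dilated convolutions process all positions of the sequence in parallel rather than step by step as an LSTM does, this kind of architecture is generally faster at inference, which is consistent with the speed-up reported in the abstract.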

Received: 2021-11-17 Published: 2022-10-26

CLC number: TP391

Funding: ^*National Natural Science Foundation of China (72064015); Jiangxi Provincial Social Science "13th Five-Year" Planning Project (19TQ01)

Corresponding author: Yang Meifang (杨美芳), ORCID: 0000-0002-4360-0183, E-mail: yangmeifang@jxufe.edu.cn

Cite this article:

Yang Meifang, Yang Bo. Extracting Entities for Enterprise Risks Based on Stroke ELMo and IDCNN-CRF Model[J]. Data Analysis and Knowledge Discovery, 2022, 6(9): 86-99.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.1308 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I9/86

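Table 1 Entity types and attribute descriptions in the enterprise risk domain

Fig.1 Entity extraction model with stroke ELMo embeddings and IDCNN-CRF

Fig.2 Pre-trained language model for stroke ELMo embeddings

Fig.3 Comparison of simply stacked DCNN and IDCNN models

Fig.4 Corpus annotation process using Doccano

Table 2 Statistics of the experimental datasets

Table 3 Enterprise risk entity extraction performance of statistical-feature-based and deep learning methods

Table 4 Experimental results of the stroke ELMo embedded IDCNN-CRF model

Table 5 Experimental results for each entity type in the enterprise risk domain

Fig.5 Visualization of different character vectors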


