Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (9): 86-99    DOI: 10.11925/infotech.2096-3467.2021.1308
Extracting Entities for Enterprise Risks Based on Stroke ELMo and IDCNN-CRF Model
Yang Meifang¹,², Yang Bo¹,²
¹ School of Information Management, Jiangxi University of Finance and Economics, Nanchang 330013, China
² Institute of Information Resource Management, Jiangxi University of Finance and Economics, Nanchang 330013, China
Abstract  

[Objective] This paper proposes a model that learns text characteristics and contextual semantic relevance in order to extract enterprise risk entities more effectively. [Methods] The entity extraction model embeds stroke ELMo vectors in an IDCNN-CRF network. First, we pre-trained a bidirectional language model on large-scale unstructured enterprise risk data and obtained stroke ELMo vectors as input features. Then, we fed these vectors to the IDCNN network for training and used a CRF to process the IDCNN output layer. Finally, we obtained the optimal entity sequence labeling for enterprise risks. [Results] The proposed model achieved an F1 value of 91.9%, 2.0 percentage points higher than the BiLSTM-CRF deep neural network model, and ran 2.36 times as fast as BiLSTM-CRF. [Limitations] More research is needed to examine this model in other fields. [Conclusions] The proposed model provides a reference for constructing entity corpora of enterprise risks.

Key words: Stroke ELMo; Iterated Dilated Convolutional Neural Network; Conditional Random Field; Entity Extraction; Risk Domain Entity
Received: 17 November 2021      Published: 26 October 2022
ZTFLH:  TP391  
Fund: National Natural Science Foundation of China (72064015); Jiangxi Province Social Science “Thirteenth Five-Year Plan” Project (19TQ01)
Corresponding Author: Yang Meifang, ORCID: 0000-0002-4360-0183     E-mail: yangmeifang@jxufe.edu.cn

Cite this article:

Yang Meifang, Yang Bo. Extracting Entities for Enterprise Risks Based on Stroke ELMo and IDCNN-CRF Model. Data Analysis and Knowledge Discovery, 2022, 6(9): 86-99.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.1308     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I9/86

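Entity type | Attribute | Description | Examples | Tag
Risk event | Risk name | Enterprise risks published by the State-owned Assets Supervision and Administration Commission (SASAC) | Production safety risk, corporate reputation risk, capital security risk, etc. | Na
Risk event | Risk time | Time at which the risk event occurred | - | Ti
Risk event | Risk location | Place where the risk event occurred | - | P
Risk event | Risk source | Internal factors that trigger the risk | Negative corporate news, flammable materials in company warehouses, the COVID-19 pandemic, severe weather, etc. | So
Risk management agent | Risk management organization | Enterprises, governments, institutions, etc. | Regulators, governments, listed companies, etc. | Or
Risk management agent | Responsible department | Department responsible for causing or handling the risk | Marketing, finance, legal departments, etc. | De
Risk management resource | Risk management resource | Risk-control resources available to risk management agents | Materials, equipment, funds, ERP, policies, etc. | Res
Risk scenario | Risk cause | Reason the risk formed | Deliberate arson by an employee, sales interruption, blocked financing, etc. | Rea
Risk scenario | Risk consequence | Result of an event's impact on the enterprise | Lower customer satisfaction, higher freight costs, factory shutdown, GDP decline, etc. | Co
Risk scenario | Countermeasure | Measures taken to resolve the risk | Strengthening employee training, closing unsafe production lines, monitoring suspicious payments, etc. | Me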
Types of Entities in the Field of Corporate Risk and Their Descriptions
Stroke ELMo Embedding IDCNN-CRF Entity Extraction Model
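The abstract states that a CRF processes the IDCNN output to obtain the optimal label sequence. A minimal sketch of that decoding step is Viterbi search over per-character emission scores and tag-to-tag transition scores. The tag names (a BIO scheme built on the `Na` tag), the score values, and the function name `viterbi_decode` are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List

def viterbi_decode(emissions: List[Dict[str, float]],
                   transitions: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the highest-scoring tag sequence.

    emissions[t][tag]      : score for `tag` at position t (e.g. from the IDCNN layer)
    transitions[prev][tag] : score for moving from `prev` to `tag` (CRF parameter)
    """
    tags = list(emissions[0])
    # best[t][tag] = best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][tag])
            scores[tag] = best[-1][prev] + transitions[prev][tag] + emissions[t][tag]
            pointers[tag] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Assumed toy scores: the transition O -> I-Na is heavily penalised.
transitions = {
    "O":    {"O": 0.0, "B-Na": 0.0, "I-Na": -10.0},
    "B-Na": {"O": 0.0, "B-Na": 0.0, "I-Na": 1.0},
    "I-Na": {"O": 0.0, "B-Na": 0.0, "I-Na": 1.0},
}
emissions = [
    {"O": 2.0, "B-Na": 0.5, "I-Na": 0.1},
    {"O": 0.2, "B-Na": 1.0, "I-Na": 1.2},
    {"O": 0.1, "B-Na": 0.3, "I-Na": 1.5},
]
print(viterbi_decode(emissions, transitions))  # → ['O', 'B-Na', 'I-Na']
```

Picking the best tag at each position independently would give `O, I-Na, I-Na` here; the transition scores steer the decoder to a sequence that opens the entity with `B-Na`.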
Stroke ELMo Embedding Pre-Trained Language Model
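As a rough illustration of stroke-level input, the sketch below turns a Chinese word into a sequence of stroke-class IDs and lists its stroke n-grams, in the spirit of cw2vec. The two-entry stroke table, the five-class ID scheme (1 horizontal, 2 vertical, 3 left-falling, 4 right-falling or dot, 5 turning), and the n-gram range are assumptions chosen to keep the example small; they do not reproduce the authors' preprocessing.

```python
from typing import Dict, List

# Hypothetical mini stroke table: character -> string of stroke-class IDs.
STROKES: Dict[str, str] = {
    "大": "134",  # horizontal, left-falling, right-falling
    "人": "34",   # left-falling, right-falling
}

def stroke_ngrams(word: str, min_n: int = 3, max_n: int = 5) -> List[str]:
    """Concatenate the stroke IDs of each character, then slide n-gram windows."""
    seq = "".join(STROKES[ch] for ch in word)
    return [seq[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(seq) - n + 1)]

print(stroke_ngrams("大人"))
# → ['134', '343', '434', '1343', '3434', '13434']
```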
Simple Superposition of DCNN and IDCNN
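The idea behind stacking dilated convolutions can be sketched in plain Python: each layer skips `dilation - 1` positions between kernel taps, so the receptive field grows quickly without extra parameters. The dilation schedule `[1, 1, 2]`, the four iterations, and the function names below are assumptions for illustration and are not taken from the paper.

```python
from typing import List

def dilated_conv1d(x: List[float], kernel: List[float], dilation: int) -> List[float]:
    """1-D dilated convolution with zero padding (odd kernel size assumed),
    so the output has the same length as the input."""
    half = (len(kernel) // 2) * dilation
    padded = [0.0] * half + list(x) + [0.0] * half
    return [sum(w * padded[i + k * dilation] for k, w in enumerate(kernel))
            for i in range(len(x))]

def receptive_field(kernel_size: int, dilations: List[int]) -> int:
    """Number of input positions that influence one output of the stacked layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
print(dilated_conv1d(impulse, [1.0, 1.0, 1.0], dilation=2))
# → [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]

print(receptive_field(3, [1, 1, 1]))      # three ordinary layers → 7
print(receptive_field(3, [1, 1, 2]))      # one dilated block → 9
print(receptive_field(3, [1, 1, 2] * 4))  # the same block iterated four times → 33
```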
Application of Doccano to Annotate Corpus
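Span annotations exported from an annotation tool usually have to be converted to per-character tags before a sequence labeling model can be trained. The sketch below assumes a JSON record with a `text` field and a `label` list of `[start, end, tag]` spans (end exclusive), and assumes a BIO tagging scheme; the sentence (roughly "a sales interruption led to a factory shutdown") is made up from the example phrases in the entity table, and neither the record layout nor the scheme is confirmed by the source.

```python
import json
from typing import List

def spans_to_bio(text: str, spans: List[list]) -> List[str]:
    """Convert [start, end, tag] character spans into one BIO tag per character."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

# Assumed record layout; Rea = risk cause, Co = risk consequence.
record = json.loads(
    '{"text": "销售中断导致工厂停摆", "label": [[0, 4, "Rea"], [6, 10, "Co"]]}'
)
for ch, tag in zip(record["text"], spans_to_bio(record["text"], record["label"])):
    print(ch, tag)
```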
Split | Risk description records | Sentences
Training set | 11,634 | 139,608
Test set | 3,324 | 39,888
Validation set | 1,662 | 19,944
Experimental Data Set Statistics
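Approach | Model | Precision | Recall | F1
Statistical features | Left-right entropy + mutual information | 51.2% | 39.1% | 44.3%
Statistical features | Word2Vec-based similar words | 25.3% | 26.1% | 25.7%
Deep learning | IDCNN+CRF | 87.1% | 88.3% | 87.7%
Deep learning | Word2Vec+IDCNN+CRF | 67.7% | 62.5% | 65.1%
Deep learning | cw2vec+IDCNN+CRF | 84.5% | 87.7% | 86.1%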
The Performance of Entity Extraction in the Enterprise Risk Domain Based on Statistical Features and Deep Learning
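The three percentages reported for each model are consistent with F1 being the harmonic mean of the first two (precision and recall). A short check on three rows of the table, with the function name `f1_score` chosen here for illustration:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the table
rows = {
    "Left-right entropy + mutual information": (51.2, 39.1),
    "IDCNN+CRF": (87.1, 88.3),
    "cw2vec+IDCNN+CRF": (84.5, 87.7),
}
for name, (p, r) in rows.items():
    print(f"{name}: F1 = {f1_score(p, r):.1f}%")
# → 44.3%, 87.7% and 86.1%, matching the reported F1 column
```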
Features | IDCNN F1 | IDCNN Speed | BiLSTM F1 | BiLSTM Speed | IDCNN-CRF F1 | IDCNN-CRF Speed | BiLSTM-CRF F1 | BiLSTM-CRF Speed
cw2vec | 83.9% | 14.51× | 81.1% | 9.64× | 86.4% | 2.62× | 82.3% | 1.35×
cw2vec+ELMo | 88.1% | 10.32× | 84.9% | 6.55× | 90.1% | 2.43× | 86.5% | 1.13×
Word2Vec+cw2vec+ELMo | 89.5% | 9.82× | 87.3% | 4.23× | 91.9% | 2.36× | 89.9% | 1.00×
Experimental Results of the Stroke ELMo Embedding IDCNN-CRF Model
Entity | IDCNN cw2vec | IDCNN +ELMo | IDCNN +Word2Vec | BiLSTM cw2vec | BiLSTM +ELMo | BiLSTM +Word2Vec | IDCNN-CRF cw2vec | IDCNN-CRF +ELMo | IDCNN-CRF +Word2Vec | BiLSTM-CRF cw2vec | BiLSTM-CRF +ELMo | BiLSTM-CRF +Word2Vec
Risk name | 82.6% | 87.4% | 88.6% | 79.1% | 82.8% | 84.3% | 85.2% | 89.1% | 91.0% | 80.1% | 84.3% | 86.3%
Risk time | 92.2% | 94.1% | 96.4% | 86.6% | 91.2% | 93.8% | 94.8% | 96.3% | 98.7% | 88.3% | 93.1% | 96.8%
Risk location | 91.8% | 92.6% | 95.8% | 87.1% | 91.8% | 94.2% | 94.6% | 94.9% | 98.3% | 89.2% | 93.6% | 97.4%
Risk source | 78.8% | 84.6% | 85.3% | 79.5% | 82.8% | 83.6% | 81.3% | 86.0% | 87.9% | 80.3% | 83.6% | 86.1%
Risk management organization | 88.3% | 91.1% | 93.1% | 85.8% | 90.1% | 93.1% | 90.9% | 94.1% | 95.9% | 87.9% | 93.2% | 95.8%
Responsible department | 87.6% | 90.4% | 91.6% | 82.6% | 86.6% | 91.5% | 90.3% | 91.4% | 94.1% | 84.1% | 88.7% | 94.0%
Risk management resource | 79.4% | 85.2% | 86.1% | 77.2% | 80.9% | 82.8% | 81.9% | 87.2% | 88.6% | 77.8% | 81.6% | 84.7%
Risk cause | 79.3% | 84.1% | 84.7% | 78.2% | 81.6% | 83.3% | 81.8% | 86.1% | 87.1% | 78.7% | 82.5% | 85.9%
Risk consequence | 80.1% | 85.7% | 86.9% | 77.5% | 80.8% | 83.4% | 82.3% | 87.7% | 89.1% | 78.2% | 81.8% | 85.9%
Countermeasure | 79.1% | 85.9% | 86.3% | 77.1% | 80.8% | 83.1% | 81.3% | 87.9% | 88.3% | 78.1% | 82.1% | 85.8%
Overall | 83.9% | 88.1% | 89.5% | 81.1% | 84.9% | 87.3% | 86.4% | 90.1% | 91.9% | 82.3% | 86.5% | 89.9%
Experimental Results of Various Entities in the Field of Enterprise Risk
Visualization Results of Different Word Vectors