Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (7): 26-35    DOI: 10.11925/infotech.2096-3467.2021.0094
Extracting Events from Ancient Books Based on RoBERTa-CRF
Yu Xuehan, He Lin, Xu Jian
College of Information Management, Nanjing Agricultural University, Nanjing 210095, China
Abstract  

[Objective] This paper constructs a framework for extracting events from ancient Chinese books, using the RoBERTa-CRF model to identify event types, argument roles and arguments. [Methods] We collected war sentences from Zuozhuan as the experimental data and used them to establish a classification schema for event types and argument roles. In the RoBERTa-CRF model, a multi-layer Transformer extracts features from the corpus, and these features are combined with the sequence tags to learn the correlation constraints between tags. Finally, we identified and extracted the arguments from the predicted tag sequence. [Results] The precision, recall and F1 values of the proposed model were 87.6%, 77.2% and 82.1%, higher than those of GuwenBERT-LSTM, BERT-LSTM, RoBERTa-LSTM and BERT-CRF on the same dataset. [Limitations] The experimental dataset needs to be expanded, which could make the topic categories more balanced. [Conclusions] The RoBERTa-CRF model constructed in this paper can effectively extract events from ancient Chinese books.

Keywords: RoBERTa; CRF; Event Extraction; Ancient Chinese Language
Received: 29 January 2021      Published: 11 August 2021
ZTFLH:  TP391  
Fund: Fundamental Research Funds for the Central Universities (SKCX2020006); China Postdoctoral Science Foundation (2020M681652)
Corresponding Author: He Lin, ORCID: 0000-0002-4207-3588, E-mail: helin@njau.edu.cn

Cite this article:

Yu Xuehan, He Lin, Xu Jian. Extracting Events from Ancient Books Based on RoBERTa-CRF. Data Analysis and Knowledge Discovery, 2021, 5(7): 26-35.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0094     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I7/26

Event Extraction Framework from Chinese Ancient Books Based on RoBERTa-CRF
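Trigger words | Event type | Argument roles
伐、敗、入、取、侵、討、圍、滅、戰、追、克、降、襲、執、攻、獲、門、徼、軍、逐 | Campaign (征战) | time, attacker, defender, cause of war, location of war, spoils, supporting party, participants
殺、弑 | Killing (戕杀) | time, attacker, victim, cause of war, location of war, spoils, supporting party
救、援 | Rescue (救援) | time, reinforcements, rescued party, cause of war, location of war, enemy forces, supporting party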
Event Types and Argument Roles of War Sentences in Zuozhuan
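The schema in the table above can be written down as a small data structure. The sketch below is only an illustration: the English keys and role names are assumed translations, `candidate_event_types` is a hypothetical helper, and the example sentence is a well-known line from the Chunqiu rather than an item taken from the paper's dataset. A plain character lookup like this only flags candidates; the paper uses a trained model, since characters such as 入 or 門 are often not triggers in context.

```python
# Event schema transcribed from the table above.
# English keys and role names are illustrative translations (assumptions).
EVENT_SCHEMA = {
    "campaign": {
        "triggers": set("伐敗入取侵討圍滅戰追克降襲執攻獲門徼軍逐"),
        "roles": ["time", "attacker", "defender", "cause", "location",
                  "spoils", "supporting_party", "participants"],
    },
    "killing": {
        "triggers": set("殺弑"),
        "roles": ["time", "attacker", "victim", "cause", "location",
                  "spoils", "supporting_party"],
    },
    "rescue": {
        "triggers": set("救援"),
        "roles": ["time", "reinforcements", "rescued_party", "cause",
                  "location", "enemy", "supporting_party"],
    },
}


def candidate_event_types(sentence):
    """Return (position, character, event_type) for every trigger character found."""
    hits = []
    for position, char in enumerate(sentence):
        for event_type, spec in EVENT_SCHEMA.items():
            if char in spec["triggers"]:
                hits.append((position, char, event_type))
    return hits


hits = candidate_event_types("鄭伯克段于鄢")
print(hits)
```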
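Pre-trained model | BERT | RoBERTa | GuwenBERT
Model name used in this paper | BERT-Base, Chinese | RoBERTa-wwm-ext, Chinese | Ethanyt/guwenBERT-Base
Training data | Chinese Wikipedia | Chinese Wikipedia | Daizhige (殆知阁) classical Chinese texts
Script | Simplified and Traditional Chinese | Simplified and Traditional Chinese | Simplified Chinese
Segmentation granularity | Character | Word | Character
Vocabulary size | 21,128 | 21,128 | 23,292
Supported frameworks | PyTorch, TensorFlow | PyTorch, TensorFlow | PyTorch
Uses the NSP objective | | |
Uses the WWM technique | | |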
Comparison of Three Pre-training Models
Structure of RoBERTa-CRF Model
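The abstract states that the CRF layer learns correlation constraints between sequence tags. The sketch below shows, under stated assumptions, how such constraints can change the decoded sequence: it is a plain Viterbi decoder over per-character scores and tag-to-tag transition scores. The tag names (`B-ATT`, `I-ATT`, `O`) and all numeric scores are invented toy values, not the paper's label set or learned parameters.

```python
def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag path.

    emissions: list of dicts, one per character, mapping tag -> score.
    transitions: dict mapping (previous_tag, current_tag) -> score (default 0.0).
    """
    scores = {tag: emissions[0][tag] for tag in tags}
    backpointers = []
    for emit in emissions[1:]:
        new_scores, pointers = {}, {}
        for cur in tags:
            # Pick the previous tag that gives the best score when moving into `cur`.
            best_prev = max(
                tags, key=lambda prev: scores[prev] + transitions.get((prev, cur), 0.0)
            )
            new_scores[cur] = (
                scores[best_prev] + transitions.get((best_prev, cur), 0.0) + emit[cur]
            )
            pointers[cur] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final tag.
    path = [max(tags, key=lambda tag: scores[tag])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


TAGS = ["O", "B-ATT", "I-ATT"]  # hypothetical tag set
# Toy constraint: an inside tag may not directly follow "O".
TRANSITIONS = {("O", "I-ATT"): -1e4}
# Toy emission scores for a two-character input.
EMISSIONS = [
    {"O": 1.0, "B-ATT": 0.9, "I-ATT": 0.0},
    {"O": 0.2, "B-ATT": 0.1, "I-ATT": 1.0},
]

path = viterbi_decode(EMISSIONS, TRANSITIONS, TAGS)
print(path)
```

Picking the best tag for each character independently would give `O, I-ATT` here, which is not a valid sequence; the transition penalty steers the decoder to a consistent path instead.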
Example of Event Annotation
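According to the abstract, arguments are identified and extracted from the predicted tag sequence. A minimal sketch of that last step follows, assuming a BIO-style scheme with one tag per character; the paper's actual tagging scheme is not given here, and the tags in the example are hand-written for illustration on a well-known Chunqiu sentence, not model output or the paper's annotation.

```python
def extract_arguments(chars, tags):
    """Group characters into (role, text) spans from BIO tags (assumed scheme)."""
    spans = []
    role, buffer = None, []
    for char, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A new span starts; close any span that is still open.
            if role is not None:
                spans.append((role, "".join(buffer)))
            role, buffer = tag[2:], [char]
        elif tag.startswith("I-") and role == tag[2:]:
            buffer.append(char)
        else:
            # "O" or an inconsistent "I-" tag ends the current span.
            if role is not None:
                spans.append((role, "".join(buffer)))
            role, buffer = None, []
    if role is not None:
        spans.append((role, "".join(buffer)))
    return spans


sentence = "鄭伯克段于鄢"
tags = ["B-attacker", "I-attacker", "O", "B-defender", "O", "B-location"]
spans = extract_arguments(sentence, tags)
print(spans)
```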
Parameter | Value
Sequence length (maxlen) | 128
Number of epochs (epochs) | 45
Batch size (batch_size) | 32
Learning rate (learning_rate) | 0.00002
CRF layer learning-rate multiplier (crf_lr_multiplier) | 100
Model Parameter Settings
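The settings in the table above can be collected in one configuration object. The sketch below is an assumption-laden illustration: `TrainConfig` is a hypothetical container, and reading `crf_lr_multiplier` as a factor applied to the base learning rate for the CRF layer is inferred from the parameter name rather than stated in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the values listed in the parameter table."""
    maxlen: int = 128
    epochs: int = 45
    batch_size: int = 32
    learning_rate: float = 2e-5
    crf_lr_multiplier: int = 100

    @property
    def crf_learning_rate(self) -> float:
        # Assumption: the multiplier scales the base learning rate for the CRF layer.
        return self.learning_rate * self.crf_lr_multiplier


config = TrainConfig()
print(config)
print(f"effective CRF learning rate: {config.crf_learning_rate:.4f}")
```

Under that reading the CRF parameters would be updated with a rate of about 0.002, two orders of magnitude above the rate used for the pre-trained encoder.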
Experiment | Model | Precision | Recall | F1
a | GuwenBERT-LSTM | 68.3% | 45.7% | 54.7%
b | BERT-LSTM | 73.4% | 64.6% | 68.7%
c | RoBERTa-LSTM | 77.2% | 66.2% | 71.3%
d | BERT-CRF | 85.0% | 74.9% | 79.7%
e | RoBERTa-CRF | 87.6% | 77.2% | 82.1%
Extraction Performance of Different Models
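The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes it from the rounded precision and recall values in the table; because the inputs are already rounded to one decimal, a recomputed value can differ from the reported one by up to about 0.1 percentage points.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, copied from the table above.
REPORTED = {
    "GuwenBERT-LSTM": (68.3, 45.7, 54.7),
    "BERT-LSTM": (73.4, 64.6, 68.7),
    "RoBERTa-LSTM": (77.2, 66.2, 71.3),
    "BERT-CRF": (85.0, 74.9, 79.7),
    "RoBERTa-CRF": (87.6, 77.2, 82.1),
}

for model, (precision, recall, reported_f1) in REPORTED.items():
    recomputed = f1_score(precision, recall)
    print(f"{model}: recomputed {recomputed:.2f}, reported {reported_f1}")
```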
Event type | Precision | Recall | F1
War-Campaign | 87.1% | 76.9% | 81.7%
War-Killing | 80.0% | 50.0% | 61.5%
War-Rescue | 96.6% | 93.3% | 94.9%
Argument Extraction Performance of Different Event Types
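Event type | Argument role | Precision | Recall | F1
War-Campaign | Time | 98.5% | 100.0% | 99.3%
War-Campaign | Attacker | 88.3% | 75.7% | 81.5%
War-Campaign | Defender | 92.9% | 76.5% | 83.9%
War-Campaign | Cause of war | 94.4% | 73.9% | 82.9%
War-Campaign | Location of war | 88.0% | 84.6% | 86.3%
War-Campaign | Spoils | 71.4% | 55.6% | 62.5%
War-Campaign | Supporting party | 66.7% | 40.0% | 50.0%
War-Campaign | Participants | 25.0% | 20.0% | 22.2%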
Extraction Performance of Different Argument Roles