Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (7): 26-35    DOI: 10.11925/infotech.2096-3467.2021.0094
Extracting Events from Ancient Books Based on RoBERTa-CRF
Yu Xuehan, He Lin, Xu Jian
College of Information Management, Nanjing Agricultural University, Nanjing 210095, China
[Objective] This paper constructs a framework for extracting events from ancient Chinese books, using the RoBERTa-CRF model to identify event types, argument roles and arguments. [Methods] We collected war sentences from Zuozhuan as the experimental data and used them to establish a classification schema for event types and argument roles. In the RoBERTa-CRF model, a multi-layer Transformer extracts features from the corpus, and these features are combined with the sequence tags to learn the constraints between tags. Finally, we identified and extracted the arguments from the predicted tag sequence. [Results] The precision, recall and F1 values of the proposed model were 87.6%, 77.2% and 82.1%, which were higher than those of GuwenBERT-LSTM, BERT-LSTM, RoBERTa-LSTM and BERT-CRF on the same dataset. [Limitations] The experimental dataset needs to be expanded, which could also make the categories more balanced. [Conclusions] The RoBERTa-CRF model constructed in this paper could effectively extract events from ancient Chinese books.
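The final step described in [Methods], reading arguments off the tag sequence, can be sketched as follows. This is a minimal illustration that assumes a BIO tagging scheme; the function name, the label names and the toy sentence are hypothetical and are not taken from the paper's annotated data.

```python
from typing import List, Tuple


def decode_bio(chars: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (label, text) spans from a character-level BIO tag sequence.

    Assumption: tags look like "B-<label>", "I-<label>" or "O".
    """
    spans: List[Tuple[str, str]] = []
    current_label = None
    current_chars: List[str] = []

    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A new span starts; close any span that is still open.
            if current_label is not None:
                spans.append((current_label, "".join(current_chars)))
            current_label = tag[2:]
            current_chars = [ch]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continue the open span with the same label.
            current_chars.append(ch)
        else:
            # "O" or an inconsistent "I-" tag: close the open span, if any.
            if current_label is not None:
                spans.append((current_label, "".join(current_chars)))
            current_label = None
            current_chars = []

    if current_label is not None:
        spans.append((current_label, "".join(current_chars)))
    return spans


# Toy example with hypothetical labels (not from the paper's dataset).
chars = list("晉侯伐秦")
tags = ["B-Attacker", "I-Attacker", "O", "B-Defender"]
arguments = decode_bio(chars, tags)
```

Here `arguments` holds one span per argument, each paired with its role label. An `I-` tag that does not continue an open span of the same label is dropped rather than repaired, which is one of several reasonable conventions.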

Key words: RoBERTa; CRF; Event Extraction; Ancient Chinese Language
Received: 29 January 2021      Published: 11 August 2021
ZTFLH:  TP391  
Fund: Fundamental Research Funds for the Central Universities (SKCX2020006); China Postdoctoral Science Foundation (2020M681652)
Yu Xuehan, He Lin, Xu Jian. Extracting Events from Ancient Books Based on RoBERTa-CRF. Data Analysis and Knowledge Discovery, 2021, 5(7): 26-35.

Event Extraction Framework from Chinese Ancient Books Based on RoBERTa-CRF
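Trigger words | Event type | Argument roles
伐、敗、入、取、侵、討、圍、滅、戰、追、克、降、襲、執、攻、獲、門、徼、軍、逐 | Campaign | Time, attacker, defender, cause of war, location of war, spoils, supporting party, participants
殺、弑 | Killing | Time, attacker, victim, cause of war, location of war, spoils, supporting party
救、援 | Rescue | Time, reinforcements, rescued party, cause of war, location of war, enemy forces, supporting party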
Event Types and Argument Roles of War Sentences in Zuozhuan
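For sequence labeling, a schema like this is usually turned into a fixed label set. The sketch below shows one way to do that. It assumes BIO tags that join event type and argument role; the English identifiers and the label format are hypothetical and may differ from the labels the authors actually used.

```python
# Event types and argument roles from the schema table, with hypothetical
# English identifiers chosen for this illustration.
SCHEMA = {
    "Campaign": ["Time", "Attacker", "Defender", "Cause", "Location",
                 "Spoils", "Ally", "Participant"],
    "Killing": ["Time", "Attacker", "Victim", "Cause", "Location",
                "Spoils", "Ally"],
    "Rescue": ["Time", "Reinforcements", "RescuedParty", "Cause",
               "Location", "Enemy", "Ally"],
}


def build_labels(schema):
    """Return the label list: "O" plus a B-/I- pair per (event type, role)."""
    labels = ["O"]
    for event_type, roles in schema.items():
        for role in roles:
            labels.append(f"B-{event_type}-{role}")
            labels.append(f"I-{event_type}-{role}")
    return labels


labels = build_labels(SCHEMA)
label2id = {label: i for i, label in enumerate(labels)}
```

Under these assumptions the 22 type-role pairs produce 45 labels including `O`. Encoding the event type in the tag lets a single tagger predict the event type and the argument role together.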
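Pre-trained model | BERT | RoBERTa | GuwenBERT
Model name used in this paper | BERT-Base, Chinese | RoBERTa-wwm-ext, Chinese | Ethanyt/guwenBERT-Base
Training data | Chinese Wikipedia | Chinese Wikipedia | Daizhige classical Chinese texts
Script | Simplified and Traditional Chinese | Simplified and Traditional Chinese | Simplified Chinese
Sentence segmentation granularity | Character | Word | Character
Vocabulary size | 21,128 | 21,128 | 23,292
Supported frameworks | PyTorch, TensorFlow | PyTorch, TensorFlow | PyTorch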
Comparison of Three Pre-training Models
Structure of RoBERTa-CRF Model
Example of Event Annotation
Parameter | Value
Sequence length (maxlen) | 128
Number of epochs (epochs) | 45
Batch size (batch_size) | 32
Learning rate (learning_rate) | 0.00002
CRF layer learning rate multiplier (crf_lr_multiplier) | 100
Model Parameters Setting Details
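The settings above can be written as a small configuration. This sketch assumes, as the parameter name suggests, that `crf_lr_multiplier` scales the base learning rate for the CRF layer. The helper `crf_learning_rate` is hypothetical and is not part of any library.

```python
# Hyperparameters as listed in the parameter table.
config = {
    "maxlen": 128,             # maximum sequence length
    "epochs": 45,              # number of training epochs
    "batch_size": 32,          # examples per training batch
    "learning_rate": 2e-5,     # base learning rate (0.00002)
    "crf_lr_multiplier": 100,  # assumed to scale the CRF layer's rate
}


def crf_learning_rate(cfg):
    """Effective CRF learning rate, assuming a simple multiplicative scale."""
    return cfg["learning_rate"] * cfg["crf_lr_multiplier"]


effective_crf_lr = crf_learning_rate(config)
```

Under that assumption the CRF layer would train at roughly 2e-3 while the pre-trained encoder is fine-tuned at 2e-5. A larger rate for the CRF layer is a common choice, because its transition parameters start from scratch.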
Experiment | Model | Precision | Recall | F1
a | GuwenBERT-LSTM | 68.3% | 45.7% | 54.7%
b | BERT-LSTM | 73.4% | 64.6% | 68.7%
c | RoBERTa-LSTM | 77.2% | 66.2% | 71.3%
d | BERT-CRF | 85.0% | 74.9% | 79.7%
e | RoBERTa-CRF | 87.6% | 77.2% | 82.1%
Extraction Performance of Different Models
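The F1 values in this table can be checked against the precision and recall columns. The sketch below uses the standard definition of F1 as the harmonic mean of precision and recall. The helper names are hypothetical, and it is an assumption that the paper computed F1 this way.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, copied from the table.
REPORTED = {
    "GuwenBERT-LSTM": (68.3, 45.7, 54.7),
    "BERT-LSTM": (73.4, 64.6, 68.7),
    "RoBERTa-LSTM": (77.2, 66.2, 71.3),
    "BERT-CRF": (85.0, 74.9, 79.7),
    "RoBERTa-CRF": (87.6, 77.2, 82.1),
}


def max_f1_gap(rows):
    """Largest absolute gap between recomputed and reported F1."""
    return max(abs(f1_score(p, r) - f1) for p, r, f1 in rows.values())


gap = max_f1_gap(REPORTED)
```

Recomputing from the rounded percentages gives values within 0.1 percentage points of every reported F1. Small gaps of this size are expected, since the published precision and recall are themselves rounded.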
Event type | Precision | Recall | F1
War-Campaign | 87.1% | 76.9% | 81.7%
War-Killing | 80.0% | 50.0% | 61.5%
War-Rescue | 96.6% | 93.3% | 94.9%
Argument Extraction Performance of Different Event Types
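Event type | Argument role | Precision | Recall | F1
War-Campaign | Time | 98.5% | 100.0% | 99.3%
War-Campaign | Attacker | 88.3% | 75.7% | 81.5%
War-Campaign | Defender | 92.9% | 76.5% | 83.9%
War-Campaign | Cause of war | 94.4% | 73.9% | 82.9%
War-Campaign | Location of war | 88.0% | 84.6% | 86.3%
War-Campaign | Spoils | 71.4% | 55.6% | 62.5%
War-Campaign | Supporting party | 66.7% | 40.0% | 50.0%
War-Campaign | Participants | 25.0% | 20.0% | 22.2%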
Extraction Performance of Different Argument Roles
