Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (10): 131-143    DOI: 10.11925/infotech.2096-3467.2022.0334
Constructing and Evaluating Chinese Reading Comprehension Corpus for Anti-Terrorism Field
Gao Feng (1,2,3,4), Yang Zihang (1,2,3,4; corresponding author), Hou Jin (1,2,3,4), Gu Jinguang (1,2,3,4), Cheng Junjun (5)
1. School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065, China
2. Key Laboratory of Rich Media Digital Publication, Content Organization and Knowledge Service, Wuhan University of Science and Technology, Wuhan 430065, China
3. Institute of Big Data Science and Engineering, Wuhan University of Science and Technology, Wuhan 430065, China
4. Big Data Science and Engineering Research Institute, Wuhan 430065, China
5. China Information Technology Security Evaluation Center, Beijing 100083, China
Abstract  

[Objective] This paper builds SecMRC, a Chinese machine reading comprehension corpus for the security field, to provide domain-specific data for related research. [Methods] First, we built a keyword search engine to retrieve domain news. Second, we used the ERNIE-GEN model to generate questions automatically for pre-annotation. Third, we created a domain vocabulary using temporal feature words and a domain keyword-matching algorithm to support accurate word segmentation. Finally, we assembled the dataset from manually annotated question-answer pairs and proposed a new baseline model, SecMT5. [Results] The dataset contains 2,100 anti-terrorism and security-related news articles, 7,300 extractive question-answer pairs, 2,100 generative question-answer pairs, and 4,796,264 characters. We tested advanced reading comprehension models on SecMRC. The F1 score on the extractive task reached 72.05% (6.13% higher than the baseline model), and the average ROUGE-L on the generative task was 37.62%. Both fall well short of human performance. [Limitations] The dataset needs more question-answer pairs, and their difficulty and diversity need to be improved. [Conclusions] SecMRC emphasizes domain knowledge and is challenging. It can effectively support research on machine reading comprehension, and the construction method can be applied to other fields.

Keywords: Anti-Terrorism; Machine Reading Comprehension; Data Set
Received: 12 April 2022      Published: 28 March 2023
ZTFLH: G35; TP391
Fund: Jointly Cultivated Project of National Natural Science Foundation of China (U1836118); Open Fund for Key Laboratory of Content Organization and Knowledge Services of Rich Media Digital Publishing (ZD2021-11/01); 2030 New Generation of Artificial Intelligence Technology Fund for Scientific and Technological Innovation (2020AAA0108500)
Corresponding Author: Yang Zihang, ORCID: 0009-0004-7292-3477, E-mail: 616660865@qq.com

Cite this article:

Gao Feng, Yang Zihang, Hou Jin, Gu Jinguang, Cheng Junjun. Constructing and Evaluating Chinese Reading Comprehension Corpus for Anti-Terrorism Field. Data Analysis and Knowledge Discovery, 2023, 7(10): 131-143.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.0334     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2023/V7/I10/131

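Dataset | Task | Data source | Scale | Answer form
PD&CFT | Cloze | People's Daily and Children's Fairy Tale | 28k documents, 100k questions | Word fill-in
CMRC2018 | Extractive MRC | Wikipedia | 18k questions | Answer span
DuReader | Free-form Q&A | Baidu Search and Baidu Zhidao | 1M Q&A entries, 200k questions | Free form
SQuAD | Extractive MRC | Wikipedia | 536 documents, 100k questions | Answer span
MS MARCO | Summary generation | Bing Search | 200k documents, 100k questions | Free form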
Overview of Common MRC Datasets
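Context: According to foreign media reports, on the 28th local time, a suicide car bombing occurred near a highway security checkpoint in the capital of Country A. The report cited ...... Government forces of Country A had exchanged fire with militants of the extremist group "Al-Shabaab" in a certain state on the 23rd of this month, killing 8 militants.
Question | Answer | Answer_Start
A suicide car bombing occurred near a highway security checkpoint in the capital of Country A; at least how many people were killed? | It has killed at least 60 people | 61
According to the latest news from Kenya's The Standard, where did an explosion occur early in the morning of the 28th local time? | A busy security checkpoint in Mogadishu | 185
According to the Associated Press, citing local police, what may have been the main target of this attack? | 4 engineers from Country B | 313
According to a separate CNN report, which organization has claimed responsibility for this attack? | Country A's "Al-Shabaab" | 339
Government forces of Country A exchanged fire with militants of the extremist group "Al-Shabaab" in a state of Country A on the 23rd of this month; how many militants were killed? | 8 | 445
GTquestion: What may have been the main target of this attack by Country A's "Al-Shabaab"?
GTanswer: The main target may have been 4 engineers from Country B.
Note: the Answer_Start values index into the original Chinese context, not this translation.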
Example of SecMRC
Text Collection Process of Anti-Terrorism Security News
Parameter | Value
batch_size | 5
log_interval | 20
noise_prob | 0.2
learning_rate | 5e-5
len_penalty | 1.0
weight_decay | 0.1
Parameter Settings of The ERNIE-GEN
Single-Round Labeling Process of Question-and-Answer Pair
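Question | Answer | answer_start
Q1: Which state in northwestern Country A has had multiple towns hit by large-scale terrorist attacks in recent days? | A1: A certain state | 21
Q2: As of the 15th, at least how many people, including military police and terrorists, have been killed? | A2: 34 people | 71
Q3: How many attackers in total were killed when the military and police finally repelled the terrorists? | A3: 32 people | 223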
Extractive Q&A Pair Example
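The span-style annotation above pairs each answer with an answer_start offset. A minimal sketch of how such a record could be sanity-checked is shown below, assuming SQuAD-style character offsets; the function name and the sample record are hypothetical and are not taken from SecMRC.

```python
def check_answer_span(context: str, answer: str, answer_start: int) -> bool:
    """Return True when `answer` occurs in `context` exactly at `answer_start`."""
    end = answer_start + len(answer)
    return answer_start >= 0 and context[answer_start:end] == answer


# Hypothetical record for illustration only (not a SecMRC entry).
context = "Clashes on the 15th left at least 34 people dead."
answer = "34 people"
answer_start = context.index(answer)

print(check_answer_span(context, answer, answer_start))      # offset matches
print(check_answer_span(context, answer, answer_start + 1))  # offset is off by one
```

A check like this is typically run over every annotated pair before training, since a shifted offset silently corrupts the supervision signal for extractive models.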
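Item | Training set | Validation set | Test set
Number of news articles | 1,260 | 420 | 420
Number of extractive Q&A pairs | 4,380 | 1,476 | 1,453
Number of generative Q&A pairs | 1,260 | 420 | 420
Domain entities per news article | 8 | 11 | 7
Average news length | 545 | 582 | 517
Average question length | 15 | 21 | 13
Average answer length | 8 | 10 | 8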
SecMRC Subset Statistics
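As a quick arithmetic check on the subset statistics, the sketch below recomputes the totals and split proportions from the counts in the table. The dictionary layout and variable names are illustrative assumptions; only the numbers come from the table.

```python
# Counts copied from the subset statistics table.
splits = {
    "train": {"news": 1260, "extractive": 4380, "generative": 1260},
    "validation": {"news": 420, "extractive": 1476, "generative": 420},
    "test": {"news": 420, "extractive": 1453, "generative": 420},
}

total_news = sum(s["news"] for s in splits.values())
total_extractive = sum(s["extractive"] for s in splits.values())

# Share of news articles in each split, and extractive pairs per article.
news_share = {name: s["news"] / total_news for name, s in splits.items()}
pairs_per_article = {name: s["extractive"] / s["news"] for name, s in splits.items()}

print(total_news, total_extractive)
for name in splits:
    print(f"{name}: {news_share[name]:.0%} of news, "
          f"{pairs_per_article[name]:.2f} extractive pairs per article")
```

The totals (2,100 articles and 7,309 extractive pairs) agree with the rounded figures in the abstract, and the article counts correspond to a 60/20/20 split.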
News Type Distribution of SecMRC
Distribution of Q&A Pairs for Major Terrorist Groups
Question type | Extractive | Generative
Entity | 35% | 37%
Descriptive | 40% | 38%
Yes/no | 20% | 20%
Unanswerable | 5% | 5%
Who | 42% | 40%
How | 28% | 32%
Where | 21% | 23%
When | 9% | 5%
Question Type Distribution for Q&A Pairs
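Text type | Body | Q&A type | Extractive Q&A pair
Terrorist attack | According to the latest news released by Country A's military on the 15th, multiple towns in a state in northwestern Country A have suffered large-scale terrorist attacks in recent days, with continuing violent clashes that have so far killed at least 34 people, including military police and terrorists..... | Descriptive | Q: As of the 15th, at least how many people, including military police and terrorists, have been killed? A: At least 34 people killed
Military conflict | ......On the morning of the 8th local time, "a certain air base" of US forces stationed in Country B was attacked with multiple rockets. A Revolutionary Guard of Country C then immediately issued a statement confirming that it had attacked the US base with "dozens of surface-to-surface missiles"...... | Entity | Q: Which air base of US forces stationed in Country B was hit by rockets? A: "A certain air base" was attacked with multiple rockets
Non-terrorist attack | According to CNN, on the evening of July 31 local time, a mass shooting occurred in New York City, injuring at least 10 people......Chief James Essig pointed out that "this recurring situation must stop citywide"; this was gang-related gun crime...... | Yes/no | Q: Was the mass shooting in New York on the evening of July 31 gang-related? A: It was gang-related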
Examples of Different Types of News And Q&A Pairs in SecMRC
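Question type | CMRC-BERT on SecMRC | CMRC-BERT on CMRC2018 | Sec-BERT on SecMRC | Sec-BERT on CMRC2018
Entity | 73.18 | 87.57 | 89.55 | 81.34
Descriptive | 63.46 | 82.92 | 82.71 | 76.05
Yes/no | 70.93 | 83.50 | 86.02 | 77.89
Multi-hop | 59.68 | - | 67.27 | -
Terrorist attack | 72.86 | - | 88.53 | -
Military conflict | 66.72 | - | 83.19 | -
Non-terrorist attack | 65.91 | - | 84.20 | -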
The Performance of CMRC-BERT in Different Q&A Pairs
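The abstract reports F1 for the extraction task. The scoring script itself is not given on this page, so the following is only a sketch of a common SQuAD-style overlap F1, computed over characters as is usual for Chinese; the function name and the character-level choice are assumptions.

```python
from collections import Counter


def char_f1(prediction: str, reference: str) -> float:
    """Character-overlap F1 between a predicted and a reference answer."""
    pred_chars = [c for c in prediction if not c.isspace()]
    ref_chars = [c for c in reference if not c.isspace()]
    if not pred_chars or not ref_chars:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_chars == ref_chars)
    # Multiset intersection counts each shared character at most min(count) times.
    overlap = sum((Counter(pred_chars) & Counter(ref_chars)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)


print(char_f1("8名", "8名"))
print(char_f1("8名武装分子", "8名"))
```

An over-long prediction such as the second call keeps full recall but loses precision, which is why span boundaries matter for this metric.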
Model | hidden_size | learning_rate | vocab_size | parameter_size
BART | 768 | 0.00002 | 50,265 | 406M
mT5 | 512 | 0.0006 | 250,112 | 300M
GPT-2 | - | 0.0005 | 50,257 | 124M
CMRC-BERT | 768 | 0.0005 | 21,128 | 101M
Model Parameter Settings
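Question type | BART ROUGE-L (%) | mT5 ROUGE-L (%) | GPT-2 ROUGE-L (%) | Human ROUGE-L (%)
Descriptive | 33.44 | 45.23 | 34.56 | 75.47
Entity | 42.73 | 63.86 | 49.24 | 89.51
Yes/no | 36.68 | 49.45 | 37.91 | 68.92
Mean | 37.62 | 53.18 | 40.57 | 77.97

Question type (by organization) | BART ROUGE-L (%) | mT5 ROUGE-L (%) | GPT-2 ROUGE-L (%) | Human ROUGE-L (%)
Islamic State | 53.82 | 62.11 | 55.79 | 89.50
Taliban | 32.10 | 49.80 | 33.26 | 88.92
Al-Qaeda | 55.34 | 69.99 | 65.86 | 89.21
Kurdish forces | 79.65 | 87.18 | 82.44 | 96.65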
Generative Reading Comprehension Model Experiment Result
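The generative results are reported in ROUGE-L, which is based on the longest common subsequence (LCS) between a generated answer and the reference. A minimal character-level sketch follows; the function names are illustrative, the default beta of 1.0 (a plain F1) is an assumption, and the authors' exact scorer and tokenization are not stated on this page.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, using one rolling DP row."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> float:
    """ROUGE-L F-measure: (1 + beta^2) * P * R / (R + beta^2 * P)."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


print(rouge_l("abcde", "ace"))
```

Unlike n-gram overlap, the LCS rewards matching characters that appear in the same order even when they are not contiguous, which suits free-form answers that paraphrase the reference.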
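Context: According to foreign media reports, on the 28th local time, Disha, the capital of Country A ......Government forces of Country A had exchanged fire with militants of extremist organization B in a southern state on the 23rd of this month, killing 8 militants.
GTquestion: What may have been the main target of this attack by Organization B?
Drop all C words: What may have been the main target of this attack?
Drop all Q words: This attack by Organization B?
GTanswer: The main target may have been 4 engineers from C.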
Drop Words Experiment Example
Setting | BART (%) | mT5 (%) | GPT-2 (%)
Original | 37.62 | 53.18 | 40.57
Drop all C words | 13.39 | 25.54 | 19.05
Drop all Q words | 36.77 | 52.23 | 38.40
Experimental Results Of Data Set Capability Assessment
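To make the ablation easier to read, the sketch below computes how many points each model loses under each setting, using only the scores in the table. The helper name point_drop is hypothetical.

```python
# Scores copied from the table above (percent).
original = {"BART": 37.62, "mT5": 53.18, "GPT-2": 40.57}
drop_c = {"BART": 13.39, "mT5": 25.54, "GPT-2": 19.05}
drop_q = {"BART": 36.77, "mT5": 52.23, "GPT-2": 38.40}


def point_drop(before: dict, after: dict) -> dict:
    """Absolute loss in percentage points for each model."""
    return {model: round(before[model] - after[model], 2) for model in before}


loss_c = point_drop(original, drop_c)
loss_q = point_drop(original, drop_q)

for model in original:
    print(f"{model}: -{loss_c[model]:.2f} without C words, "
          f"-{loss_q[model]:.2f} without Q words")
```

For all three models, removing C words costs more than 20 points while removing Q words costs under 3, so the scores depend far more on the former.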
Construction of Syntactic Dependency Matrix
Parameter | Value
vocab_size | 50,000
learning_rate | 0.00005
hidden_size | 768
dropout_rate | 0.1
warm_up_ratio | 0.1
Answer_max_length | 50
Max sequence length | 512
num_beams | 8
SecMT5 Parameter Settings
Dataset | SecMT5 | mT5 | BART | GPT-2
DuReader | 42.29 | 40.36 | 33.29 | 38.17
SecMRC | 51.75 | 41.92 | 28.83 | 34.26
Comparative Experimental Results