1 School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065, China
2 Key Laboratory of Rich Media Digital Publication, Content Organization and Knowledge Service, Wuhan University of Science and Technology, Wuhan 430065, China
3 Institute of Big Data Science and Engineering, Wuhan University of Science and Technology, Wuhan 430065, China
4 Big Data Science and Engineering Research Institute, Wuhan 430065, China
5 China Information Technology Security Evaluation Center, Beijing 100083, China
[Objective] This paper develops a corpus for Chinese machine reading comprehension in the security field (SecMRC), providing professional data support for related studies. [Methods] First, we constructed a keyword search engine to retrieve domain news. Second, we automatically generated questions for pre-annotation with the ERNIE-GEN model. Third, we created a domain vocabulary using temporal feature words and domain keyword-matching algorithms to support accurate word segmentation. Finally, we formed the final dataset from manually annotated question-answer pairs and proposed a new baseline model (SecMT5). [Results] The dataset contains 2,100 anti-terrorism and security-related news articles, 7,300 extractive question-answer pairs, 2,100 generative Q&A pairs, and 4,796,264 characters. We evaluated advanced reading comprehension models on SecMRC: the F1 score of the extractive task reached 72.05% (6.13% higher than the baseline model), and the average ROUGE-L of the generative task was 37.62%; both are significantly below human performance. [Limitations] The number of Q&A pairs in the dataset needs to be expanded, and their difficulty and diversity need to be improved. [Conclusions] The SecMRC dataset highlights domain knowledge and is challenging; it can effectively support research on machine reading comprehension, and its construction method can be applied to other fields.
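To make the two evaluation metrics in the abstract concrete, the sketch below shows how token-level F1 (the extractive-task metric) and ROUGE-L (the generative-task metric) are typically computed. This is an illustrative implementation under common assumptions (SQuAD-style bag-of-tokens F1; ROUGE-L as an LCS-based F-measure with beta = 1.2; Chinese answers scored as character sequences), not the paper's actual evaluation code.

```python
from collections import Counter

def token_f1(pred, gold):
    """SQuAD-style token-overlap F1; for Chinese, tokens are usually characters."""
    common = Counter(pred) & Counter(gold)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(pred), overlap / len(gold)
    return 2 * p * r / (p + r)

def lcs_len(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(pred, gold, beta=1.2):
    """ROUGE-L F-measure (Lin, 2004); beta > 1 weights recall over precision."""
    lcs = lcs_len(pred, gold)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(pred), lcs / len(gold)
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)

# Hypothetical example: scoring a predicted answer against a reference,
# character by character, as would be done for a Chinese MRC dataset.
pred = list("恐怖袭击发生在巴黎")
gold = list("袭击发生在法国巴黎")
print(round(token_f1(pred, gold), 4))
print(round(rouge_l(pred, gold), 4))
```

Reported corpus-level scores are averages of these per-answer scores; when multiple reference answers exist, the maximum over references is usually taken.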
[1] Liu S S, Zhang X, Zhang S, et al. Neural Machine Reading Comprehension: Methods and Trends[J]. Applied Sciences, 2019, 9(18): Article No. 698.
[2] Hermann K M, Kočiský T, Grefenstette E, et al. Teaching Machines to Read and Comprehend[C]// Proceedings of the 28th International Conference on Neural Information Processing Systems. 2015: 1693-1701.
[3] Rajpurkar P, Zhang J, Lopyrev K, et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text[OL]. arXiv Preprint, arXiv: 1606.05250.
[4] Cui Y M, Liu T, Chen Z P, et al. Consensus Attention-Based Neural Networks for Chinese Reading Comprehension[OL]. arXiv Preprint, arXiv: 1607.02250.
[5] Cui Y M, Liu T, Che W X, et al. A Span-Extraction Dataset for Chinese Machine Reading Comprehension[OL]. arXiv Preprint, arXiv: 1810.07366.
[6] He W, Liu K, Liu J, et al. DuReader: A Chinese Machine Reading Comprehension Dataset from Real-World Applications[OL]. arXiv Preprint, arXiv: 1711.05073.
[7] Bajaj P, Campos D, Craswell N, et al. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset[OL]. arXiv Preprint, arXiv: 1611.09268.
[8] Britannica E. Britannica Concise Encyclopedia[M]. Encyclopaedia Britannica, Inc., 2008.
[9] Anil A, Kumar D, Sharma S, et al. Link Prediction Using Social Network Analysis over Heterogeneous Terrorist Network[C]// Proceedings of the 2015 IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity). IEEE, 2015: 267-272.
[10] Xu Rongzhen, Liu Wenqiang, Fu Ziyang. Research on Military Material Distribution Optimization of Continuous Consumption in Wartime[J]. Journal of Railway Science and Engineering, 2015, 12(5): 1243-1247.
[11] Li Letian, Zheng Hezhen, Ding Chen, et al. Terrorist Attack Classification Based on Improved BP Neural Network[J]. Software Guide, 2019, 18(5): 21-26.
[12] Sachan A. E-TGPS: Enhanced Terrorist Group Prediction System for Counter Terrorism[J]. International Journal of Computer Applications, 2015, 117(24): 24-28.
[13] Kadlec R, Schmid M, Bajgar O, et al. Text Understanding with the Attention Sum Reader Network[OL]. arXiv Preprint, arXiv: 1603.01547.
[14] Rajpurkar P, Jia R, Liang P. Know What You Don't Know: Unanswerable Questions for SQuAD[OL]. arXiv Preprint, arXiv: 1806.03822.
[15] Mo H W, Meng X, Li J Q, et al. Terrorist Event Prediction Based on Revealing Data[C]// Proceedings of the 2nd International Conference on Big Data Analysis. IEEE, 2017: 239-244.
Li Hui, Zhang Nannan, Cao Zhuo, et al. Terrorist Prediction Algorithm Based on Machine Learning[J]. Computer Engineering, 2020, 46(2): 315-320. doi: 10.19678/j.issn.1000-3428.0053521.
Xu Xingpeng, Xu Qingfeng, Fang Zhiming, et al. Development Trend of Global Terrorist Attacks and Analysis on Casualty Risk[J]. China Safety Science Journal, 2021, 31(6): 170-175. doi: 10.16265/j.cnki.issn1003-3033.2021.06.022.
Ru Liyun, Li Zhichao, Ma Shaoping. Indexing Page Collection Selection Method for Search Engine[J]. Journal of Computer Research and Development, 2014, 51(10): 2239-2247.
[21] Wu C H, Liu C H, Su P H. Sentence Extraction with Topic Modeling for Question-Answer Pair Generation[J]. Soft Computing, 2015, 19(1): 39-46. doi: 10.1007/s00500-014-1386-6.
[22] Liu H, Gao P D. New Words Discovery Method Based on Word Segmentation Result[C]// Proceedings of the 17th International Conference on Computer and Information Science (ICIS). IEEE, 2018: 645-648.
[23] Ye Chunlei, Leng Fuhai. Building the Future-Oriented Technology Thesaurus of Technology Roadmap[J]. New Technology of Library and Information Service, 2013(5): 59-63.
[24] Su Lixin, Guo Jiafeng, Fan Yixing, et al. A Reading Comprehension Model for Multiple-Span Answers[J]. Chinese Journal of Computers, 2020, 43(5): 856-867.
[25] Fu J L, Liu P F, Zhang Q. Rethinking Generalization of Neural Models: A Named Entity Recognition Case Study[C]// Proceedings of the 2020 AAAI Conference on Artificial Intelligence. 2020.
[26] Lewis M, Liu Y H, Goyal N, et al. BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020: 7871-7880.
[27] Xue L T, Constant N, Roberts A, et al. mT5: A Massively Multilingual Pre-Trained Text-to-Text Transformer[OL]. arXiv Preprint, arXiv: 2010.11934.
[28] Radford A, Wu J, Child R, et al. Language Models Are Unsupervised Multitask Learners[J]. OpenAI Blog, 2019, 1(8): 9.
[29] Lin C Y. ROUGE: A Package for Automatic Evaluation of Summaries[C]// Proceedings of the 2004 Workshop on Text Summarization Branches Out. 2004: 74-81.
[30] Sugawara S, Stenetorp P, Inui K, et al. Assessing the Benchmarking Capacity of Machine Reading Comprehension Datasets[C]// Proceedings of the 2020 AAAI Conference on Artificial Intelligence. 2020.
[31] Che W X, Feng Y L, Qin L B, et al. N-LTP: An Open-Source Neural Language Technology Platform for Chinese[OL]. arXiv Preprint, arXiv: 2009.11616.