Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (12): 70-79     https://doi.org/10.11925/infotech.2096-3467.2022.0165
Research Paper
Extracting Entities from Intangible Cultural Heritage Texts Based on Machine Reading Comprehension
Fan Tao1,2, Wang Hao1,2, Zhang Wei1,2, Li Xiaomin1,2
1School of Information Management, Nanjing University, Nanjing 210023, China
2Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
Abstract

[Objective] This paper proposes a question-answering (QA) model based on machine reading comprehension (MRC) to extract entities from Intangible Cultural Heritage (ICH) texts. [Methods] First, we constructed an ICH entity-sensitive attention mechanism that captures the interaction between contexts and questions and helps the model focus on the ICH entities related to each question. We then built the ICHQA entity extraction model on top of this mechanism. [Results] We evaluated ICHQA on an annotated ICH corpus and compared it with related baseline models. ICHQA achieved the best F1 score, 87.139%. We also conducted ablation studies and visualized the model outputs to highlight its advantages and improve interpretability. [Limitations] The proposed model was only validated on the ICH corpus; its generalizability to other corpora requires further testing. [Conclusions] Extracting ICH entities with machine reading comprehension effectively exploits the semantic features of entity labels and improves entity extraction performance.
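For readers unfamiliar with the MRC formulation, the following minimal sketch shows how an entity-type question and an ICH sentence can be encoded jointly so that a span-prediction head returns start and end positions of candidate entities. It is an illustrative assumption built on a plain BERT encoder, not the authors' ICHQA implementation (which additionally applies the ICH entity-sensitive attention mechanism); the class name ICHSpanExtractor and the example sentence are invented for the sketch.

from torch import nn
from transformers import BertModel, BertTokenizerFast

class ICHSpanExtractor(nn.Module):
    """Illustrative NER-as-MRC model: not the authors' ICHQA architecture."""
    def __init__(self, encoder_name="bert-base-chinese"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(encoder_name)
        # Two scores per token: start-of-span and end-of-span logits.
        self.span_head = nn.Linear(self.encoder.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask,
                              token_type_ids=token_type_ids).last_hidden_state
        start_logits, end_logits = self.span_head(hidden).split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = ICHSpanExtractor()
# The location question from Table 1 is paired with an ICH-style sentence
# (the sentence is an illustrative example, not from the paper's corpus).
question = "寻找句子中存在的非遗申报地区、流行地区、所在地方等具体或抽象的地点。"
context = "昆曲流行于江苏昆山一带。"
inputs = tokenizer(question, context, return_tensors="pt")
start_logits, end_logits = model(**inputs)  # highest-scoring start/end pair marks an entity span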

Key words: Digital Humanities; Intangible Cultural Heritage; Named Entity Recognition; Attention Mechanism; Machine Reading Comprehension
Received: 2022-02-28      Published online: 2023-02-03
CLC Number:  G202
Funding: *This work was supported by the National Natural Science Foundation of China (Grant No. 72074108), the Nanjing University Young Interdisciplinary Team Program in the Humanities (No. 010814370113), the Jiangsu Young Talents in Social Sciences program, and the Nanjing University "Zhongying Young Scholar" talent program.
Corresponding author: Wang Hao, ORCID: 0000-0002-0131-0823     E-mail: ywhaowang@nju.edu.cn
Cite this article:
Fan Tao, Wang Hao, Zhang Wei, Li Xiaomin. Extracting Entities from Intangible Cultural Heritage Texts Based on Machine Reading Comprehension. Data Analysis and Knowledge Discovery, 2022, 6(12): 70-79.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.0165      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I12/70
Entity category    Question
ICH name    Find the ICH names or aliases that appear in the sentence.
Organization    Find the organizations in the sentence, such as ICH protection units, nominating organizations, and local associations.
Time    Find the time expressions in the sentence, such as the ICH nomination time, the era of origin, and celebration dates.
Location    Find the concrete or abstract locations in the sentence, such as the ICH nomination region, the region where it is popular, and where it is located.
Table 1  Annotation-rule-guided question generation
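The questions in Table 1 can be paired with every sentence of the corpus to form one MRC instance per (sentence, entity type). The sketch below assumes end-exclusive character-offset annotations; the dictionary and helper names are illustrative and not taken from the paper.

# Illustrative construction of MRC training instances from Table 1 questions.
RULE_GUIDED_QUESTIONS = {
    "ICH_NAME":     "寻找句子中存在的非遗名称或者别名。",
    "ORGANIZATION": "寻找句子中存在的非遗保护单位、申报组织、地方协会等机构。",
    "TIME":         "寻找句子中存在的非遗申报时间、起源年代、庆祝时间等。",
    "LOCATION":     "寻找句子中存在的非遗申报地区、流行地区、所在地方等具体或抽象的地点。",
}

def build_mrc_instances(sentence, annotations):
    """annotations: {entity_type: [(start, end), ...]} end-exclusive character offsets."""
    instances = []
    for entity_type, question in RULE_GUIDED_QUESTIONS.items():
        instances.append({
            "question": question,
            "context": sentence,
            "spans": annotations.get(entity_type, []),  # empty list = no answer for this type
        })
    return instances

# Illustrative usage with an invented sentence and annotations.
sentence = "昆曲流行于江苏昆山一带。"
annotations = {"ICH_NAME": [(0, 2)], "LOCATION": [(5, 11)]}
print(build_mrc_instances(sentence, annotations))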
Fig. 1  Model architecture
Model    F1 (%)
BiLSTM-CRF 83.757
BiLSTM-Att-CRF 85.346
BiGRU-CRF 85.293
BERT-CRF 76.720
BERT-MRC 79.442
ICHQA 87.139
Table 2  Experimental results
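Table 2 compares the models by F1. The paper does not detail its matching criterion, so the sketch below assumes strict entity-level matching, in which a predicted entity counts as correct only when its type and span both match a gold entity exactly.

# A sketch of strict entity-level F1 (an assumption about the evaluation protocol).
def entity_f1(gold, pred):
    """gold, pred: sets of (sentence_id, entity_type, start, end) tuples."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative data: one of two gold entities is recovered, one prediction is spurious.
gold = {(0, "LOCATION", 28, 32), (0, "TIME", 3, 8)}
pred = {(0, "LOCATION", 28, 32), (0, "ICH_NAME", 0, 2)}
print(round(entity_f1(gold, pred) * 100, 3))  # 50.0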
Model    Time (s/epoch)
BiLSTM-CRF 79.0
BiLSTM-Att-CRF 81.5
BiGRU-CRF 77.6
BERT-CRF >100
BERT-MRC >100
ICHQA 41.2
Table 3  Decoding time
Question    Content
Q1: Wikipedia definition    Geographic location refers to the spatial relationship between something on the Earth's surface and other things.
Q2: Keyword    Geographic location
Q3: Synonyms    Position; bearing; location
Q4: Keyword + synonyms    Geographic location; position; bearing; location
Q5: Template-based question    What geographic locations are there in the sentence?
Q6: Annotation-rule-guided question    Find the concrete or abstract locations in the sentence, such as the ICH nomination region, the region where it is popular, and where it is located.
Table 4  Questions generated in different ways
Query F1 (%)
Q1 87.081
Q2 86.823
Q3 86.760
Q4 86.893
Q5 87.013
Q6 87.139
Table 5  Effect of different question generation methods on model performance
Model    F1 (%)
W/O Attention 86.078
ICHQA 87.139
Table 6  Effect of different structures on model performance
Fig. 2  Performance of the ICHQA and BiLSTM-Att-CRF models under different training data sizes
Fig. 3  Visualization of location entity extraction with the ICH entity-sensitive attention mechanism applied
Fig. 4  Visualization of ICH name entity extraction with the ICH entity-sensitive attention mechanism applied
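Fig. 3 and Fig. 4 visualize the attention weights between question tokens and context tokens. The sketch below shows one way such a heatmap can be drawn; the token lists and the weight matrix are random stand-in data, not output from ICHQA.

import numpy as np
import matplotlib.pyplot as plt

question_tokens = ["find", "the", "locations", "in", "the", "sentence"]        # stand-in question tokens
context_tokens = ["Kunqu", "is", "popular", "around", "Kunshan", ",", "Jiangsu"]  # stand-in context tokens
weights = np.random.rand(len(question_tokens), len(context_tokens))
weights /= weights.sum(axis=1, keepdims=True)  # row-normalize so each question token's weights sum to 1

fig, ax = plt.subplots(figsize=(8, 3))
im = ax.imshow(weights, cmap="viridis", aspect="auto")
ax.set_xticks(range(len(context_tokens)))
ax.set_xticklabels(context_tokens)
ax.set_yticks(range(len(question_tokens)))
ax.set_yticklabels(question_tokens)
ax.set_xlabel("Context tokens")
ax.set_ylabel("Question tokens")
fig.colorbar(im, ax=ax, label="Attention weight")
fig.tight_layout()
fig.savefig("attention_heatmap.png")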