Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (12): 70-79     https://doi.org/10.11925/infotech.2096-3467.2022.0165
Research Paper
Extracting Entities from Intangible Cultural Heritage Texts Based on Machine Reading Comprehension
Fan Tao1,2, Wang Hao1,2, Zhang Wei1,2, Li Xiaomin1,2
1School of Information Management, Nanjing University, Nanjing 210023, China
2Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
Abstract

[Objective] This paper proposes a question-answering (QA) model based on machine reading comprehension (MRC) to extract entities from Intangible Cultural Heritage (ICH) texts. [Methods] First, we constructed an ICH entity-sensitive attention mechanism that captures the interaction between contexts and questions and helps the model focus on the ICH entities related to each question. We then built the ICHQA entity extraction model on top of this mechanism. [Results] We evaluated ICHQA on an annotated ICH corpus and compared it with related baseline models. ICHQA achieved the best F1 score, 87.139%. We also conducted ablation studies and visualized the model outputs to highlight its advantages and improve interpretability. [Limitations] The proposed model was only validated on the ICH corpus; its generalizability to other corpora requires further testing. [Conclusions] Extracting ICH entities with machine reading comprehension effectively exploits the semantic features of entity labels and improves entity extraction performance.
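For readers unfamiliar with the MRC formulation, the following minimal sketch shows how an entity-type question and an ICH sentence can be encoded jointly so that a span-prediction head returns start and end positions of candidate entities. It is an illustrative assumption built on a plain BERT encoder, not the authors' ICHQA implementation (which additionally applies the ICH entity-sensitive attention mechanism); the class name ICHSpanExtractor and the example sentence are invented for the sketch.

from torch import nn
from transformers import BertModel, BertTokenizerFast

class ICHSpanExtractor(nn.Module):
    """Illustrative NER-as-MRC model: not the authors' ICHQA architecture."""
    def __init__(self, encoder_name="bert-base-chinese"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(encoder_name)
        # Two scores per token: start-of-span and end-of-span logits.
        self.span_head = nn.Linear(self.encoder.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask,
                              token_type_ids=token_type_ids).last_hidden_state
        start_logits, end_logits = self.span_head(hidden).split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = ICHSpanExtractor()
# The location question from Table 1 is paired with an ICH-style sentence
# (the sentence is an illustrative example, not from the paper's corpus).
question = "寻找句子中存在的非遗申报地区、流行地区、所在地方等具体或抽象的地点。"
context = "昆曲流行于江苏昆山一带。"
inputs = tokenizer(question, context, return_tensors="pt")
start_logits, end_logits = model(**inputs)  # highest-scoring start/end pair marks an entity span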

Key words: Digital Humanities; Intangible Cultural Heritage; Named Entity Recognition; Attention Mechanism; Machine Reading Comprehension
Received: 2022-02-28      Published online: 2023-02-03
CLC Number:  G202
Funding: *This work was supported by the National Natural Science Foundation of China (Grant No. 72074108), the Nanjing University Young Interdisciplinary Team Program in the Humanities (No. 010814370113), the Jiangsu Young Talents in Social Sciences program, and the Nanjing University "Zhongying Young Scholar" talent program.
Corresponding author: Wang Hao, ORCID: 0000-0002-0131-0823     E-mail: ywhaowang@nju.edu.cn
Cite this article:
Fan Tao, Wang Hao, Zhang Wei, Li Xiaomin. Extracting Entities from Intangible Cultural Heritage Texts Based on Machine Reading Comprehension. Data Analysis and Knowledge Discovery, 2022, 6(12): 70-79.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.0165      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I12/70
Entity category    Question
ICH name    Find the ICH names or aliases that appear in the sentence.
Organization    Find the organizations in the sentence, such as ICH protection units, nominating organizations, and local associations.
Time    Find the time expressions in the sentence, such as the ICH nomination time, the era of origin, and celebration dates.
Location    Find the concrete or abstract locations in the sentence, such as the ICH nomination region, the region where it is popular, and where it is located.
Table 1  Annotation-rule-guided question generation
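The questions in Table 1 can be paired with every sentence of the corpus to form one MRC instance per (sentence, entity type). The sketch below assumes end-exclusive character-offset annotations; the dictionary and helper names are illustrative and not taken from the paper.

# Illustrative construction of MRC training instances from Table 1 questions.
RULE_GUIDED_QUESTIONS = {
    "ICH_NAME":     "寻找句子中存在的非遗名称或者别名。",
    "ORGANIZATION": "寻找句子中存在的非遗保护单位、申报组织、地方协会等机构。",
    "TIME":         "寻找句子中存在的非遗申报时间、起源年代、庆祝时间等。",
    "LOCATION":     "寻找句子中存在的非遗申报地区、流行地区、所在地方等具体或抽象的地点。",
}

def build_mrc_instances(sentence, annotations):
    """annotations: {entity_type: [(start, end), ...]} end-exclusive character offsets."""
    instances = []
    for entity_type, question in RULE_GUIDED_QUESTIONS.items():
        instances.append({
            "question": question,
            "context": sentence,
            "spans": annotations.get(entity_type, []),  # empty list = no answer for this type
        })
    return instances

# Illustrative usage with an invented sentence and annotations.
sentence = "昆曲流行于江苏昆山一带。"
annotations = {"ICH_NAME": [(0, 2)], "LOCATION": [(5, 11)]}
print(build_mrc_instances(sentence, annotations))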
Fig. 1  Model architecture
Model    F1 (%)
BiLSTM-CRF 83.757
BiLSTM-Att-CRF 85.346
BiGRU-CRF 85.293
BERT-CRF 76.720
BERT-MRC 79.442
ICHQA 87.139
Table 2  Experimental results
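Table 2 compares the models by F1. The paper does not detail its matching criterion, so the sketch below assumes strict entity-level matching, in which a predicted entity counts as correct only when its type and span both match a gold entity exactly.

# A sketch of strict entity-level F1 (an assumption about the evaluation protocol).
def entity_f1(gold, pred):
    """gold, pred: sets of (sentence_id, entity_type, start, end) tuples."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative data: one of two gold entities is recovered, one prediction is spurious.
gold = {(0, "LOCATION", 28, 32), (0, "TIME", 3, 8)}
pred = {(0, "LOCATION", 28, 32), (0, "ICH_NAME", 0, 2)}
print(round(entity_f1(gold, pred) * 100, 3))  # 50.0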
Model    Time (s/epoch)
BiLSTM-CRF 79.0
BiLSTM-Att-CRF 81.5
BiGRU-CRF 77.6
BERT-CRF >100
BERT-MRC >100
ICHQA 41.2
Table 3  Decoding time
Question    Content
Q1: Wikipedia definition    Geographic location refers to the spatial relationship between something on the Earth's surface and other things.
Q2: Keyword    Geographic location
Q3: Synonyms    Position; bearing; location
Q4: Keyword + synonyms    Geographic location; position; bearing; location
Q5: Template-based question    What geographic locations are there in the sentence?
Q6: Annotation-rule-guided question    Find the concrete or abstract locations in the sentence, such as the ICH nomination region, the region where it is popular, and where it is located.
Table 4  Questions generated in different ways
Query F1 (%)
Q1 87.081
Q2 86.823
Q3 86.760
Q4 86.893
Q5 87.013
Q6 87.139
Table 5  Effect of different question generation methods on model performance
Model    F1 (%)
W/O Attention 86.078
ICHQA 87.139
Table 6  Effect of different structures on model performance
Fig. 2  Performance of the ICHQA and BiLSTM-Att-CRF models under different training data sizes
Fig. 3  Visualization of location entity extraction with the ICH entity-sensitive attention mechanism applied
Fig. 4  Visualization of ICH name entity extraction with the ICH entity-sensitive attention mechanism applied
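Fig. 3 and Fig. 4 visualize the attention weights between question tokens and context tokens. The sketch below shows one way such a heatmap can be drawn; the token lists and the weight matrix are random stand-in data, not output from ICHQA.

import numpy as np
import matplotlib.pyplot as plt

question_tokens = ["find", "the", "locations", "in", "the", "sentence"]        # stand-in question tokens
context_tokens = ["Kunqu", "is", "popular", "around", "Kunshan", ",", "Jiangsu"]  # stand-in context tokens
weights = np.random.rand(len(question_tokens), len(context_tokens))
weights /= weights.sum(axis=1, keepdims=True)  # row-normalize so each question token's weights sum to 1

fig, ax = plt.subplots(figsize=(8, 3))
im = ax.imshow(weights, cmap="viridis", aspect="auto")
ax.set_xticks(range(len(context_tokens)))
ax.set_xticklabels(context_tokens)
ax.set_yticks(range(len(question_tokens)))
ax.set_yticklabels(question_tokens)
ax.set_xlabel("Context tokens")
ax.set_ylabel("Question tokens")
fig.colorbar(im, ax=ax, label="Attention weight")
fig.tight_layout()
fig.savefig("attention_heatmap.png")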