Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (12): 70-79    DOI: 10.11925/infotech.2096-3467.2022.0165
Extracting Entities from Intangible Cultural Heritage Texts Based on Machine Reading Comprehension
Fan Tao1,2, Wang Hao1,2, Zhang Wei1,2, Li Xiaomin1,2
1School of Information Management, Nanjing University, Nanjing 210023, China
2Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
Abstract  

[Objective] This paper proposes a Question-Answering (QA) model based on machine reading comprehension (MRC) to extract entities from Intangible Cultural Heritage (ICH) texts. [Methods] First, we constructed an ICH entity-sensitive attention mechanism that captures the interaction between contexts and questions and helps the model focus on the questions and the related ICH entities. Then, we built the ICHQA model for entity extraction. [Results] We evaluated the ICHQA model on the ICH corpus. Its F1 value reached 87.139%, outperforming the existing models. We also performed ablation studies and visualized the outputs of ICHQA. [Limitations] More research is needed to examine the proposed model on other corpora from digital humanities. [Conclusions] The proposed model can effectively extract ICH entities.
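The MRC formulation described above pairs each sentence with a type-specific question and predicts answer spans inside the sentence. Below is a minimal PyTorch sketch of this idea, not the authors' released implementation: the layer sizes, the model class name, and the generic question-to-context attention (standing in for the paper's ICH entity-sensitive attention) are all illustrative assumptions.

```python
# Minimal sketch: MRC-style entity extraction with question-aware attention.
# All hyperparameters and names are illustrative, not the paper's exact model.
import torch
import torch.nn as nn

class MRCEntityExtractor(nn.Module):
    def __init__(self, vocab_size=21128, emb_dim=128, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4,
                                          batch_first=True)
        self.start_head = nn.Linear(2 * hidden, 2)  # does token start a span?
        self.end_head = nn.Linear(2 * hidden, 2)    # does token end a span?

    def forward(self, question_ids, context_ids):
        q, _ = self.encoder(self.embed(question_ids))
        c, _ = self.encoder(self.embed(context_ids))
        # Question-aware context representation: each context token attends
        # to the question, so type-specific cues in the query (e.g.
        # "declaration region") re-weight the context tokens.
        c_aware, _ = self.attn(query=c, key=q, value=q)
        h = c + c_aware
        return self.start_head(h), self.end_head(h)

# Toy usage with random token ids; real input would be tokenized
# (question, context) pairs built from type-specific queries.
model = MRCEntityExtractor()
question = torch.randint(1, 21128, (1, 20))
context = torch.randint(1, 21128, (1, 60))
start_logits, end_logits = model(question, context)
print(start_logits.shape, end_logits.shape)  # (1, 60, 2) twice
```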

Keywords: Digital Humanities; Intangible Cultural Heritage; Named Entity Recognition; Attention Mechanism; Machine Reading Comprehension
Received: 28 February 2022      Published: 03 February 2023
ZTFLH:  G202  
Fund:National Natural Science Foundation of China(72074108);special project of Nanjing University Liberal Arts Youth Interdisciplinary Team(010814370113);Jiangsu Young Social Science Talents and Nanjing University “Tang Scholar”
Corresponding Author: Wang Hao, ORCID: 0000-0002-0131-0823, E-mail: ywhaowang@nju.edu.cn

Cite this article:

Fan Tao, Wang Hao, Zhang Wei, Li Xiaomin. Extracting Entities from Intangible Cultural Heritage Texts Based on Machine Reading Comprehension. Data Analysis and Knowledge Discovery, 2022, 6(12): 70-79.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.0165     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I12/70

Entity Type | Question
ICH Name | Find the ICH names or aliases in the sentence.
Organization | Find organizations in the sentence, such as ICH protection units, declaring organizations, and local associations.
Time | Find times in the sentence, such as the ICH declaration time, origin era, and celebration time.
Place | Find concrete or abstract places in the sentence, such as the ICH declaration region, prevalent region, and location.
Question Generation Guided by Annotation Rules
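A minimal sketch of how such annotation-rule-guided questions could be paired with a sentence to form MRC inputs is shown below; the English question strings paraphrase the Chinese queries in the table, and the dictionary, function, and example sentence are illustrative.

```python
# Minimal sketch: map each entity type to its annotation-rule-guided question
# and build one (entity_type, question, context) MRC sample per type.
ENTITY_QUESTIONS = {
    "ICH Name":     "Find the ICH names or aliases in the sentence.",
    "Organization": "Find organizations in the sentence, such as ICH protection "
                    "units, declaring organizations, and local associations.",
    "Time":         "Find times in the sentence, such as the ICH declaration "
                    "time, origin era, and celebration time.",
    "Place":        "Find concrete or abstract places in the sentence, such as "
                    "the ICH declaration region, prevalent region, and location.",
}

def build_mrc_samples(sentence):
    """A single sentence yields one MRC query per entity type."""
    return [(etype, q, sentence) for etype, q in ENTITY_QUESTIONS.items()]

for etype, question, context in build_mrc_samples(
        "The dragon boat festival of Zigui was inscribed in 2009."):
    print(etype, "->", question)
```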
Model Structure
Model F1 (%)
BiLSTM-CRF 83.757
BiLSTM-Att-CRF 85.346
BiGRU-CRF 85.293
BERT-CRF 76.720
BERT-MRC 79.442
ICHQA 87.139
Experimental Results
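The F1 values in the table above are entity-level scores. A minimal sketch of such an exact-span F1, assuming predicted and gold entities are compared as (type, start, end) triples, is given below; the spans in the usage example are illustrative.

```python
# Minimal sketch: entity-level precision/recall/F1 over exact span matches.
def entity_f1(pred_spans, gold_spans):
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)                      # exact (type, start, end) matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [("Place", 0, 2), ("Time", 10, 14)]
pred = [("Place", 0, 2), ("Time", 10, 13)]   # end offset of the Time span is wrong
print(f"{entity_f1(pred, gold):.3f}")        # 0.500
```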
Model Time Cost (seconds/epoch)
BiLSTM-CRF 79.0
BiLSTM-Att-CRF 81.5
BiGRU-CRF 77.6
BERT-CRF >100
BERT-MRC >100
ICHQA 41.2
Decoding Time
Question | Content
Q1: Wikipedia definition | Geographic location refers to the spatial relationship between a thing on the Earth's surface and other things.
Q2: Keyword | Geographic location
Q3: Synonyms | Position; orientation; location
Q4: Keyword + synonyms | Geographic location; position; orientation; location
Q5: Template-based question | What geographic locations are in the sentence?
Q6: Question guided by annotation rules | Find concrete or abstract places in the sentence, such as the ICH declaration region, prevalent region, and location.
Questions Generated in Different Ways
Query F1 (%)
Q1 87.081
Q2 86.823
Q3 86.760
Q4 86.893
Q5 87.013
Q6 87.139
Impact of Different Question Generation Methods on the Model
Model F1 (%)
W/O Attention 86.078
ICHQA 87.139
Impacts of Different Structures on the Performance of the Model
Performance of ICHQA and BiLSTM-Att-CRF with Different Scales of Training Data
Visualization of Place Entities Using the ICH Entity-Sensitive Attention Mechanism
Visualization of ICH Name Entities Using the ICH Entity-Sensitive Attention Mechanism