Chinese-Tibetan Bilingual Named Entity Recognition for Traditional Tibetan Festivals
Deng Yuyang1, Wu Dan1,2
1 School of Information Management, Wuhan University, Wuhan 430072, China
2 Center for Studies of Human-Computer Interaction and User Behavior, Wuhan University, Wuhan 430072, China
Abstract [Objective] This paper examines the performance of pre-trained models on resource-scarce languages and supports the construction of Tibetan knowledge graphs and semantic retrieval. [Methods] We collected Chinese-Tibetan bilingual texts related to traditional Tibetan festivals from websites such as People's Daily and its Tibetan Edition. We then compared multiple pre-trained language models and word embeddings on named entity recognition in this Chinese-Tibetan bilingual context, and analyzed the impact of two feature processing layers (BiLSTM and CRF) in the named entity recognition model. [Results] Compared with word embeddings, the Chinese and Tibetan pre-trained language models improved F1 scores by 0.0108 and 0.0590, respectively. When entities are sparse, the pre-trained models extract more textual information than word embeddings and reduce training time by 40%. [Limitations] The Tibetan and Chinese data are not parallel corpora, and the Tibetan data contains fewer entities than the Chinese data. [Conclusions] Pre-trained models deliver strong performance not only on Chinese text but also on Tibetan, a resource-scarce language.
Received: 07 July 2022
Published: 09 November 2022
Fund: National Social Science Fund of China (19ZDA341)
Corresponding Author: Wu Dan, ORCID: 0000-0002-2611-7317, E-mail: woodan@whu.edu.cn
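
The model family compared in the abstract pairs a pre-trained encoder with BiLSTM and CRF feature layers. Below is a minimal sketch of such a tagger in PyTorch, assuming the Hugging Face transformers and pytorch-crf packages; the encoder name, hidden size, and tag set are illustrative assumptions, not the authors' exact configuration.

import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF

class BertBiLSTMCRF(nn.Module):
    def __init__(self, encoder_name: str, num_tags: int, lstm_hidden: int = 256):
        super().__init__()
        # Contextual encoder, e.g. a Chinese or Tibetan pre-trained model.
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # BiLSTM feature layer over the encoder's token representations.
        self.bilstm = nn.LSTM(hidden, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_tags)
        # CRF layer models transitions between BIO entity tags.
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        states, _ = self.bilstm(states)
        emissions = self.classifier(states)
        mask = attention_mask.bool()
        if tags is not None:
            # Training: negative log-likelihood of the gold tag sequence.
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        # Inference: Viterbi-decoded best tag sequence per sentence.
        return self.crf.decode(emissions, mask=mask)

With this sketch, a Chinese encoder such as hfl/chinese-roberta-wwm-ext (or a model covering Tibetan) could be passed as encoder_name; dropping the CRF or BiLSTM components would roughly reproduce the ablation of the two feature processing layers described in the abstract. A word-embedding baseline would replace the encoder with a static embedding lookup.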