Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (7): 125-135    DOI: 10.11925/infotech.2096-3467.2022.0698
Chinese-Tibetan Bilingual Named Entity Recognition for Traditional Tibetan Festivals
Deng Yuyang1,Wu Dan1,2()
1School of Information Management, Wuhan University, Wuhan 430072, China
2Center for Studies of Human-Computer Interaction and User Behavior, Wuhan University, Wuhan 430072, China
Abstract  

[Objective] This paper examines how pre-trained language models perform on a resource-scarce language and supports the construction of Tibetan knowledge graphs and semantic retrieval. [Methods] We collected Chinese-Tibetan bilingual texts on traditional Tibetan festivals from websites such as People's Daily and its Tibetan edition. We then compared multiple pre-trained language models against word embeddings on named entity recognition in this bilingual setting, and analyzed the contribution of two feature-processing layers (BiLSTM and CRF) to the recognition model. [Results] Compared with word embeddings, the pre-trained language models improved F1 by 0.0108 on Chinese and 0.0590 on Tibetan. When entities are scarce, the pre-trained models extracted more textual information than word embeddings and reduced training time by 40%. [Limitations] The Chinese and Tibetan data are not parallel corpora, and the Tibetan data contains fewer entities than the Chinese data. [Conclusions] Pre-trained models not only perform strongly on Chinese text but also perform well on Tibetan, a resource-scarce language.

Key words: Named Entity Recognition; Tibetan Traditional Culture; Pretrained Language Model
Received: 07 July 2022      Published: 09 November 2022
ZTFLH: TP391; G350
Fund: National Social Science Fund of China (19ZDA341)
Corresponding Author: Wu Dan, ORCID: 0000-0002-2611-7317, E-mail: woodan@whu.edu.cn

Cite this article:

Deng Yuyang, Wu Dan. Chinese-Tibetan Bilingual Named Entity Recognition for Traditional Tibetan Festivals. Data Analysis and Knowledge Discovery, 2023, 7(7): 125-135.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.0698     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2023/V7/I7/125

Figure: Overall Model Structure
Figure: Word Embedding Layer
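Below is a minimal sketch of the pretrained-model + BiLSTM + CRF architecture the figures describe, assuming PyTorch with the transformers and pytorch-crf libraries; the checkpoint name, hidden sizes, and tag count are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch only: pretrained encoder -> BiLSTM feature layer -> CRF tagger.
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF  # pip install pytorch-crf


class PretrainedBiLSTMCRF(nn.Module):
    # num_tags=9 assumes 4 entity categories in a BIO scheme (4*2 + O).
    def __init__(self, pretrained_name="hfl/chinese-roberta-wwm-ext",
                 num_tags=9, lstm_hidden=256):
        super().__init__()
        # Contextual token representations from a pretrained encoder.
        self.encoder = AutoModel.from_pretrained(pretrained_name)
        # BiLSTM feature-processing layer over the encoder outputs.
        self.bilstm = nn.LSTM(self.encoder.config.hidden_size, lstm_hidden,
                              batch_first=True, bidirectional=True)
        # Per-token tag scores (emissions) for the CRF.
        self.classifier = nn.Linear(2 * lstm_hidden, num_tags)
        # CRF layer models tag-transition constraints (e.g. I- cannot follow O).
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        feats, _ = self.bilstm(hidden)
        emissions = self.classifier(feats)
        mask = attention_mask.bool()
        if tags is not None:
            # Training: negative log-likelihood of the gold tag sequence.
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        # Inference: Viterbi decoding of the best tag sequence.
        return self.crf.decode(emissions, mask=mask)
```

Swapping the encoder checkpoint (BERT, ALBERT, ERNIE, RoBERTa) or replacing it with static word vectors yields the model variants compared in the tables below.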
工布新年 (Kongpo New Year)        江孜达玛节 (Gyantse Dama Festival)
香浪节 (Xianglang Festival)       赛马节 (Horse Racing Festival)
日喀则新年 (Shigatse New Year)    藏历新年 (Tibetan New Year)
女儿节 (Girls' Festival)          萨噶达瓦节 (Saga Dawa Festival)
雪顿节 (Shoton Festival)          沐浴节 (Bathing Festival)
望果节 (Ongkor Festival)          普兰新年 (Purang New Year)
Table: Chinese-Tibetan Bilingual Glossary of Tibetan Festivals (the Tibetan-script column is not rendered here)
Figure: Data Example
Category       Festival Entities  Item Entities  Event Entities  Location Entities
Chinese data   1,154              1,481          847             68
Tibetan data   1,674              344            284             255
Table: Statistical Results of Entities
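As a worked illustration of how per-category counts like those above could be tallied from annotated data, here is a small sketch that counts entities in BIO-tagged sequences; the tag scheme and label names (FES, ITE, EVE, LOC) are hypothetical, not the paper's annotation labels.

```python
# Count entities per category in BIO-tagged sentences: each B- tag opens
# exactly one entity of its category.
from collections import Counter

def count_entities(tag_sequences):
    counts = Counter()
    for tags in tag_sequences:
        for tag in tags:
            if tag.startswith("B-"):
                counts[tag[2:]] += 1
    return counts

sample = [["B-FES", "I-FES", "O", "B-LOC", "O"],
          ["B-EVE", "I-EVE", "B-ITE", "O"]]
print(count_entities(sample))  # Counter({'FES': 1, 'LOC': 1, 'EVE': 1, 'ITE': 1})
```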
Model                           Precision  Recall   F1
Chinese BERT + BiLSTM-CRF       93.22%     89.76%   91.29%
Chinese ALBERT + BiLSTM-CRF     88.40%     92.39%   90.34%
Chinese ERNIE + BiLSTM-CRF      85.68%     89.31%   87.45%
Chinese RoBERTa + BiLSTM-CRF    93.97%     92.32%   93.05%
Table: Comparison of Chinese Pretrained Models
Language  Model                           Precision  Recall   F1
Chinese   fastText vectors + BiLSTM-CRF   94.65%     89.64%   91.97%
Chinese   RoBERTa + BiLSTM-CRF            93.97%     92.32%   93.05%
Tibetan   fastText vectors + BiLSTM-CRF   84.97%     76.68%   80.37%
Tibetan   RoBERTa + BiLSTM-CRF            83.40%     89.60%   86.27%
Table: Comparison of Pretrained and Word Vector Models
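For contrast with the pretrained encoders, a sketch of the static word-vector baseline: each token receives the same fastText vector in every sentence, which is exactly the difference the table above measures. The checkpoint file name follows the official fastText releases for Chinese and is an assumption about the setup.

```python
# Static fastText embeddings as input features for the BiLSTM-CRF baseline.
import numpy as np
import fasttext  # pip install fasttext

ft = fasttext.load_model("cc.zh.300.bin")  # assumed local path to the vectors

def embed_tokens(tokens):
    # One fixed 300-d vector per token, independent of sentence context;
    # subword information lets fastText embed out-of-vocabulary tokens too.
    return np.stack([ft.get_word_vector(t) for t in tokens])

vecs = embed_tokens(["雪顿节", "在", "拉萨", "举行"])
print(vecs.shape)  # (4, 300)
```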
Model                           Festival  Event    Item     Location
Chinese fastText + BiLSTM-CRF   97.66%    89.02%   88.20%   93.02%
Tibetan fastText + BiLSTM-CRF   95.68%    64.44%   76.34%   85.00%
Chinese RoBERTa + BiLSTM-CRF    96.96%    90.97%   88.80%   95.54%
Tibetan RoBERTa + BiLSTM-CRF    96.41%    74.78%   86.20%   87.69%
Table: Entity-level F1 Scores
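Entity-level scores of the kind reported above can be computed with the seqeval library, which counts a prediction as correct only when both span boundaries and category match the gold annotation. A toy example (not the paper's outputs):

```python
# Entity-level precision/recall/F1 over BIO tag sequences with seqeval.
from seqeval.metrics import precision_score, recall_score, f1_score

gold = [["B-FES", "I-FES", "O", "B-LOC"]]
pred = [["B-FES", "I-FES", "O", "O"]]

print(precision_score(gold, pred))  # 1.0  (the one predicted span is correct)
print(recall_score(gold, pred))     # 0.5  (one of two gold spans was found)
print(f1_score(gold, pred))         # ~0.667
```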
Figure: Model Training Curves
Language  Model                   Precision  Recall   F1
Chinese   RoBERTa + CRF           81.43%     77.09%   79.09%
Chinese   RoBERTa + BiLSTM        92.51%     89.99%   91.13%
Chinese   RoBERTa + BiLSTM-CRF    93.97%     92.32%   93.05%
Tibetan   RoBERTa + CRF           71.25%     40.79%   49.79%
Tibetan   RoBERTa + BiLSTM        82.19%     84.98%   83.46%
Tibetan   RoBERTa + BiLSTM-CRF    83.40%     89.60%   86.27%
Table: Ablation Experiments
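At inference time, the contrast between the BiLSTM-only and BiLSTM-CRF variants in the ablation comes down to decoding: without a CRF, tags are chosen independently per token; with one, Viterbi decoding scores whole sequences. A hedged sketch of the difference, reusing the torchcrf usage from the architecture sketch above (emissions are toy scores):

```python
# CRF ablation in miniature: per-token argmax vs. Viterbi decoding.
import torch
from torchcrf import CRF

num_tags, seq_len = 9, 6
emissions = torch.randn(1, seq_len, num_tags)  # toy per-token tag scores

# Ablated model (no CRF): independent per-token argmax; nothing prevents
# invalid transitions such as an I- tag directly following O.
argmax_tags = emissions.argmax(dim=-1).tolist()

# Full model: the CRF transition matrix (learned in training, random here)
# scores the sequence jointly, so decoding can differ from per-token argmax.
crf = CRF(num_tags, batch_first=True)
viterbi_tags = crf.decode(emissions)
print(argmax_tags, viterbi_tags)
```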
Figure: System Interface