Chinese-Tibetan Bilingual Named Entity Recognition for Traditional Tibetan Festivals
Deng Yuyang1, Wu Dan1,2
1 School of Information Management, Wuhan University, Wuhan 430072, China
2 Center for Studies of Human-Computer Interaction and User Behavior, Wuhan University, Wuhan 430072, China
Abstract [Objective] This paper examines the performance of pre-trained models on resource-scarce languages and supports the construction of Tibetan knowledge graphs and semantic retrieval. [Methods] We collected Chinese-Tibetan bilingual texts related to traditional Tibetan festivals from websites such as People's Daily and its Tibetan Edition. We then compared multiple pre-trained language models and word embeddings on named entity recognition in this Chinese-Tibetan bilingual context, and analyzed the impact of two feature processing layers (BiLSTM and CRF) in the named entity recognition model. [Results] Compared with word embeddings, the Chinese and Tibetan pre-trained language models improved F1 scores by 0.0108 and 0.0590, respectively. When entities are sparse, the pre-trained models extract more textual information than word embeddings and reduce training time by 40%. [Limitations] The Tibetan and Chinese data are not parallel corpora, and the Tibetan data contains fewer entities than the Chinese data. [Conclusions] Pre-trained models deliver strong performance not only on Chinese text but also on Tibetan, a resource-scarce language.
Received: 07 July 2022
Published: 09 November 2022
Fund: National Social Science Fund of China (19ZDA341)
Corresponding Author: Wu Dan, ORCID: 0000-0002-2611-7317, E-mail: woodan@whu.edu.cn
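
The model family compared in the abstract pairs a pre-trained encoder with BiLSTM and CRF feature layers. Below is a minimal sketch of such a tagger in PyTorch, assuming the Hugging Face transformers and pytorch-crf packages; the encoder name, hidden size, and tag set are illustrative assumptions, not the authors' exact configuration.

import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF

class BertBiLSTMCRF(nn.Module):
    def __init__(self, encoder_name: str, num_tags: int, lstm_hidden: int = 256):
        super().__init__()
        # Contextual encoder, e.g. a Chinese or Tibetan pre-trained model.
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # BiLSTM feature layer over the encoder's token representations.
        self.bilstm = nn.LSTM(hidden, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_tags)
        # CRF layer models transitions between BIO entity tags.
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        states, _ = self.bilstm(states)
        emissions = self.classifier(states)
        mask = attention_mask.bool()
        if tags is not None:
            # Training: negative log-likelihood of the gold tag sequence.
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        # Inference: Viterbi-decoded best tag sequence per sentence.
        return self.crf.decode(emissions, mask=mask)

With this sketch, a Chinese encoder such as hfl/chinese-roberta-wwm-ext (or a model covering Tibetan) could be passed as encoder_name; dropping the CRF or BiLSTM components would roughly reproduce the ablation of the two feature processing layers described in the abstract. A word-embedding baseline would replace the encoder with a static embedding lookup.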