Data Analysis and Knowledge Discovery, 2023, Vol. 7, Issue 7: 125-135     https://doi.org/10.11925/infotech.2096-3467.2022.0698
Research Article
Chinese-Tibetan Bilingual Named Entity Recognition for Traditional Tibetan Festivals
Deng Yuyang 1, Wu Dan 1,2
1School of Information Management, Wuhan University, Wuhan 430072, China
2Center for Studies of Human-Computer Interaction and User Behavior, Wuhan University, Wuhan 430072, China
Abstract

[Objective] This paper examines the performance of pre-trained language models in resource-scarce languages, supporting the construction of Tibetan knowledge graphs and semantic retrieval. [Methods] We collected Chinese-Tibetan bilingual texts on traditional Tibetan festivals from news websites such as People's Daily Online and its Tibetan edition, compared multiple pre-trained language models with word embeddings on the named entity recognition task in a Chinese-Tibetan bilingual setting, and analyzed the impact of the two feature processing layers (BiLSTM and CRF) of the recognition model on the results. [Results] Compared with word embeddings, the Chinese and Tibetan pre-trained language models improved F1 by 0.0108 and 0.0590, respectively. In scenarios with few entities, the pre-trained models extracted more textual information than word embeddings while reducing training time by 40%. [Limitations] The Tibetan and Chinese data are not parallel corpora, and the Tibetan data contain fewer entities than the Chinese data. [Conclusions] Pre-trained language models not only perform well on Chinese text but also achieve strong results in Tibetan, a resource-scarce language.

Key words: Named Entity Recognition; Tibetan Traditional Culture; Pretrained Language Model
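
The architecture described in the abstract (a pre-trained encoder feeding a BiLSTM feature layer and a CRF decoding layer) can be illustrated with a minimal sketch. This is an illustrative reconstruction rather than the authors' released code: the checkpoint name hfl/chinese-roberta-wwm-ext, the BIO tag set, and the hidden size are assumptions, and the CRF layer comes from the third-party pytorch-crf package.

```python
# Minimal sketch of a RoBERTa-BiLSTM-CRF tagger (illustrative; not the authors' code).
# Requires the `torch`, `transformers`, and `pytorch-crf` packages.
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF

# Assumed BIO tags for the four entity types in Table 2: festival, event, item, location.
TAGS = ["O", "B-FES", "I-FES", "B-EVE", "I-EVE", "B-ITEM", "I-ITEM", "B-LOC", "I-LOC"]

class RobertaBiLstmCrf(nn.Module):
    def __init__(self, encoder_name="hfl/chinese-roberta-wwm-ext", hidden=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)       # pre-trained language model
        self.bilstm = nn.LSTM(self.encoder.config.hidden_size, hidden,
                              batch_first=True, bidirectional=True)  # feature processing layer 1
        self.classifier = nn.Linear(2 * hidden, len(TAGS))           # per-token emission scores
        self.crf = CRF(len(TAGS), batch_first=True)                  # feature processing layer 2

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        feats, _ = self.bilstm(hidden)
        emissions = self.classifier(feats)
        mask = attention_mask.bool()
        if labels is not None:
            # training: negative log-likelihood of the gold tag sequence under the CRF
            return -self.crf(emissions, labels, mask=mask, reduction="mean")
        # inference: Viterbi-decoded best tag sequence per sentence
        return self.crf.decode(emissions, mask=mask)
```

Swapping the encoder for static fastText vectors, or removing the BiLSTM or CRF layer, corresponds to the comparison and ablation settings summarized in Tables 3-6.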
Received: 2022-07-07      Published: 2022-11-09
CLC Number: TP391; G350
Funding: Supported by the Major Program of the National Social Science Fund of China (Grant No. 19ZDA341).
Corresponding author: Wu Dan, ORCID: 0000-0002-2611-7317, E-mail: woodan@whu.edu.cn.
Cite this article:
Deng Yuyang, Wu Dan. Chinese-Tibetan Bilingual Named Entity Recognition for Traditional Tibetan Festivals. Data Analysis and Knowledge Discovery, 2023, 7(7): 125-135.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.0698      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2023/V7/I7/125
Fig. 1  Overall architecture of the model
Fig. 2  Structure of the model's word embedding layer
Festival (Chinese name, with English gloss); the Tibetan-script names in the original table are not reproduced here:
工布新年 (Kongpo New Year)    江孜达玛节 (Gyantse Dama Festival)
香浪节 (Xianglang Festival)    赛马节 (Horse Racing Festival)
日喀则新年 (Shigatse New Year)    藏历新年 (Tibetan New Year, Losar)
女儿节 (Girls' Festival)    萨噶达瓦节 (Saga Dawa Festival)
雪顿节 (Shoton Festival)    沐浴节 (Bathing Festival)
望果节 (Ongkor Festival)    普兰新年 (Purang New Year)
Table 1  Chinese and Tibetan names of traditional Tibetan festivals
Fig. 3  Examples of Tibetan and Chinese data
Category    Festival entities    Item entities    Event entities    Location entities
Chinese data    1,154    1,481    847    68
Tibetan data    1,674    344    284    255
Table 2  Entity counts in the dataset
Model    Precision    Recall    F1
Chinese BERT-BiLSTM-CRF    93.22%    89.76%    91.29%
Chinese ALBERT-BiLSTM-CRF    88.40%    92.39%    90.34%
Chinese ERNIE-BiLSTM-CRF    85.68%    89.31%    87.45%
Chinese RoBERTa-BiLSTM-CRF    93.97%    92.32%    93.05%
Table 3  Comparison of Chinese pre-trained models
Language    Model    Precision    Recall    F1
Chinese    fastText embeddings-BiLSTM-CRF    94.65%    89.64%    91.97%
Chinese    RoBERTa-BiLSTM-CRF    93.97%    92.32%    93.05%
Tibetan    fastText embeddings-BiLSTM-CRF    84.97%    76.68%    80.37%
Tibetan    RoBERTa-BiLSTM-CRF    83.40%    89.60%    86.27%
Table 4  Comparison of pre-trained models and word embedding models
Model    Festival entities    Event entities    Item entities    Location entities
Chinese fastText embeddings-BiLSTM-CRF    97.66%    89.02%    88.20%    93.02%
Tibetan fastText embeddings-BiLSTM-CRF    95.68%    64.44%    76.34%    85.00%
Chinese RoBERTa-BiLSTM-CRF    96.96%    90.97%    88.80%    95.54%
Tibetan RoBERTa-BiLSTM-CRF    96.41%    74.78%    86.20%    87.69%
Table 5  Entity-level F1 scores
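
The entity-level F1 values in Table 5 count a prediction as correct only when both the entity boundary and the entity type exactly match the gold annotation. A minimal sketch of this computation, assuming the third-party seqeval package and illustrative BIO tags (the paper does not specify its evaluation toolkit):

```python
# Illustrative entity-level precision/recall/F1 over BIO-tagged sequences,
# assuming the `seqeval` package; tag names are placeholders.
from seqeval.metrics import precision_score, recall_score, f1_score

gold = [["B-FES", "I-FES", "O", "B-LOC"]]   # two gold entities: one festival, one location
pred = [["B-FES", "I-FES", "O", "O"]]       # one predicted entity, which matches exactly

# precision = 1/1 = 1.00, recall = 1/2 = 0.50, F1 = 2*1.00*0.50/(1.00+0.50) ≈ 0.67
print(precision_score(gold, pred), recall_score(gold, pred), f1_score(gold, pred))
```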
Fig. 4  Model training curves
Language    Model    Precision    Recall    F1
Chinese    RoBERTa-CRF    81.43%    77.09%    79.09%
Chinese    RoBERTa-BiLSTM    92.51%    89.99%    91.13%
Chinese    RoBERTa-BiLSTM-CRF    93.97%    92.32%    93.05%
Tibetan    RoBERTa-CRF    71.25%    40.79%    49.79%
Tibetan    RoBERTa-BiLSTM    82.19%    84.98%    83.46%
Tibetan    RoBERTa-BiLSTM-CRF    83.40%    89.60%    86.27%
Table 6  Ablation experiments
Fig. 5  System interface