Intelligent Completion of Ancient Texts Based on Pre-trained Language Models
Li Jiajun1, Ming Can1, Guo Zhihao1, Qian Tieyun1,2, Peng Zhiyong1,2, Wang Xiaoguang2,3, Li Xuhui2,3, Li Jing2,4 (corresponding author)

1 School of Computer Science, Wuhan University, Wuhan 430072, China
2 Intellectual Computing Laboratory for Cultural Heritage, Wuhan University, Wuhan 430072, China
3 School of Information Management, Wuhan University, Wuhan 430072, China
4 School of History, Wuhan University, Wuhan 430072, China
Abstract [Objective] This paper proposes a new method for completing ancient texts based on pre-trained language models, exploiting representations obtained from pre-trained models at different semantic levels and for simplified and traditional Chinese characters. The method builds a mixture-of-experts system and a simplified-traditional Chinese fusion model to complete ancient texts. [Methods] We designed a mixture-of-experts model for transmitted texts and a simplified-traditional Chinese character fusion model for excavated texts, integrating the models' strengths in each scenario to improve completion quality. [Results] We evaluated the new models on self-constructed datasets of transmitted and excavated texts, where they achieved completion accuracies of 70.14% and 57.13%, respectively. [Limitations] We relied solely on natural language processing techniques. Future work could leverage multimodal methods that combine computer vision with natural language processing, integrating image and semantic information for better results. [Conclusions] The proposed models achieve high accuracy on the constructed datasets of ancient literature, providing a competitive solution for completing ancient texts.
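The mixture-of-experts idea in the abstract can be illustrated with a toy sketch: several expert encoders each produce a hidden representation for the position to be completed, a learned gate weights the experts, and the fused vector is projected onto a candidate-character vocabulary. This is a minimal illustrative sketch only; all names, dimensions, and the random initialization below are assumptions for demonstration, not the paper's actual implementation (which builds on pre-trained language-model encoders).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ToyMixtureOfExperts:
    """Toy mixture-of-experts head: a gate weights the hidden vectors
    produced by several expert encoders, and the fused representation
    is projected onto the candidate-character vocabulary."""

    def __init__(self, n_experts, hidden, vocab, seed=0):
        rng = np.random.default_rng(seed)
        self.gate_w = rng.normal(size=(hidden, n_experts)) * 0.02
        self.out_w = rng.normal(size=(hidden, vocab)) * 0.02

    def forward(self, expert_states):
        # expert_states: (n_experts, hidden), one vector per expert
        # for the masked position to be completed.
        pooled = expert_states.mean(axis=0)              # shared gate input
        gate = softmax(pooled @ self.gate_w)             # (n_experts,)
        fused = (gate[:, None] * expert_states).sum(0)   # weighted fusion
        return softmax(fused @ self.out_w)               # char distribution

moe = ToyMixtureOfExperts(n_experts=3, hidden=8, vocab=50)
probs = moe.forward(np.random.default_rng(1).normal(size=(3, 8)))
```

In the paper's setting, each "expert" would be a pre-trained encoder capturing a different semantic level or script variant, and the gate learns which expert to trust for a given context.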
Received: 04 March 2023
Published: 17 April 2024
Fund: National Social Science Fund of China (21&ZD334)
Corresponding author: Li Jing, ORCID: 0009-0006-9458-8379, E-mail: ljwhu@163.com.