Data Analysis and Knowledge Discovery  2024, Vol. 8 Issue (5): 59-67    DOI: 10.11925/infotech.2096-3467.2023.0163
Intelligent Completion of Ancient Texts Based on Pre-trained Language Models
Li Jiajun 1, Ming Can 1, Guo Zhihao 1, Qian Tieyun 1,2, Peng Zhiyong 1,2, Wang Xiaoguang 2,3, Li Xuhui 2,3, Li Jing 2,4 (Corresponding author)
1School of Computer Science, Wuhan University, Wuhan 430072, China
2Intellectual Computing Laboratory for Cultural Heritage, Wuhan University, Wuhan 430072, China
3School of Information Management, Wuhan University, Wuhan 430072, China
4School of History, Wuhan University, Wuhan 430072, China
Abstract  

[Objective] This paper proposes a new method for completing ancient texts based on pre-trained language models, exploiting representations from pre-trained models at different semantic levels and for simplified and traditional Chinese characters. The method builds a mixture-of-experts system and a simplified-traditional Chinese fusion model to complete ancient texts. [Methods] We designed a mixture-of-experts model for transmitted texts and a simplified-traditional Chinese character fusion model for excavated texts, integrating the two models' strengths in their respective scenarios to improve completion performance. [Results] We evaluated the models on self-constructed datasets of transmitted and excavated texts, where they achieved completion accuracies of 70.14% and 57.13%, respectively. [Limitations] We only used natural language processing techniques. Future work could apply multimodal methods that combine computer vision with natural language processing, integrating image and semantic information for better results. [Conclusions] The proposed models achieve high accuracy on the constructed datasets of ancient literature and provide a competitive solution for completing ancient texts.
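To illustrate the mixture-of-experts idea summarized in the abstract, the sketch below gates between two pre-trained ancient-Chinese masked language models when predicting masked characters. It is a minimal illustration, not the authors' reported architecture: the checkpoint names (ethanyt/guwenbert-base, SIKU-BERT/sikuroberta), the gating design, and the assumption that both experts share one tokenizer and vocabulary are all assumptions made for the example.

# Minimal sketch of a two-expert mixture for masked-character completion.
# Checkpoint names, the gating design, and the shared-vocabulary assumption
# are illustrative; they are not the configuration reported in the paper.
import torch
import torch.nn as nn
from transformers import AutoModelForMaskedLM

class TwoExpertMoE(nn.Module):
    def __init__(self, expert_names=("ethanyt/guwenbert-base", "SIKU-BERT/sikuroberta")):
        super().__init__()
        self.experts = nn.ModuleList(
            AutoModelForMaskedLM.from_pretrained(name) for name in expert_names
        )
        hidden = self.experts[0].config.hidden_size
        # Gating network: maps a sentence representation to per-expert weights.
        self.gate = nn.Linear(hidden, len(self.experts))

    def forward(self, input_ids, attention_mask):
        all_logits, gate_features = [], None
        for expert in self.experts:
            out = expert(input_ids=input_ids, attention_mask=attention_mask,
                         output_hidden_states=True)
            all_logits.append(out.logits)                      # (batch, seq, vocab)
            if gate_features is None:
                gate_features = out.hidden_states[-1][:, 0]    # [CLS] state of expert 1
        weights = torch.softmax(self.gate(gate_features), dim=-1)   # (batch, n_experts)
        stacked = torch.stack(all_logits, dim=1)               # (batch, n_experts, seq, vocab)
        # Weighted sum of the experts' predictions at every position.
        return (weights[:, :, None, None] * stacked).sum(dim=1)

At inference time, the fused logits at each [MASK] position can be ranked to produce candidate characters, which is how accuracy and MRR on the completion task would be computed in such a setup.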

Key words: Digitization of Ancient Books; Pre-trained Language Models; Mixture-of-Experts Systems
Received: 04 March 2023      Published: 17 April 2024
ZTFLH: G350; TP391
Fund: National Social Science Fund of China (21&ZD334)
Corresponding Author: Li Jing, ORCID: 0009-0006-9458-8379, E-mail: ljwhu@163.com.

Cite this article:

Li Jiajun, Ming Can, Guo Zhihao, Qian Tieyun, Peng Zhiyong, Wang Xiaoguang, Li Xuhui, Li Jing. Intelligent Completion of Ancient Texts Based on Pre-trained Language Models. Data Analysis and Knowledge Discovery, 2024, 8(5): 59-67.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2023.0163     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2024/V8/I5/59

Item                  Content
Original text         天地不仁,以万物为刍狗;圣人不仁,以百姓为刍狗。
Segmented text        天地/不仁,以/万物/为/刍狗;圣人/不仁,以/百姓/为/刍狗/。
Random masking        天地[MASK]仁,[MASK]万物为刍狗;圣[MASK]不仁,以百[MASK]为刍狗。
Whole-word masking    天地不仁,以[MASK][MASK]为刍狗;圣人[MASK][MASK],以百姓为刍狗。
Span masking          天地不仁,以万[MASK][MASK][MASK][MASK];圣人不仁,以百姓为刍狗。
Example of Subtask Training Data
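The table above shows the three masking strategies used to build the subtask training data. The following sketch generates comparable random, whole-word, and span masks over the same sentence; the 15% masking rate, the fixed span length, and the pre-segmented input are assumptions made for illustration, not the paper's exact preprocessing.

# Illustrative sketch of the three masking strategies from the table above:
# random (character-level), whole-word, and span masking.
# The 15% masking budget and the segmentation input are assumptions.
import random

MASK = "[MASK]"
PUNCT = {",", ";", "。"}

def random_mask(text, rate=0.15):
    chars = list(text)
    candidates = [i for i, c in enumerate(chars) if c not in PUNCT]
    for i in random.sample(candidates, max(1, int(len(candidates) * rate))):
        chars[i] = MASK
    return "".join(chars)

def whole_word_mask(words, rate=0.15):
    # `words` is the segmented sentence, e.g. ["天地", "不仁", ",", "以", ...].
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w not in PUNCT]
    for i in random.sample(candidates, max(1, int(len(candidates) * rate))):
        out[i] = MASK * len(out[i])   # one [MASK] per character in the word
    return "".join(out)

def span_mask(text, span_len=4):
    # Mask one contiguous span of `span_len` characters, skipping punctuation.
    chars = list(text)
    start = random.randrange(0, max(1, len(chars) - span_len))
    for i in range(start, start + span_len):
        if chars[i] not in PUNCT:
            chars[i] = MASK
    return "".join(chars)

sentence = "天地不仁,以万物为刍狗;圣人不仁,以百姓为刍狗。"
segmented = ["天地", "不仁", ",", "以", "万物", "为", "刍狗", ";",
             "圣人", "不仁", ",", "以", "百姓", "为", "刍狗", "。"]
print(random_mask(sentence))
print(whole_word_mask(segmented))
print(span_mask(sentence))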
A Mixture-of-Experts System-Based Model for Completing Ancient Books
A Simplified-Traditional Chinese Character Fusion Model
Model          Corpus                    Character type         Parameters
GuwenBERT      Daizhige (殆知阁)           Simplified Chinese     103.96MB
SikuRoBERTa    Siku Quanshu (四库全书)     Traditional Chinese    108.95MB
Comparison Between the Simplified and Traditional Chinese Pre-trained Models
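The table above lists the two pre-trained models that the fusion approach draws on. One plausible reading of the simplified-traditional fusion, sketched below, is to score each masked position with both models and average their probabilities after mapping predictions to a common (simplified) character form. The checkpoint names, the OpenCC-based conversion, and the equal-weight averaging are assumptions for the example, not the paper's exact design.

# Hedged sketch of fusing a simplified-character and a traditional-character
# pre-trained model at one [MASK] position. Checkpoint names, OpenCC mapping,
# and simple probability averaging are illustrative assumptions.
import torch
from opencc import OpenCC
from transformers import AutoTokenizer, AutoModelForMaskedLM

t2s = OpenCC("t2s")   # traditional -> simplified converter

def masked_probs(model_name, text):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name).eval()
    enc = tok(text, return_tensors="pt")
    pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero()[0, 0]
    with torch.no_grad():
        probs = model(**enc).logits[0, pos].softmax(-1)
    # Map every vocabulary token to simplified form; traditional variants that
    # collapse to the same simplified character simply overwrite each other here.
    return {t2s.convert(tok.convert_ids_to_tokens(i)): p.item()
            for i, p in enumerate(probs)}

def fuse(text_simplified, text_traditional, alpha=0.5):
    p_s = masked_probs("ethanyt/guwenbert-base", text_simplified)   # assumed checkpoint
    p_t = masked_probs("SIKU-BERT/sikuroberta", text_traditional)   # assumed checkpoint
    fused = {c: alpha * p_s.get(c, 0.0) + (1 - alpha) * p_t.get(c, 0.0)
             for c in set(p_s) | set(p_t)}
    return max(fused, key=fused.get)

print(fuse("天地不仁,以[MASK]物为刍狗。", "天地不仁,以[MASK]物為芻狗。"))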
Model             Accuracy/%    MRR
GuwenBERT-base    69.06         0.0353
SikuRoBERTa       62.37         0.0314
Baseline Fine-Tuning Results on Transmitted Literature
Model             Accuracy/%    MRR
GuwenBERT-base    52.54         0.0288
SikuRoBERTa       55.04         0.0307
Baseline Fine-Tuning Results on Excavated Literature
Model                        Accuracy/%    MRR
GuwenBERT-base               69.06         0.0353
Mixture-of-experts system    70.14         0.0361
Results of the Ancient Book Completion Model Based on Mixture-of-Experts System
Model                                  Accuracy/%    MRR
GuwenBERT-base + post-training         55.04         0.0307
Simplified-traditional fusion model    57.13         0.0320
Results of the Simplified-Traditional Chinese Character Fusion Model
Model Output for [MASK] Position One
Model Output for [MASK] Position Two