1School of Information Management, Nanjing University, Nanjing 210023, China 2School of Information Management, Nanjing Agricultural University, Nanjing 210095, China
[Objective] This paper uses pre-trained language models to explore the linguistic style of canonical Chinese texts, aiming to better convey their cultural connotations. [Methods] We compared the performance of five pre-trained language models against the deep learning model Bi-LSTM-CRF on a cross-lingual corpus of ancient Chinese classics and their English translations. The selected works are The Analects of Confucius, the Tao Te Ching, the Book of Rites, the Shangshu (Book of Documents), and the Stratagems of the Warring States. We also examined canonical language style at the lexical level. [Results] The SikuBERT pre-trained language model achieved 91.29% precision, 91.76% recall, and a mean F1 score of 91.52% in recognizing canonical words. Compared with the original canonical words, the modern Chinese translations conveyed deeper semantic meaning, clearer ideographic referents, and more vivid and flexible word combinations. [Limitations] This study only examined selected pre-Qin classical texts and their translations; more research is needed to assess the models' performance in other domains. [Conclusions] The pre-trained language model SikuBERT can effectively analyze language style differences in cross-lingual canonical texts, which promotes the dissemination of classic Chinese works.
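The precision, recall, and F1 figures reported above are the standard exact-match metrics for entity recognition. As a minimal illustrative sketch (not code from the paper, with made-up spans and the hypothetical label "WORD"), they can be computed over gold and predicted entity spans as follows:

```python
def prf1(gold, pred):
    """Compute exact-match precision, recall, and F1 over (start, end, label) spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                      # spans matching exactly in position and label
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: two of three predicted spans match the gold annotation
gold = [(0, 2, "WORD"), (5, 7, "WORD"), (9, 11, "WORD")]
pred = [(0, 2, "WORD"), (5, 7, "WORD"), (12, 14, "WORD")]
p, r, f = prf1(gold, pred)   # each equals 2/3 here
```

In practice, sequence-labeling evaluations of this kind first decode BIO/BIES tag sequences into such spans before scoring.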
张逸勤, 邓三鸿, 胡昊天, 王东波. 预训练模型视角下的跨语言典籍风格计算研究*[J]. 数据分析与知识发现, 2023, 7(10): 50-62.
Zhang Yiqin, Deng Sanhong, Hu Haotian, Wang Dongbo. Identifying Styles of Cross-Language Classics with Pre-Trained Models. Data Analysis and Knowledge Discovery, 2023, 7(10): 50-62.
(Wu Xiaochun, Huang Xuanjing, Wu Lide. Authorship Identification Based on Semantic Analysis[J]. Journal of Chinese Information Processing, 2006, 20(6): 61-68.)
(Xiao Tianjiu, Liu Ying. A Stylistic Analysis of Jin Yong’s and Gu Long’s Fictions Based on Text Clustering and Classification[J]. Journal of Chinese Information Processing, 2015, 29(5): 167-177.)
(Wang Yi, Zhang Ruie, Han Mingli. “Huainanzi” Chinese-English Parallel Corpus: Construction and Application Prospects[J]. Journal of Anhui University of Science and Technology (Social Science), 2021, 23(1): 84-89.)
(Fan Min. On Translators’ Styles of Five Versions of The Analects: A Statistic Analysis Based on Corpus Studies[J]. Journal of Beijing University of Aeronautics and Astronautics (Social Sciences Edition), 2016, 29(6): 81-88.)
[7] 习近平. 坚定文化自信,建设社会主义文化强国[J]. 求是, 2019(12): 4-12.
(Xi Jinping. Building Cultural Confidence and Strength and Securing New Successes in Developing Socialist Culture[J]. Qiushi, 2019(12): 4-12.)
(Huang Shuiqing, Wang Dongbo. Review and Trend of Researches on Ancient Chinese Character Information Processing[J]. Library and Information Service, 2017, 61(12): 43-49.)
doi: 10.13266/j.issn.0252-3116.2017.12.005
(Feng Wenhe, Gao Zixiong, Zhang Wenjuan. Detecting the Global Range of Segments for Clause Recognition with Bert[J]. Applied Linguistics, 2022(2): 111-121.)
[10] Ye X, Dong M H. A Review on Different English Versions of an Ancient Classic of Chinese Medicine: Huang Di Nei Jing[J]. Journal of Integrative Medicine, 2017, 15(1): 11-18.
doi: 10.1016/S2095-4964(17)60310-8
(Chen Jing. The Characteristics of Dialect Words in Call to Arms and Its Influence on Lu Xun’s Literary Language Style[J]. Journal of Modern Chinese Literature, 2022(4): 16-22.)
(Gao Lianfang, Gonpo Tashi. On the Structure, Format and Language Style of the Ancient Tibetan Contracts Unearthed in Dunhuang and Western Regions[J]. Journal of Tibet University, 2020, 35(2): 97-107.)
(Ma Chuangxin, Liang Shehui, Chen Xiaohe. Study on the Correlation Coefficient and Characteristic Words of the Pre-Qin Schools[J]. Journal of Chinese Information Processing, 2019, 33(12): 129-134.)
(Zhang Xuran, Xing Yongle, Zhang Pan, et al. A Corpus-Based Comparative Study on the Translation Styles of Tao Te Ching[J]. Shanghai Journal of Translators, 2022(3): 33-38.)
(Cheng Qikai, Li Xin, Lu Wei. Research on the Evolution of Scientific Writing Style Between 1994 and 2012 Based on Emotional Vocabulary Perspective: A Retrospective Analysis[J]. Documentation, Information & Knowledge, 2016(6): 62-68.)
(Li Yingyu. A Corpus-Based Comparison Between L1 and L2 Translation: Focusing on the Translation of Emotional Words in Shaanxi Literary Works[J]. Journal of Xi’an International Studies University, 2020, 28(4): 81-86.)
(Xu Ming, Jiang Yue. A Stylometric Comparison of L1 Translations and L2 Translations of the True Story of Ah Q[J]. Foreign Languages Research, 2020, 37(3): 86-92.)
(Huang Shuiqing, Wang Xiaoguang, Xia Cuijuan, et al. Advancing the Work on Ancient Classics in the New Era and Accelerating Innovative and Intelligent Development[J]. Journal of Library and Information Science in Agriculture, 2022, 34(5): 4-20.)
doi: 10.13998/j.cnki.issn1002-1248.22-0359
(Yuan Yue, Wang Dongbo, Huang Shuiqing, et al. The Comparative Study of Different Tagging Sets on Entity Extraction of Classical Books[J]. Data Analysis and Knowledge Discovery, 2019, 3(3): 57-65.)
[21] Penn Treebank P.O.S. Tags[EB/OL]. [2022-11-11]. https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.
[22] Hochreiter S, Schmidhuber J. Long Short-Term Memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
doi: 10.1162/neco.1997.9.8.1735
pmid: 9377276
[23] Devlin J, Chang M W, Lee K, et al. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding[OL]. arXiv Preprint, arXiv: 1810.04805.
[24] Hao Y R, Dong L, Wei F R, et al. Visualizing and Understanding the Effectiveness of BERT[OL]. arXiv Preprint, arXiv: 1908.05620.
[25] Cui Y M, Che W X, Liu T, et al. Pre-Training with Whole Word Masking for Chinese BERT[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3504-3514.
doi: 10.1109/TASLP.2021.3124365
(Yan Tan. Guwen BERT: A Pre-Trained Language Model for Classical Chinese (Literary Chinese)[EB/OL]. [2022-07-28]. https://github.com/Ethan-yt/guwenbert.)
(Wang Dongbo, Liu Chang, Zhu Zihe, et al. Construction and Application of Pre-Trained Models of Siku Quanshu in Orientation to Digital Humanities[J]. Library Tribune, 2022, 42(6): 31-43.)
(Liu Jiangfeng, Feng Yutong, Wang Dongbo, et al. Research on SikuBERT-Enhanced Entity Recognition of Historical Records from the Perspective of Digital Humanities[J]. Library Tribune, 2022, 42(10): 61-72.)
(Liu Chang, Wang Dongbo, Hu Haotian, et al. Automatic Word Segmentation of Classic Books with External Features for Digital Humanities: A Case Study of SikuBERT Pre-Training Model[J]. Library Tribune, 2022, 42(6): 44-54.)
(Hu Haotian, Zhang Yiqin, Deng Sanhong, et al. Automatic Text Classification of “Zi” Part of Siku Quanshu from the Perspective of Digital Humanities: Based on SikuBERT and SikuRoBERTa Pre-Trained Models[J]. Library Tribune, 2022, 42(12): 138-148.)
(Geng Yundong, Zhang Yiqin, Liu Huan, et al. Automatic Part-of-Speech Tagging of Chinese Ancient Classics in the Context of Digital Humanities Research: A Case Study of SIKU-BERT Pre-training Model[J]. Library Tribune, 2022, 42(6): 55-63.)
(Duanmu San. Syllable Analysis and Syllable Inventories in English and Chinese[J]. Linguistic Sciences, 2021, 20(6): 561-588.)
doi: 10.7509/j.linsci.202102.033530
[33] Baker M. Corpora in Translation Studies: An Overview and Some Suggestions for Future Research[J]. Target, 1995, 7(2): 223-243.
doi: 10.1075/target
[34] Yu G X. Lexical Diversity in Writing and Speaking Task Performances[J]. Applied Linguistics, 2010, 31(2): 236-259.
doi: 10.1093/applin/amp024
[35] Biber D, Johansson S, Leech G, et al. Longman Grammar of Spoken and Written English[M]. London: Pearson Education Limited, 1999.
[36] Laviosa S. Core Patterns of Lexical Use in a Comparable Corpus of English Narrative Prose[J]. Meta, 1998, 43(4): 557-570.
(Deng Sanhong, Hu Haotian, Wang Hao, et al. Review of Automatic Processing of Ancient Chinese Character and Prospects for Its Development Trends in the New Era[J]. Scientific Information Research, 2021, 3(1): 1-20.)