Identifying Styles of Cross-Language Classics with Pre-Trained Models
Zhang Yiqin1, Deng Sanhong1, Hu Haotian1, Wang Dongbo2
1 School of Information Management, Nanjing University, Nanjing 210023, China
2 School of Information Management, Nanjing Agricultural University, Nanjing 210095, China
Abstract [Objective] This paper uses pre-trained language models to study the linguistic style of classical texts, with the aim of better conveying their connotations. [Methods] We compared the performance of five pre-trained language models with the deep learning model Bi-LSTM-CRF on a cross-lingual (ancient Chinese-English) corpus of classics. The selected works include The Analects of Confucius, The Tao Te Ching, The Book of Rites, The Shangshu, and Strategies of the Warring States (Zhanguo Ce). We also examined the language style of the classics at the lexical level. [Results] The SikuBERT pre-trained language model achieved 91.29% precision, 91.76% recall, and a harmonic-mean F1 of 91.52% in recognizing words in the classics. Compared with the original classical text, the modern Chinese translation showed deeper semantic meaning, clearer ideographic referents, and more vivid and flexible word combinations. [Limitations] This study covers only selected pre-Qin classical texts and their translations. More research is needed to examine the models' performance in other domains. [Conclusions] The pre-trained language model SikuBERT can effectively analyze differences in language style across cross-lingual classical texts, which helps promote the dissemination of classic Chinese works.
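The three scores reported for SikuBERT can be checked against each other, since F1 is the harmonic mean of precision and recall. The sketch below is only a consistency check on the published figures. It assumes the reported F1 was computed directly from the reported precision and recall (the abstract does not say whether any averaging across classes was involved), and the function name `f1_score` is illustrative rather than taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract for SikuBERT, converted from percentages.
precision = 0.9129
recall = 0.9176

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.2%}")  # prints "F1 = 91.52%"
```

Under that assumption the computed value agrees with the 91.52% stated in the abstract.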
Received: 01 September 2022
Published: 22 March 2023
Fund: National Social Science Fund of China (21&ZD331)
Corresponding Author: Wang Dongbo, ORCID: 0000-0002-9894-9550, E-mail: db.wang@njau.edu.cn