1School of Information Management, Nanjing University, Nanjing 210023, China 2School of Information Management, Nanjing Agricultural University, Nanjing 210095, China
[Objective] This paper uses pre-trained language models to explore the linguistic style of canonical Chinese texts, aiming to better convey their cultural connotations. [Methods] We compared the performance of five pre-trained language models against the deep learning model Bi-LSTM-CRF on a cross-lingual corpus of ancient Chinese classics and their English translations. The selected works are The Analects of Confucius, the Tao Te Ching, the Book of Rites, the Shangshu (Book of Documents), and the Stratagems of the Warring States. We also examined canonical language style at the lexical level. [Results] The SikuBERT pre-trained language model achieved 91.29% precision, 91.76% recall, and a mean F1 score of 91.52% in recognizing canonical words. Compared with the original canonical words, the modern Chinese translations conveyed deeper semantic meaning, clearer ideographic referents, and more vivid and flexible word combinations. [Limitations] This study only examined selected pre-Qin classical texts and their translations; more research is needed to assess the models' performance in other domains. [Conclusions] The pre-trained language model SikuBERT can effectively analyze language style differences in cross-lingual canonical texts, which promotes the dissemination of classic Chinese works.
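The precision, recall, and F1 figures reported above are the standard exact-match metrics for entity recognition. As a minimal illustrative sketch (not code from the paper, with made-up spans and the hypothetical label "WORD"), they can be computed over gold and predicted entity spans as follows:

```python
def prf1(gold, pred):
    """Compute exact-match precision, recall, and F1 over (start, end, label) spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                      # spans matching exactly in position and label
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: two of three predicted spans match the gold annotation
gold = [(0, 2, "WORD"), (5, 7, "WORD"), (9, 11, "WORD")]
pred = [(0, 2, "WORD"), (5, 7, "WORD"), (12, 14, "WORD")]
p, r, f = prf1(gold, pred)   # each equals 2/3 here
```

In practice, sequence-labeling evaluations of this kind first decode BIO/BIES tag sequences into such spans before scoring.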
张逸勤, 邓三鸿, 胡昊天, 王东波. 预训练模型视角下的跨语言典籍风格计算研究*[J]. 数据分析与知识发现, 2023, 7(10): 50-62.
Zhang Yiqin, Deng Sanhong, Hu Haotian, Wang Dongbo. Identifying Styles of Cross-Language Classics with Pre-Trained Models. Data Analysis and Knowledge Discovery, 2023, 7(10): 50-62.
(Wu Xiaochun, Huang Xuanjing, Wu Lide. Authorship Identification Based on Semantic Analysis[J]. Journal of Chinese Information Processing, 2006, 20(6): 61-68.)
(Xiao Tianjiu, Liu Ying. A Stylistic Analysis of Jin Yong’s and Gu Long’s Fictions Based on Text Clustering and Classification[J]. Journal of Chinese Information Processing, 2015, 29(5): 167-177.)
(Wang Yi, Zhang Ruie, Han Mingli. “Huainanzi” Chinese-English Parallel Corpus: Construction and Application Prospects[J]. Journal of Anhui University of Science and Technology (Social Science), 2021, 23(1): 84-89.)
(Fan Min. On Translators’ Styles of Five Versions of The Analects: A Statistic Analysis Based on Corpus Studies[J]. Journal of Beijing University of Aeronautics and Astronautics (Social Sciences Edition), 2016, 29(6): 81-88.)
[7] 习近平. 坚定文化自信,建设社会主义文化强国[J]. 求是, 2019(12): 4-12.
(Xi Jinping. Building Cultural Confidence and Strength and Securing New Successes in Developing Socialist Culture[J]. Qiushi, 2019(12): 4-12.)
(Huang Shuiqing, Wang Dongbo. Review and Trend of Researches on Ancient Chinese Character Information Processing[J]. Library and Information Service, 2017, 61(12): 43-49.)
doi: 10.13266/j.issn.0252-3116.2017.12.005
(Feng Wenhe, Gao Zixiong, Zhang Wenjuan. Detecting the Global Range of Segments for Clause Recognition with Bert[J]. Applied Linguistics, 2022(2): 111-121.)
[10] Ye X, Dong M H. A Review on Different English Versions of an Ancient Classic of Chinese Medicine: Huang Di Nei Jing[J]. Journal of Integrative Medicine, 2017, 15(1): 11-18.
doi: 10.1016/S2095-4964(17)60310-8
(Chen Jing. The Characteristics of Dialect Words in Call to Arms and Its Influence on Lu Xun’s Literary Language Style[J]. Journal of Modern Chinese Literature, 2022(4): 16-22.)
(Gao Lianfang, Gonpo Tashi. On the Structure, Format and Language Style of the Ancient Tibetan Contracts Unearthed in Dunhuang and Western Regions[J]. Journal of Tibet University, 2020, 35(2): 97-107.)
(Ma Chuangxin, Liang Shehui, Chen Xiaohe. Study on the Correlation Coefficient and Characteristic Words of the Pre-Qin Schools[J]. Journal of Chinese Information Processing, 2019, 33(12): 129-134.)
(Zhang Xuran, Xing Yongle, Zhang Pan, et al. A Corpus-Based Comparative Study on the Translation Styles of Tao Te Ching[J]. Shanghai Journal of Translators, 2022(3): 33-38.)
(Cheng Qikai, Li Xin, Lu Wei. Research on the Evolution of Scientific Writing Style Between 1994 and 2012 Based on Emotional Vocabulary Perspective: A Retrospective Analysis[J]. Documentation, Information & Knowledge, 2016(6): 62-68.)
(Li Yingyu. A Corpus-Based Comparison Between L1 and L2 Translation: Focusing on the Translation of Emotional Words in Shaanxi Literary Works[J]. Journal of Xi’an International Studies University, 2020, 28(4): 81-86.)
(Xu Ming, Jiang Yue. A Stylometric Comparison of L1 Translations and L2 Translations of the True Story of Ah Q[J]. Foreign Languages Research, 2020, 37(3): 86-92.)
(Huang Shuiqing, Wang Xiaoguang, Xia Cuijuan, et al. Advancing the Work on Ancient Classics in the New Era and Accelerating Innovative and Intelligent Development[J]. Journal of Library and Information Science in Agriculture, 2022, 34(5): 4-20.)
doi: 10.13998/j.cnki.issn1002-1248.22-0359
(Yuan Yue, Wang Dongbo, Huang Shuiqing, et al. The Comparative Study of Different Tagging Sets on Entity Extraction of Classical Books[J]. Data Analysis and Knowledge Discovery, 2019, 3(3): 57-65.)
[21] Penn Treebank P.O.S. Tags[EB/OL]. [2022-11-11]. https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.
[22] Hochreiter S, Schmidhuber J. Long Short-Term Memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
doi: 10.1162/neco.1997.9.8.1735
pmid: 9377276
[23] Devlin J, Chang M W, Lee K, et al. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding[OL]. arXiv Preprint, arXiv: 1810.04805.
[24] Hao Y R, Dong L, Wei F R, et al. Visualizing and Understanding the Effectiveness of BERT[OL]. arXiv Preprint, arXiv: 1908.05620.
[25] Cui Y M, Che W X, Liu T, et al. Pre-Training with Whole Word Masking for Chinese BERT[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3504-3514.
doi: 10.1109/TASLP.2021.3124365
(Yan Tan. Guwen BERT: A Pre-Trained Language Model for Classical Chinese (Literary Chinese)[EB/OL]. [2022-07-28]. https://github.com/Ethan-yt/guwenbert.)
(Wang Dongbo, Liu Chang, Zhu Zihe, et al. Construction and Application of Pre-Trained Models of Siku Quanshu in Orientation to Digital Humanities[J]. Library Tribune, 2022, 42(6): 31-43.)
(Liu Jiangfeng, Feng Yutong, Wang Dongbo, et al. Research on SikuBERT-Enhanced Entity Recognition of Historical Records from the Perspective of Digital Humanities[J]. Library Tribune, 2022, 42(10): 61-72.)
(Liu Chang, Wang Dongbo, Hu Haotian, et al. Automatic Word Segmentation of Classic Books with External Features for Digital Humanities: A Case Study of SikuBERT Pre-Training Model[J]. Library Tribune, 2022, 42(6): 44-54.)
(Hu Haotian, Zhang Yiqin, Deng Sanhong, et al. Automatic Text Classification of “Zi” Part of Siku Quanshu from the Perspective of Digital Humanities: Based on SikuBERT and SikuRoBERTa Pre-Trained Models[J]. Library Tribune, 2022, 42(12): 138-148.)
(Geng Yundong, Zhang Yiqin, Liu Huan, et al. Automatic Part-of-Speech Tagging of Chinese Ancient Classics in the Context of Digital Humanities Research: A Case Study of SIKU-BERT Pre-training Model[J]. Library Tribune, 2022, 42(6): 55-63.)
(Duanmu San. Syllable Analysis and Syllable Inventories in English and Chinese[J]. Linguistic Sciences, 2021, 20(6): 561-588.)
doi: 10.7509/j.linsci.202102.033530
[33] Baker M. Corpora in Translation Studies: An Overview and Some Suggestions for Future Research[J]. Target, 1995, 7(2): 223-243.
doi: 10.1075/target
[34] Yu G X. Lexical Diversity in Writing and Speaking Task Performances[J]. Applied Linguistics, 2010, 31(2): 236-259.
doi: 10.1093/applin/amp024
[35] Biber D, Johansson S, Leech G, et al. Longman Grammar of Spoken and Written English[M]. London: Pearson Education Limited, 1999.
[36] Laviosa S. Core Patterns of Lexical Use in a Comparable Corpus of English Narrative Prose[J]. Meta, 1998, 43(4): 557-570.
(Deng Sanhong, Hu Haotian, Wang Hao, et al. Review of Automatic Processing of Ancient Chinese Character and Prospects for Its Development Trends in the New Era[J]. Scientific Information Research, 2021, 3(1): 1-20.)