Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (10): 50-62    DOI: 10.11925/infotech.2096-3467.2022.0926
Identifying Styles of Cross-Language Classics with Pre-Trained Models
Zhang Yiqin1, Deng Sanhong1, Hu Haotian1, Wang Dongbo2
1School of Information Management, Nanjing University, Nanjing 210023, China
2School of Information Management, Nanjing Agricultural University, Nanjing 210095, China
Abstract  

[Objective] This paper uses pre-trained language models to study the linguistic style of classic texts and their translations, aiming to support a richer understanding of their connotations. [Methods] We compared five pre-trained language models with the deep learning model Bi-LSTM-CRF on a cross-lingual corpus of classics in ancient Chinese with modern Chinese and English translations. The selected works are The Analects of Confucius, the Tao Te Ching, The Book of Rites, the Shangshu, and the Zhan Guo Ce (Strategies of the Warring States). We also analyzed the language style of the classics at the lexical level. [Results] The SikuBERT pre-trained language model achieved 91.29% precision, 91.76% recall, and a harmonic mean F1 of 91.52% for part-of-speech tagging of the classics. Compared with the original ancient Chinese, the modern Chinese translation showed deeper semantic meaning, clearer ideographic referents, and more vivid and flexible word combinations. [Limitations] This study covers only selected pre-Qin classics and their translations. More research is needed to examine the models’ performance in other domains. [Conclusions] The pre-trained language model SikuBERT can effectively support the analysis of language style differences across cross-lingual classic texts, which helps promote the dissemination of classic Chinese works.

Keywords: Pre-Trained Language Models; Language Style; Digital Humanities; Canonical Texts
Received: 01 September 2022      Published: 22 March 2023
ZTFLH: G122; G254
Fund: National Social Science Fund of China (21&ZD331)
Corresponding Author: Wang Dongbo, ORCID: 0000-0002-9894-9550, E-mail: db.wang@njau.edu.cn

Cite this article:

Zhang Yiqin, Deng Sanhong, Hu Haotian, Wang Dongbo. Identifying Styles of Cross-Language Classics with Pre-Trained Models. Data Analysis and Knowledge Discovery, 2023, 7(10): 50-62.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.0926     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2023/V7/I10/50

ID | Ancient Chinese sentence | English sentence | Modern Chinese sentence
1 | 先進: 子曰:“孝哉閔子騫!人不間於其父母昆弟之言。” | Xian Jin: The Master said, “Filial indeed is Min Zi Qian! Other people say nothing of him different from the report of his parents and brothers.” | 孔子说:“闵子骞真是孝顺呀!人们对于他的父母兄弟称赞他的话没有异议。”
1.01 | 先進: 子曰:“孝哉閔子騫! | Xian Jin: The Master said, “Filial indeed is Min Zi Qian! | 孔子说:“闵子骞真是孝顺呀!
1.02 | 人不間於其父母昆弟之言。 | Other people say nothing of him different from the report of his parents and brothers. | 人们对于他的父母兄弟称赞他的话没有异议。”
Sample Sentence Alignment in Modern Chinese, English and Ancient Chinese
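Tag | Meaning | Tag | Meaning
a | Adjective | p | Preposition
c | Conjunction | q | Measure word
d | Adverb | r | Pronoun
f | Locative noun | s | Onomatopoeia
gv | Archaic verb | t | Time word
j | Fused word | u | Particle
m | Numeral | v | Verb
n | Common noun | w | Punctuation
nr | Personal name | y | Modal particle
ns | Place name | |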
Part-of-Speech Tagging Set for Ancient Chinese
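Tag | Meaning | Tag | Meaning
a | Adjective | ni | Organization name
b | Distinguishing word | nl | Location noun
c | Conjunction | ns | Place name
d | Adverb | nt | Time word
e | Interjection | nz | Other proper noun
g | Morpheme | o | Onomatopoeia
h | Prefix | p | Preposition
i | Idiom | q | Measure word
j | Abbreviation | r | Pronoun
k | Suffix | u | Particle
m | Numeral | v | Verb
n | Common noun | wp | Punctuation
nd | Direction noun | ws | Character string
nh | Personal name | x | Non-morpheme character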
Part-of-Speech Tagging Set for Modern Chinese
POS tag | Meaning | Example or note | POS tag | Meaning | Example or note
CC | Coordinating conjunction | and, but, or | PRP | Possessive pronoun |
CD | Cardinal number | | RB | Adverb |
DT | Determiner | | RBR | Adverb, comparative |
EX | Existential there | | RBS | Adverb, superlative |
FW | Foreign word | | RP | Particle |
IN | Preposition or subordinating conjunction | | SYM | Symbol | Should be used for mathematical, scientific or technical symbols
JJ | Adjective | | TO | to |
JJR | Adjective, comparative | | UH | Interjection | Uh, well, yes
JJS | Adjective, superlative | | VB | Verb, base form | Subsumes imperatives, infinitives and subjunctives
LS | List item marker | | VBD | Verb, past tense | Includes the conditional form of the verb to be
MD | Modal | could, might | VBG | Verb, gerund or present participle |
NN | Noun, singular or mass | | VBN | Verb, past participle |
Part-of-Speech Tagging Set for English
Data Annotation Example
BERT Model Architecture
Model Performance
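Tag | Meaning | Precision (%) | Recall (%) | F1 (%) | Tag count
Total | | 91.29 | 91.76 | 91.52 | 36,762
a | Adjective | 62.75 | 64.79 | 63.76 | 338
c | Conjunction | 93.46 | 94.21 | 93.83 | 1,243
d | Adverb | 91.55 | 91.82 | 91.69 | 2,006
f | Locative noun | 71.13 | 64.49 | 67.65 | 107
gv | Archaic verb | 100.00 | 33.33 | 50.00 | 3
j | Fused word | 82.61 | 80.85 | 81.72 | 47
m | Numeral | 91.09 | 91.29 | 91.19 | 448
n | Common noun | 83.77 | 83.81 | 83.79 | 7,288
nr | Personal name | 79.19 | 81.03 | 80.10 | 1,935
ns | Place name | 80.86 | 84.95 | 82.85 | 711
p | Preposition | 94.90 | 96.63 | 95.76 | 1,156
q | Measure word | 89.66 | 96.30 | 92.86 | 27
r | Pronoun | 96.02 | 95.93 | 95.97 | 2,162
s | Onomatopoeia | 0.00 | 0.00 | 0.00 | 0
t | Time noun | 91.25 | 90.33 | 90.79 | 300
u | Particle | 95.38 | 95.38 | 95.38 | 692
v | Verb | 91.30 | 92.01 | 91.65 | 8,808
w | Punctuation | 99.91 | 99.79 | 99.85 | 8,454
y | Modal particle | 96.49 | 98.07 | 97.27 | 1,037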
Training Results for Part-of-Speech Tagging in Classical Texts
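The F1 column in the table above can be checked against the precision and recall columns. The following is a minimal sketch, assuming F1 is the usual harmonic mean of precision and recall, as the abstract describes it; the function name `f1_score` and the choice of rows are illustrative and do not come from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in the same units)."""
    if precision + recall == 0:
        # Convention for tags with no correct predictions, such as tag "s"
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) pairs copied from the table above
rows = {
    "overall": (91.29, 91.76),
    "gv": (100.00, 33.33),
    "s": (0.00, 0.00),
}

for tag, (p, r) in rows.items():
    print(f"{tag}: F1 = {f1_score(p, r):.2f}%")
```

The `gv` row shows why per-tag scores on rare tags need caution: with only 3 instances, perfect precision still gives a low F1 because recall is one in three.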
Lexical Frequency Statistics Across Language Canons
Comparison of the Frequency Statistics of Chinese Words with Long Subwords
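Feature | Ancient Chinese | Modern Chinese | English translation
Tokens | 333,896 | 440,282 | 540,908
Types | 11,914 | 22,158 | 18,379
Content words | 211,112 | 225,820 | 163,738
Type/token ratio (TTR) | 0.0357 | 0.0503 | 0.0340
Smoothed type/token ratio (log TTR) | 73.79% | 77.00% | 74.38%
Lexical density | 63.23% | 51.29% | 30.27%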
Linguistic and Stylistic Features of the Canonical Texts
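The derived rows of the table above follow from its three count rows. The sketch below is an illustration under stated assumptions, not the authors' code: TTR is taken as types divided by tokens, lexical density as content words divided by tokens, and log TTR as log(types) / log(tokens). The page does not state the log TTR formula; this definition is an assumption, chosen because it reproduces the reported percentages. The names `corpora` and `style_features` are hypothetical.

```python
import math

# (tokens, types, content words) copied from the table above
corpora = {
    "ancient_chinese": (333_896, 11_914, 211_112),
    "modern_chinese": (440_282, 22_158, 225_820),
    "english": (540_908, 18_379, 163_738),
}


def style_features(tokens: int, types: int, content_words: int) -> dict:
    """Compute lexical style features from raw corpus counts."""
    return {
        # Type/token ratio: vocabulary size relative to corpus length
        "ttr": types / tokens,
        # Assumed definition of the smoothed ratio: log(types) / log(tokens)
        "log_ttr": math.log(types) / math.log(tokens),
        # Share of tokens that are content words
        "lexical_density": content_words / tokens,
    }


for name, counts in corpora.items():
    f = style_features(*counts)
    print(
        f"{name}: TTR={f['ttr']:.4f}, "
        f"log TTR={f['log_ttr']:.2%}, "
        f"lexical density={f['lexical_density']:.2%}"
    )
```

Raw TTR falls as a corpus grows, so the three texts, which differ in length, are better compared on the logarithmic ratio.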