Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (11): 135-144    DOI: 10.11925/infotech.2096-3467.2021.0311
Automatically Extracting Ancient Chinese Synonyms with Word Alignment——Case Study of Pre-Four-History Corpus
Ji Youshu, Wang Dongbo, Huang Shuiqing
College of Information Management, Nanjing Agricultural University, Nanjing 210095, China
Research Center for Humanities and Social Computing, Nanjing Agricultural University, Nanjing 210095, China
Abstract  

[Objective] This paper proposes an unsupervised method for automatically extracting synonyms from ancient Chinese texts, with the aim of supporting more effective algorithms in this field. [Methods] First, we constructed a sentence-aligned corpus of ancient and modern Chinese. Then, we applied a word alignment algorithm to the corpus. Finally, we extracted synonyms from the resulting word alignments. [Results] The method automatically generated 16,272 sets of synonyms, with an accuracy of 40.12% for the first result word. [Limitations] The method cannot be applied to corpora that lack sentence-level alignment between ancient and modern Chinese. Better word segmentation and alignment algorithms are needed to improve the extraction results. [Conclusions] The method can expand manually compiled thesauri and help move humanities computing research toward the semantic level.

Key words: Synonym; Word Alignment; Ancient Chinese; Digital Humanities
Received: 29 March 2021      Published: 23 December 2021
ZTFLH:  G353  
Fund: National Social Science Fund of China (15ZDB127); National Natural Science Foundation of China (71673143)
Corresponding Author: Huang Shuiqing, ORCID: 0000-0002-1646-9300, E-mail: sqhuang@njau.edu.cn

Cite this article:

Ji Youshu, Wang Dongbo, Huang Shuiqing. Automatically Extracting Ancient Chinese Synonyms with Word Alignment——Case Study of Pre-Four-History Corpus. Data Analysis and Knowledge Discovery, 2021, 5(11): 135-144.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0311     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I11/135

Original classical text | Modern Chinese translation | Source
高祖,沛丰邑中阳里人也,姓刘氏。 | 汉高祖,是沛县丰邑中阳里的人,姓刘。 | 汉书
母媪尝息大泽之陂,梦与神遇。 | 其母有一次在水塘堤坝上闭目小憩,梦与天神不期而遇。 | 汉书
是时雷电晦冥,父太公往视,则见交龙于上。 | 逢上雷电交加,天色阴暗,其父太公到塘坝接应其母,只见一条蛟龙蟠于母身。 | 汉书
已而有娠,遂产高祖。 | 随之就怀孕了,生下了汉高祖。 | 汉书
高祖为人,隆准而龙颜,美须髯,左股有七十二黑子。 | 高祖的容貌与性格特征:鼻梁高而眉骨隆起,胡须很美,左股上有七十二颗黑痣。 | 汉书
Example of Sentence-level Alignment Corpus of the Pre-Four-History Corpora
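Model parameter | Value
Attention dropout rate | 0.1
Bidirectional/unidirectional | bidi
Hidden-layer activation function | gelu
Hidden-layer dropout rate | 0.1
Hidden size | 768
Initializer range | 0.02
Intermediate layer size | 3,072
Maximum position embeddings | 512
Number of attention heads | 12
Number of hidden layers | 12
Pooler layer size | 768
Number of pooler attention heads | 12
Number of pooler layers | 3
Pooler size per head | 128
Pooler type | first_token_transform
Type vocabulary size | 2
Vocabulary size | 21,128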
Parameters of BERT Word Segmentation Model
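As a rough illustration, the parameters in the table above can be written as a configuration dictionary. The key names below are an assumption: they follow the naming used in publicly released BERT configuration files and are not given in the source, which lists only the descriptive labels and values.

```python
# Sketch only: key names are assumed (BERT-style config naming);
# the values are copied from the parameter table.
BERT_CONFIG = {
    "attention_probs_dropout_prob": 0.1,
    "directionality": "bidi",
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "pooler_fc_size": 768,
    "pooler_num_attention_heads": 12,
    "pooler_num_fc_layers": 3,
    "pooler_size_per_head": 128,
    "pooler_type": "first_token_transform",
    "type_vocab_size": 2,
    "vocab_size": 21128,
}

# Derived quantity: the per-head dimension in the encoder's attention layers.
head_dim = BERT_CONFIG["hidden_size"] // BERT_CONFIG["num_attention_heads"]
```

With these values each encoder attention head works on a 64-dimensional slice of the 768-dimensional hidden state, and the intermediate layer is four times the hidden size.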
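Field | Description | Example | Record count
id | Index | 4 | 65,820
raw | Original classical text | 已而有娠,遂产高祖。 | 65,820
raw_seg | Word segmentation (segments delimited by "/") | 已而/有/娠/,/遂/产/高祖/。/ | 65,820
translation | Modern Chinese translation | 随之就怀孕了,生下了汉高祖。 | 65,820
translation_seg | Word segmentation of the modern Chinese translation | 随之/就/怀孕/了/,/生下/了/汉高祖/。/ | 65,820
source | Source | 汉书 | 65,820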
Experimental Corpus Used in Ancient Chinese Synonym Extraction
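A minimal sketch of how one record of this corpus might be represented and sanity-checked. The field names come from the table; the `Record` class and the `tokens` and `is_consistent` helpers are hypothetical names introduced here for illustration, and the check itself (that the segments concatenate back to the unsegmented text) is an assumption about a useful validation step, not something the source describes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """One corpus row; field names follow the corpus table."""
    id: int
    raw: str
    raw_seg: str
    translation: str
    translation_seg: str
    source: str


def tokens(seg: str) -> List[str]:
    """Split a '/'-delimited segmentation, dropping the empty trailing piece."""
    return [t for t in seg.split("/") if t]


def is_consistent(rec: Record) -> bool:
    """Check that each segmentation concatenates back to its unsegmented text."""
    return ("".join(tokens(rec.raw_seg)) == rec.raw
            and "".join(tokens(rec.translation_seg)) == rec.translation)


# The example row shown in the table.
example = Record(
    id=4,
    raw="已而有娠,遂产高祖。",
    raw_seg="已而/有/娠/,/遂/产/高祖/。/",
    translation="随之就怀孕了,生下了汉高祖。",
    translation_seg="随之/就/怀孕/了/,/生下/了/汉高祖/。/",
    source="汉书",
)
```

For the example row, `tokens(example.raw_seg)` gives 8 tokens and `tokens(example.translation_seg)` gives 9, punctuation included.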
Ancient Chinese segmentation | Modern Chinese segmentation | Alignment result
父/大/惊/。/ | 他的/父亲/大为/惊讶/。/ | [0 1 2], [3], [4], [5]
Examples of Word Alignment Results
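The step from a word alignment to aligned word pairs can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes alignments are given as zero-based `i-j` links (ancient index, modern index), a format commonly produced by alignment tools, whereas the indexing convention of the bracketed result shown in the table is not specified here. The link string in the example is hand-written for illustration and is not tool output; `split_segments` and `aligned_pairs` are hypothetical helper names.

```python
from typing import List, Tuple


def split_segments(seg: str) -> List[str]:
    """Split a '/'-delimited segmentation string into tokens."""
    return [tok for tok in seg.split("/") if tok]


def aligned_pairs(ancient_seg: str, modern_seg: str,
                  links: str) -> List[Tuple[str, str]]:
    """Turn 'i-j' links (assumed zero-based) into (ancient, modern) word pairs."""
    ancient = split_segments(ancient_seg)
    modern = split_segments(modern_seg)
    pairs = []
    for link in links.split():
        i, j = (int(x) for x in link.split("-"))
        pairs.append((ancient[i], modern[j]))
    return pairs


# Hand-written, hypothetical links for the sentence pair in the table.
pairs = aligned_pairs("父/大/惊/。/", "他的/父亲/大为/惊讶/。/",
                      "0-0 0-1 1-2 2-3 3-4")
```

Here one ancient word (父) links to two modern words (他的, 父亲), which is the kind of one-to-many alignment the table's first bracket group suggests.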
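Parameter | Value | Note
hs | 1 | Training technique: 1 uses Hierarchical Softmax, which in essence turns an N-class problem into about log(N) binary classifications; 0 uses negative sampling, which in essence predicts over a subset of all classes
size | 128 | Dimensionality of the word vectors
min_count | 0 | Words with a frequency below this value are ignored
window | 10 | Window: the maximum distance between the current word and the predicted word within a sentence
sg | 1 | Training mode: 1 uses Skip-gram; 0 uses CBOW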
Word2Vec Experimental Parameters
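The parameter names in the table match those of the gensim `Word2Vec` class, so a training call might look like the sketch below. Using gensim is an assumption, since the source does not name the library. In gensim 4 and later the `size` argument is called `vector_size`, which the hypothetical helper `gensim_kwargs` accounts for. gensim's own `negative` default is left untouched here, which is also an assumption.

```python
# Values copied from the parameter table.
W2V_PARAMS = {"hs": 1, "size": 128, "min_count": 0, "window": 10, "sg": 1}


def gensim_kwargs(params: dict, gensim_major: int) -> dict:
    """Adapt the table's parameter names to the installed gensim major version."""
    kwargs = dict(params)
    if gensim_major >= 4:
        # gensim 4 renamed `size` to `vector_size`.
        kwargs["vector_size"] = kwargs.pop("size")
    return kwargs


def train(sentences):
    """Train a model on tokenized sentences (lists of words). Requires gensim."""
    import gensim
    from gensim.models import Word2Vec

    major = int(gensim.__version__.split(".")[0])
    return Word2Vec(sentences=sentences, **gensim_kwargs(W2V_PARAMS, major))
```

A call such as `train([["已而", "有", "娠"], ["遂", "产", "高祖"]])` would train on two segmented sentences; with `min_count` set to 0, no word is dropped for being rare.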
Task word | Rank 1 | Rank 2 | Rank 3
Task word 1 (e.g., 高祖) | (高帝) | (春) | (高)
Task word 2
Task word 3
Extraction accuracy
Examples of Evaluation Methods
Rank | Synonyms | Non-synonyms | Blank | Accuracy
Rank 1 | 0 | 73 | 0 | 0
Rank 2 | 0 | 32 | 41 | 0
Rank 3 | 0 | 19 | 54 | 0
Overall | 0 | 124 | 95 | 0
Synonym Extraction Results Based on Word2Vec
Target word  Result words
戊辰 丙申 丁丑 丁亥 乙卯
数术 通知 杜林
癸巳 己丑 辛亥 丙申 庚子
阴谋 仇家 公孙庆 蒙毅 李太后
郡民 天性 点
吴景 杨秋 樊能 铚人 贺齐
三危 温水 潭 涧 汶 汩 广饶
Synonym Extraction Results Based on Word2Vec
Term | Frequency | Category
convolutional neural network | 992 | Standard
CNN | 5,901 | Abbreviation
CNNs | 502 | Abbreviation
convolutional-neural-networks | 469 | Synonym
convnet | 376 | Abbreviation
convolution-neural-network | 185 | Synonym
convnets | 99 | Abbreviation
convolution neural networks | 33 | Synonym
Examples of English Synonyms
No.  Task word  Result word 1  Result word 2  Result word 3  Result word 4
1 硃英
2 反虏 黄巾 青州
3 乡里 骸骨 田野 田里
4 奇节 伟平 画策
5 侵冤 冤民 称冤
6 桂阳 江湖
7 荧惑 不祥 东南 铭谣
8 蕃本
9 安阳 绍行
10 惨怛 齐人
11 家教 屏居
12
13
14 屯候 屯吏
15 宣罢
Example of Synonym Extraction Results Based on Word Alignment
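Result word | Synonyms | Non-synonyms | Total results | Blank | Accuracy
Result word 1 | 134 | 200 | 334 | 0 | 40.12%
Result word 2 | 81 | 235 | 316 | 18 | 25.63%
Result word 3 | 52 | 163 | 215 | 119 | 24.19%
Other result words (4~15) | 139 | 606 | 745 | 3,263 | 18.66%
Overall | 406 | 1,204 | 1,610 | 3,400 | 27.15%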
Evaluation of Synonym Extraction Results Based on Word Alignment
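The accuracy figures in this table can be reproduced with a short calculation, taking accuracy as synonyms divided by total non-blank results. The sketch below is an observation drawn from the numbers, not a description from the source: the overall figure of 27.15% appears to equal the unweighted mean of the four row accuracies, whereas pooling all results (406 of 1,610) gives about 25.22%.

```python
# (label, synonyms, non_synonyms), copied from the evaluation table.
ROWS = [
    ("Result word 1", 134, 200),
    ("Result word 2", 81, 235),
    ("Result word 3", 52, 163),
    ("Other result words (4-15)", 139, 606),
]


def accuracy(synonyms: int, non_synonyms: int) -> float:
    """Share of non-blank results that were judged to be synonyms."""
    return synonyms / (synonyms + non_synonyms)


# Per-row accuracy, in percent, rounded to two decimals.
per_row = [round(100 * accuracy(s, n), 2) for _, s, n in ROWS]

# Unweighted mean of the four row accuracies.
macro = round(100 * sum(accuracy(s, n) for _, s, n in ROWS) / len(ROWS), 2)

# Pooled accuracy over all non-blank results.
pooled = round(
    100 * accuracy(sum(s for _, s, _ in ROWS), sum(n for _, _, n in ROWS)), 2
)
```

The two aggregates differ because the last row holds the most results (745) but has the lowest accuracy, so it pulls the pooled figure down more than it pulls the unweighted mean.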
Aligned ancient Chinese word  Frequency  Aligned ancient Chinese word  Frequency  Aligned ancient Chinese word  Frequency
高祖 39 2 高皇 1
高帝 11 2 1
4 汉高祖 2 高寝 1
4 高庙 2 哙从 1
3 汉高帝 2 汉高 1
3 2 三年 1
3 1 1
Ancient Chinese Words Aligned with “Emperor Gaozu of Han” and Their Frequency
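One way to turn such alignment counts into a ranked candidate list is sketched below. It uses only the rows of the table whose word is legible, and the frequency threshold `min_count` is a hypothetical parameter introduced for illustration; the source does not state whether any threshold was applied.

```python
from collections import Counter
from typing import List

# Ancient words aligned with 汉高祖 and their frequencies (legible rows only).
aligned_counts = Counter({
    "高祖": 39, "高帝": 11, "汉高祖": 2, "高庙": 2, "汉高帝": 2,
    "高皇": 1, "高寝": 1, "哙从": 1, "汉高": 1, "三年": 1,
})


def candidates(counts: Counter, min_count: int = 1) -> List[str]:
    """Rank aligned words by frequency, keeping those seen at least min_count times."""
    return [word for word, freq in counts.most_common() if freq >= min_count]


# Hypothetical threshold of 2: drops every word aligned only once.
frequent = candidates(aligned_counts, min_count=2)
```

A threshold trades recall for precision: with `min_count=2` the single-occurrence alignments such as 三年 and 哙从 are dropped, but so are 高皇 and 汉高.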