Automatically Extracting Ancient Chinese Synonyms with Word Alignment: Case Study of Pre-Four-History Corpus
Ji Youshu, Wang Dongbo, Huang Shuiqing
College of Information Management, Nanjing Agricultural University, Nanjing 210095, China; Research Center for Humanities and Social Computing, Nanjing Agricultural University, Nanjing 210095, China
[Objective] This paper proposes an unsupervised method for automatically extracting synonyms from ancient Chinese texts, with the aim of developing more effective algorithms for this task. [Methods] First, we constructed a sentence-level aligned corpus of ancient and modern Chinese. Then, we applied a word alignment algorithm to the corpus. Finally, we extracted synonyms from the resulting word alignments. [Results] The proposed method automatically extracted ancient Chinese synonyms, generating 16,272 synonym sets with an accuracy of 40.12%. [Limitations] The method cannot be applied to corpora that lack sentence-level alignment between ancient and modern Chinese. Further work on word segmentation and alignment algorithms is needed to improve the extraction results. [Conclusions] The proposed method can expand manually compiled thesauri and help extend humanities computing research to the semantic level.
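The final step of the pipeline, turning word alignments into synonym sets, can be illustrated with a minimal sketch. The abstract does not spell out the extraction rule, so the sketch assumes one plausible reading: ancient words that align to the same modern word are treated as candidate synonyms. The function name `extract_synonym_sets`, the `min_count` threshold, and the toy alignment pairs are all hypothetical and are not taken from the paper; in practice the pairs would come from a word aligner run over the segmented, sentence-aligned corpus.

```python
from collections import defaultdict


def extract_synonym_sets(aligned_pairs, min_count=2):
    """Group ancient words by the modern word they align to.

    aligned_pairs: iterable of (ancient_word, modern_word) tuples, one per
        alignment link produced by a word aligner (assumed input format).
    min_count: hypothetical frequency threshold; links seen fewer times
        than this are discarded as likely alignment noise.
    Returns a dict mapping each modern pivot word to a sorted list of at
    least two ancient words that align to it.
    """
    # counts[modern][ancient] = number of times this link was observed
    counts = defaultdict(lambda: defaultdict(int))
    for ancient, modern in aligned_pairs:
        counts[modern][ancient] += 1

    synonym_sets = {}
    for modern, ancient_counts in counts.items():
        kept = sorted(w for w, c in ancient_counts.items() if c >= min_count)
        # A single ancient word is not a synonym set, so require two or more.
        if len(kept) >= 2:
            synonym_sets[modern] = kept
    return synonym_sets


# Toy alignment links, invented for illustration only.
toy_pairs = [
    ("曰", "说"), ("曰", "说"),
    ("云", "说"), ("云", "说"),
    ("言", "说"),               # seen once, dropped by min_count=2
    ("卒", "死"), ("卒", "死"),  # only one ancient word, so no set is formed
]

result = extract_synonym_sets(toy_pairs, min_count=2)
print(result)
```

On the toy data, 曰 and 云 are grouped under the modern pivot 说, while 言 is filtered out as a low-frequency link. A threshold of this kind trades recall for precision, which matters given that the reported accuracy depends on the quality of segmentation and alignment.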
Ji Youshu, Wang Dongbo, Huang Shuiqing. Automatically Extracting Ancient Chinese Synonyms with Word Alignment: Case Study of Pre-Four-History Corpus[J]. Data Analysis and Knowledge Discovery, 2021, 5(11): 135-144.