Automatically Extracting Ancient Chinese Synonyms with Word Alignment: Case Study of Pre-Four-History Corpus
Ji Youshu, Wang Dongbo, Huang Shuiqing
College of Information Management, Nanjing Agricultural University, Nanjing 210095, China
Research Center for Humanities and Social Computing, Nanjing Agricultural University, Nanjing 210095, China
Abstract [Objective] This paper proposes an unsupervised method for automatically extracting synonyms from ancient Chinese texts, with the aim of developing more effective algorithms for this task. [Methods] First, we constructed a sentence-aligned parallel corpus of ancient and modern Chinese. Then, we applied a word alignment algorithm to the corpus. Finally, we extracted synonyms from the resulting word alignments. [Results] The method extracted ancient Chinese synonyms automatically, generating 16,272 synonym sets with an accuracy of 40.12%. [Limitations] The method cannot be applied to corpora that lack sentence-level alignment between ancient and modern Chinese. Further work on word segmentation and word alignment algorithms is needed to improve extraction quality. [Conclusions] The proposed method can be used to expand manually compiled thesauri and helps move humanities computing research toward the semantic level.
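
The final step of the pipeline, extracting synonyms from word alignments, can be illustrated with a minimal sketch. The abstract does not state the grouping rule, so the code below assumes one plausible rule: ancient Chinese words that are repeatedly aligned to the same modern Chinese word are treated as candidate synonyms. The function name `extract_synonym_sets`, the `min_count` threshold, and the toy corpus are hypothetical, and the alignment links are assumed to have been produced beforehand by a word aligner.

```python
from collections import Counter, defaultdict


def extract_synonym_sets(aligned_sentences, min_count=2):
    """Group ancient words that align to the same modern word.

    Assumption (not from the paper): a shared modern-Chinese pivot word
    is taken as evidence that two ancient-Chinese words are synonyms.
    Each item is (ancient_tokens, modern_tokens, links), where links is
    a list of (ancient_index, modern_index) pairs from a word aligner.
    """
    # Count how often each (ancient, modern) word pair is aligned.
    pair_counts = Counter()
    for ancient_tokens, modern_tokens, links in aligned_sentences:
        for i, j in links:
            pair_counts[(ancient_tokens[i], modern_tokens[j])] += 1

    # Per modern pivot word, keep ancient words aligned often enough.
    by_pivot = defaultdict(set)
    for (ancient, modern), count in pair_counts.items():
        if count >= min_count:
            by_pivot[modern].add(ancient)

    # A candidate synonym set needs at least two distinct ancient words.
    return {m: sorted(a) for m, a in by_pivot.items() if len(a) >= 2}


# Hypothetical toy corpus: segmented sentence pairs with alignment links.
toy_corpus = [
    (["吾", "往"], ["我", "去"], [(0, 0), (1, 1)]),
    (["余", "往"], ["我", "去"], [(0, 0), (1, 1)]),
    (["吾", "行"], ["我", "走"], [(0, 0), (1, 1)]),
    (["余", "行"], ["我", "走"], [(0, 0), (1, 1)]),
]
synonyms = extract_synonym_sets(toy_corpus, min_count=2)
```

In this toy corpus only the pivot 我 collects two distinct ancient words (吾 and 余), so it yields the single candidate set. Raising `min_count` trades coverage for precision, which matters when the upstream segmentation and alignment are noisy.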

Received: 29 March 2021
Published: 23 December 2021

Fund: National Social Science Fund of China (15ZDB127); National Natural Science Foundation of China (71673143)
Corresponding Author: Huang Shuiqing, ORCID: 0000-0002-1646-9300
E-mail: sqhuang@njau.edu.cn