1. School of Information Management, Nanjing University, Nanjing 210023, China
2. College of Information Management, Nanjing Agricultural University, Nanjing 210095, China
3. College of Literature, Nanjing Normal University, Nanjing 210097, China
[Objective] This paper constructs a deep learning model for automatic word segmentation and part-of-speech (POS) tagging of ancient literature, aiming to provide an automatic annotation solution for Chinese books from multiple fields. [Methods] First, we used 25 Pre-Qin works as the training corpus, covering the Confucian classics, history, philosophy, and miscellaneous works. Second, we built a unified BERT-based model for word segmentation and POS tagging without adding new features. Third, we tested the model on The Records of the Grand Historian, which was not included in the training corpus. Finally, we analyzed the four basic elements of historical events (names, locations, time, and actions) with statistics and case studies. [Results] The proposed model's F-scores for word segmentation and POS tagging reached 95.98% and 88.97%, respectively. [Limitations] The confusion heat map of POS tagging shows that mislabeling, caused by the imbalanced distribution of parts of speech, the similar syntactic features of some part-of-speech instances, and words belonging to multiple categories, requires further research. [Conclusions] Our deep learning model is stable and applicable to word segmentation and POS tagging of Pre-Qin literature.
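As an illustration of what a unified segmentation and POS tagging task can look like, the sketch below encodes each character with a joint label (word-boundary marker plus POS tag) and scores predictions with a span-based F-score. The B/M/E/S boundary scheme, the tag names `nr`, `n`, and `v`, the example sentence, and the helper names `encode`, `decode`, and `f_score` are assumptions made for illustration; the abstract does not state the label scheme or the evaluation script actually used.

```python
from typing import List, Set, Tuple

Word = Tuple[str, str]  # (word, POS tag); tag names here are hypothetical


def encode(words: List[Word]) -> List[Tuple[str, str]]:
    """Turn (word, pos) pairs into per-character joint labels (assumed B/M/E/S scheme)."""
    out: List[Tuple[str, str]] = []
    for word, pos in words:
        if len(word) == 1:
            out.append((word, f"S-{pos}"))  # single-character word
        else:
            out.append((word[0], f"B-{pos}"))  # first character
            out.extend((ch, f"M-{pos}") for ch in word[1:-1])  # middle characters
            out.append((word[-1], f"E-{pos}"))  # last character
    return out


def decode(chars: List[Tuple[str, str]]) -> List[Word]:
    """Rebuild (word, pos) pairs from per-character joint labels."""
    words: List[Word] = []
    buf = ""
    pos = ""
    for ch, label in chars:
        marker, pos = label.split("-", 1)
        buf += ch
        if marker in ("S", "E"):  # a word ends on S or E
            words.append((buf, pos))
            buf = ""
    if buf:  # flush an unterminated word using the last tag seen
        words.append((buf, pos))
    return words


def spans(words: List[Word], with_pos: bool) -> Set[tuple]:
    """Character-offset spans of each word, optionally paired with its POS tag."""
    result: Set[tuple] = set()
    start = 0
    for word, pos in words:
        end = start + len(word)
        result.add((start, end, pos) if with_pos else (start, end))
        start = end
    return result


def f_score(gold: List[Word], pred: List[Word], with_pos: bool) -> float:
    """Harmonic mean of precision and recall over matching spans."""
    g, p = spans(gold, with_pos), spans(pred, with_pos)
    correct = len(g & p)
    if correct == 0:
        return 0.0
    precision = correct / len(p)
    recall = correct / len(g)
    return 2 * precision * recall / (precision + recall)


gold = [("孔子", "nr"), ("曰", "v")]
pred = [("孔子", "n"), ("曰", "v")]  # correct boundaries, one wrong tag

print(encode(gold))
print(f_score(gold, pred, with_pos=False))  # segmentation only
print(f_score(gold, pred, with_pos=True))   # segmentation and POS together
```

Because a word counts as correct for POS tagging only when both its boundaries and its tag match, the joint score can never exceed the segmentation score, which is consistent with the gap between the two F-scores reported above.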
Zhang Qi, Jiang Chuan, Ji Youshu, Feng Minxuan, Li Bin, Xu Chao, Liu Liu. Unified Model for Word Segmentation and POS Tagging of Multi-Domain Pre-Qin Literature[J]. Data Analysis and Knowledge Discovery, 2021, 5(3): 2-11.