Deep Learning Based Automatic Sentence Segmentation and Punctuation Model for Massive Classical Chinese Literature
Wang Qian1, Wang Dongbo1,2, Li Bin3, Xu Chao3
1 College of Information Management, Nanjing Agricultural University, Nanjing 210095, China; 2 Research Center for Correlation of Domain Knowledge, Nanjing Agricultural University, Nanjing 210095, China; 3 College of Literature, Nanjing Normal University, Nanjing 210097, China
[Objective] This study builds an annotation system on a cascaded deep learning model to automatically segment and punctuate ancient Chinese literature. [Methods] First, we constructed a massive corpus of Chinese classics from “Siku Quanshu”. Second, we framed automatic sentence segmentation and punctuation as sequence labeling tasks and adopted a cascaded design. Third, we applied a BERT-LSTM-CRF model to unpunctuated text to obtain the automatic sentence segmentation results. Fourth, we fed these results into a multi-feature LSTM-CRF model, which produced the final punctuation marks after iterative learning. [Results] We built an application platform with the trained models and the Django framework. The average F values of the proposed method for automatic sentence segmentation and punctuation were 86.41% and 90.84%, respectively. [Limitations] The punctuation system needs further refinement. [Conclusions] The proposed model and platform significantly improve sentence segmentation and punctuation of ancient Chinese literature, which benefits digital humanities and social science projects in China.
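To make the sequence labeling framing concrete, the following minimal sketch shows one way punctuated text can be turned into per-character labels, restored from them, and scored with an F value. The tag scheme (each character labeled with the mark that follows it, or "O"), the punctuation set, and the function names `to_labels`, `restore`, and `f_measure` are illustrative assumptions; the abstract does not specify the tag set or evaluation script the authors used, and no BERT-LSTM-CRF model is trained here.

```python
# Hypothetical punctuation inventory; the paper's actual set is not given in the abstract.
PUNCT = set("，。？！；：、")


def to_labels(punctuated):
    """Split punctuated text into characters and per-character labels.

    Each character is labeled with the punctuation mark that follows it,
    or "O" when no mark follows (an assumed tag scheme).
    """
    chars, labels = [], []
    for ch in punctuated:
        if ch in PUNCT:
            if labels:  # attach the mark to the preceding character
                labels[-1] = ch
        else:
            chars.append(ch)
            labels.append("O")
    return chars, labels


def restore(chars, labels):
    """Rebuild punctuated text from characters and their labels."""
    return "".join(c + (l if l != "O" else "") for c, l in zip(chars, labels))


def f_measure(gold, pred):
    """F value over non-"O" positions; a hit requires the exact same mark."""
    tp = sum(1 for g, p in zip(gold, pred) if g != "O" and g == p)
    n_pred = sum(1 for p in pred if p != "O")
    n_gold = sum(1 for g in gold if g != "O")
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    chars, gold = to_labels("學而時習之，不亦說乎？")
    print(list(zip(chars, gold)))
    print(restore(chars, gold))
    # A prediction that misses the comma but gets the final mark right.
    pred = ["O"] * 8 + ["？"]
    print(round(f_measure(gold, pred), 4))
```

Under this scheme, sentence segmentation is the simpler case in which every mark collapses into a single "break" label, and the punctuation stage then chooses a specific mark for each predicted break, which is one way to read the cascaded design described above.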
Wang Qian, Wang Dongbo, Li Bin, Xu Chao. Deep Learning Based Automatic Sentence Segmentation and Punctuation Model for Massive Classical Chinese Literature. Data Analysis and Knowledge Discovery, 2021, 5(3): 25-34.