Deep Learning Based Automatic Sentence Segmentation and Punctuation Model for Massive Classical Chinese Literature
Wang Qian1, Wang Dongbo1,2, Li Bin3, Xu Chao3
1College of Information Management, Nanjing Agricultural University, Nanjing 210095, China
2Research Center for Correlation of Domain Knowledge, Nanjing Agricultural University, Nanjing 210095, China
3College of Literature, Nanjing Normal University, Nanjing 210097, China
Abstract [Objective] This study builds an annotation system based on a cascaded deep learning model to automatically segment and punctuate sentences in ancient Chinese literature. [Methods] First, we constructed a massive corpus from the books of the “Siku Quanshu”. Second, we framed automatic sentence segmentation and punctuation as sequence labeling problems and adopted a cascaded design. Third, we used a BERT-LSTM-CRF model to segment unpunctuated text into sentences. Fourth, we fed these segmentation results to a multi-feature LSTM-CRF model, which produced the final punctuation marks after iterative learning. [Results] We built an application platform with the trained model and the Django framework. The average F values of the proposed method for automatic sentence segmentation and punctuation were 86.41% and 90.84%, respectively. [Limitations] The punctuation system needs to be refined. [Conclusions] The proposed model and platform significantly improve sentence segmentation and punctuation of ancient Chinese literature, which benefits digital humanities and social science projects in China.
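To illustrate what treating segmentation and punctuation as sequence labeling can look like, the following minimal sketch converts a punctuated sentence into per-character labels and back. The label scheme (the punctuation mark itself, or "O" for none), the binary "B" break label for the first cascade stage, and the function names are all assumptions made for illustration; the abstract does not specify the tag set or features the authors used, and no model is trained here.

```python
# Hypothetical label scheme for illustration only; the abstract does not
# state the tag set used by the BERT-LSTM-CRF or LSTM-CRF models.
PUNCT = set(",。?!;:、")  # full-width marks treated as labels


def to_labels(punctuated):
    """Strip punctuation and label each character with the mark that
    follows it, or "O" when no mark follows."""
    chars, labels = [], []
    for ch in punctuated:
        if ch in PUNCT:
            if chars:            # ignore punctuation before any character
                labels[-1] = ch  # attach the mark to the preceding character
        else:
            chars.append(ch)
            labels.append("O")
    return chars, labels


def to_break_labels(labels):
    """Stage-1 view (segmentation): collapse mark types to break / no break."""
    return ["B" if lab != "O" else "O" for lab in labels]


def restore(chars, labels, break_mark=None):
    """Rebuild text from characters and labels. If break_mark is given,
    every labeled position gets that mark instead of its own label."""
    out = []
    for ch, lab in zip(chars, labels):
        out.append(ch)
        if lab != "O":
            out.append(break_mark or lab)
    return "".join(out)


if __name__ == "__main__":
    text = "学而时习之,不亦说乎?"
    chars, labels = to_labels(text)
    breaks = to_break_labels(labels)
    print(restore(chars, breaks, "/"))  # segmentation only: 学而时习之/不亦说乎/
    print(restore(chars, labels))       # full punctuation:  学而时习之,不亦说乎?
```

In a cascade of this kind, a first tagger would predict the break labels from the unpunctuated characters, and a second tagger would then choose a specific mark for each predicted break.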
Received: 11 September 2019
Published: 12 April 2021
Fund: National Natural Science Foundation of China (71673143); National Social Science Fund of China (15ZDB127)
Corresponding Author: Wang Dongbo, E-mail: db.wang@njau.edu.cn