Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (3): 25-34    DOI: 10.11925/infotech.2096-3467.2019.1033
Deep Learning Based Automatic Sentence Segmentation and Punctuation Model for Massive Classical Chinese Literature
Wang Qian1, Wang Dongbo1,2, Li Bin3, Xu Chao3
1College of Information Management, Nanjing Agricultural University, Nanjing 210095, China
2Research Center for Correlation of Domain Knowledge, Nanjing Agricultural University, Nanjing 210095, China
3College of Literature, Nanjing Normal University, Nanjing 210097, China
Abstract  

[Objective] This study builds an annotation system based on cascaded deep learning models to automatically segment and punctuate sentences in ancient Chinese literature. [Methods] First, we built a large corpus of Chinese books from the “Siku Quanshu”. Second, we framed automatic sentence segmentation and punctuation as sequence labeling problems and adopted a cascaded design. Third, we used the BERT-LSTM-CRF model to segment unpunctuated text into sentences. Fourth, we fed these segmentation results into a multi-feature LSTM-CRF model, which produced the final punctuation marks after iterative learning. [Results] We built an application platform with the trained models and the Django framework. The average F values of the proposed method for automatic sentence segmentation and punctuation were 86.41% and 90.84%, respectively. [Limitations] The punctuation system needs further refinement. [Conclusions] The proposed model and platform significantly improve sentence segmentation and punctuation of ancient Chinese literature, which benefits digital humanities and social science projects in China.

Key words: Automatic Sentence Segmentation; Digital Humanities; BERT; Ancient Chinese
Received: 11 September 2019      Published: 12 April 2021
CLC Number: G255
Fund: National Natural Science Foundation of China (71673143); National Social Science Fund of China (15ZDB127)
Corresponding Authors: Wang Dongbo     E-mail: db.wang@njau.edu.cn

Cite this article:

Wang Qian, Wang Dongbo, Li Bin, Xu Chao. Deep Learning Based Automatic Sentence Segmentation and Punctuation Model for Massive Classical Chinese Literature. Data Analysis and Knowledge Discovery, 2021, 5(3): 25-34.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.1033     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I3/25

Contextual Word Embedding by BERT
LSTM Model
Flow Chart of the Experiment
Category                    Training set    Validation set    Test set     Total
Jing (Confucian Classics)   4,572,819       575,947           572,576      5,721,342
Shi (Histories)             31,446,274      3,920,904         3,930,548    39,297,726
Zi (Masters)                19,434,858      2,426,688         2,428,228    24,289,774
Ji (Collections)            26,795,226      3,343,104         3,344,001    33,482,331
Data Size of Each Category of Ancient Books
Schematic Diagram of BERT-LSTM-CRF
LSTM-CRF Model with Multiple Features
5-tag label sequence: S B I I I J E B J E
Example of Labeling System of BERT-LSTM-CRF
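The 5-tag example can be read as one tag per character, grouped by sentence segment. The sketch below is a minimal illustration of that reading, not the authors' code. It assumes S marks a single-character segment, B the first character, E the last, J the character before the last, and I everything in between. The source shows no two-character segment, so tagging it as B E is an assumption, and the function names are hypothetical.

```python
from typing import List, Tuple


def tag_segment(segment: str) -> List[str]:
    """Return one 5-tag label (S, B, I, J, E) per character of a segment."""
    n = len(segment)
    if n == 0:
        return []
    if n == 1:
        return ["S"]
    if n == 2:
        # Assumption: the source gives no two-character example.
        return ["B", "E"]
    # First character B, last E, second-to-last J, the rest I.
    return ["B"] + ["I"] * (n - 3) + ["J", "E"]


def tag_text(segments: List[str]) -> List[Tuple[str, str]]:
    """Pair every character with its tag across consecutive segments."""
    pairs: List[Tuple[str, str]] = []
    for segment in segments:
        pairs.extend(zip(segment, tag_segment(segment)))
    return pairs


if __name__ == "__main__":
    # Illustrative input: an Analects line already split into segments.
    for char, tag in tag_text(["子曰", "學而時習之", "不亦說乎"]):
        print(char, tag)
```

Segments of length 1, 6, and 3 yield S, then B I I I J E, then B J E, which matches the tag sequence in the example.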
Feature   Label
B         O
I         O
J         O
E         D
B         O
I         O
J         O
E         D
B         O
I         O
J         O
E         D
B         O
I         O
I         O
I         O
I         O
J         O
E         J
Example of Labeling System of LSTM-CRF with Multiple Features
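In this second stage, each character carries a punctuation label in addition to its segmentation tag. The sketch below shows one way predicted labels could be turned back into punctuated text. It is an illustration under stated assumptions, not the platform's implementation: the codes D, J, W, F, G, and M follow the punctuation results table, O is assumed to mean "no mark after this character", and the paired book title mark (S) is left out because the source does not say how its opening and closing halves are encoded. The names `PUNCT` and `apply_labels` are hypothetical.

```python
from typing import Sequence

# Label code -> punctuation mark, following the codes in the results table.
PUNCT = {
    "D": "，",  # comma
    "J": "。",  # period
    "W": "？",  # question mark
    "F": "；",  # semicolon
    "G": "！",  # exclamation mark
    "M": "：",  # colon
}


def apply_labels(chars: Sequence[str], labels: Sequence[str]) -> str:
    """Insert the mark named by each label after its character."""
    if len(chars) != len(labels):
        raise ValueError("chars and labels must have the same length")
    out = []
    for char, label in zip(chars, labels):
        out.append(char)
        # "O" and any code not in PUNCT (such as S) add nothing.
        out.append(PUNCT.get(label, ""))
    return "".join(out)


if __name__ == "__main__":
    text = "學而時習之不亦說乎"
    labels = list("OOOODOOOW")
    print(apply_labels(text, labels))
```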
The Evaluation of Automatic Sentence Segmentation and Punctuation Model
The Effects of Pre-training Models on Sentence Segmentation
Metric   S (book title mark)   W (question mark)   F (semicolon)   G (exclamation mark)   D (comma)   M (colon)   J (period)   Total
P        92.98                 83.39               63.40           70.81                  90.73       97.14       91.55        91.05
R        91.45                 87.22               37.90           38.76                  94.80       95.63       87.88        91.08
F        92.21                 85.26               47.44           50.10                  92.72       96.38       89.42        91.07
The Results of Automatic Punctuation for Confucian Classics (%)
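The F row is consistent with the usual harmonic mean of precision (P) and recall (R). The source page does not state the formula, so the sketch below assumes the standard balanced F-measure, F = 2PR / (P + R), with inputs in percent. The function name is hypothetical.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # P and R for the book title mark and question mark columns.
    print(round(f_measure(92.98, 91.45), 2))
    print(round(f_measure(83.39, 87.22), 2))
```

Under this assumption the book title mark and question mark columns reproduce the reported 92.21 and 85.26. Not every column matches exactly: the period column gives about 89.68 rather than the reported 89.42, so the published figures may have been aggregated differently.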
Home Page of Automatic Punctuating Platform for Classical Chinese
Page for Segmenting and Punctuating Sentences Automatically
Page for Segmenting and Punctuating Texts Automatically