Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (3): 2-11    DOI: 10.11925/infotech.2096-3467.2019.1032
Unified Model for Word Segmentation and POS Tagging of Multi-Domain Pre-Qin Literature
Zhang Qi1, Jiang Chuan2, Ji Youshu2, Feng Minxuan3, Li Bin3, Xu Chao3, Liu Liu2
1School of Information Management, Nanjing University, Nanjing 210023, China
2College of Information Management, Nanjing Agricultural University, Nanjing 210095, China
3College of Literature, Nanjing Normal University, Nanjing 210097, China
Abstract  

[Objective] This paper constructs a deep learning model for automatic word segmentation and part-of-speech (POS) tagging of ancient literature, aiming to provide an automatic annotation solution for ancient Chinese books from multiple domains. [Methods] First, we used 25 Pre-Qin works as the training corpus, covering the Confucian classics, history, philosophy, and miscellaneous works. Second, we constructed a unified BERT-based model for word segmentation and POS tagging without adding new features. Third, we tested the model on The Records of the Grand Historian, which was not included in the training corpus. Finally, we analyzed the four basic components of historical events (names, locations, time, actions) with statistics and case studies. [Results] The proposed model's F-scores for word segmentation and POS tagging reached 95.98% and 88.97%, respectively. [Limitations] The confusion heat map of POS tagging shows that mislabeling caused by the imbalanced distribution of parts of speech, the similar syntactic features of some parts of speech, and multi-category words needs further research. [Conclusions] Our deep learning model is stable and applicable to word segmentation and POS tagging of Pre-Qin literature.

Key words: Digital Humanities; Pre-Qin Literature; Ancient Books Intelligent Processing; Word Segmentation; Part-of-Speech Tagging; Deep Learning
Received: 11 September 2019      Published: 12 April 2021
ZTFLH: G353; TP393
Fund: National Natural Science Foundation of China (71673143); National Social Science Fund of China (15ZDB127)
Corresponding Authors: Liu Liu     E-mail: liuliu@njau.edu.cn

Cite this article:

Zhang Qi,Jiang Chuan,Ji Youshu,Feng Minxuan,Li Bin,Xu Chao,Liu Liu. Unified Model for Word Segmentation and POS Tagging of Multi-Domain Pre-Qin Literature. Data Analysis and Knowledge Discovery, 2021, 5(3): 2-11.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.1032     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I3/2

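First-level category | Second-level category | Works included
Classics (经部) | Yijing (易经) | Zhouyi (《周易》)
Classics (经部) | Chunqiu (春秋) | Guliang Zhuan (《谷梁传》), Lüshi Chunqiu (《吕氏春秋》), Yanzi Chunqiu (《晏子春秋》), Zuo Zhuan (《左传》), Gongyang Zhuan (《公羊传》)
Classics (经部) | Shangshu (尚书) | Shangshu (《尚书》)
Classics (经部) | Shijing (诗经) | Shijing (《诗经》)
Classics (经部) |  | Yili (《仪礼》), Zhouli (《周礼》), Xiaojing (《孝经》), Liji (《礼记》)
Classics (经部) | Lunyu (论语) | Lunyu (《论语》)
History (史部) |  | Guoyu (《国语》)
Philosophy (子部) |  | Mengzi (《孟子》), Guanzi (《管子》), Wuzi (《吴子》), Xunzi (《荀子》), Laozi (《老子》), Mozi (《墨子》), Han Feizi (《韩非子》), Sunzi Bingfa (《孙子兵法》), Shangjun Shu (《商君书》), Zhuangzi (《庄子》)
Miscellaneous works (集部) |  | Chuci (《楚辞》)
Basic Information of the Corpus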
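No. | Tag | Part of speech | Count
1 | n | common noun | 327,688
2 | nr | personal name | 42,437
3 | ns | place name | 21,106
4 | t | temporal noun | 9,597
5 | v | verb | 347,331
6 | gv | archaic verb (古代动词) | 1,736
7 | a | adjective | 40,123
8 | m | numeral | 23,508
9 | q | measure word | 2,170
10 | r | pronoun | 104,145
11 | p | preposition | 45,213
12 | c | conjunction | 65,216
13 | u | auxiliary particle | 37,129
14 | d | adverb | 86,784
15 | y | modal particle | 52,983
16 | s | onomatopoeia | 156
17 | j | fusion word (兼词) | 1,507
18 | w | punctuation | 350,832
19 | i | affix | 365
POS Tags and Word Counts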
A Sample of Corpus Preprocessing
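As a rough illustration of how joint word segmentation and POS tagging can be cast as a single character-level sequence labeling task, the sketch below converts (word, tag) pairs into per-character labels and back. The B/I/E/S prefix scheme, the function names, and the example sentence are assumptions made for illustration; the source does not state which label scheme the authors used. The POS tags (nr, v, w) come from the tag table above.

```python
from typing import List, Tuple

Word = Tuple[str, str]  # (word, POS tag), e.g. ("孔子", "nr")


def words_to_char_labels(words: List[Word]) -> List[Tuple[str, str]]:
    """Turn (word, tag) pairs into per-character joint labels such as 'B-nr'.

    Assumed scheme: S = single-character word, B = first character,
    I = inner character, E = last character of a multi-character word.
    """
    labeled = []
    for word, pos in words:
        if len(word) == 1:
            labeled.append((word, f"S-{pos}"))
            continue
        for i, ch in enumerate(word):
            if i == 0:
                prefix = "B"
            elif i == len(word) - 1:
                prefix = "E"
            else:
                prefix = "I"
            labeled.append((ch, f"{prefix}-{pos}"))
    return labeled


def char_labels_to_words(labeled: List[Tuple[str, str]]) -> List[Word]:
    """Rebuild (word, tag) pairs from per-character joint labels.

    A word is closed at every S or E label; characters left over after
    the last S or E label (a malformed sequence) are dropped.
    """
    words = []
    buffer = ""
    for ch, label in labeled:
        prefix, pos = label.split("-", 1)
        buffer += ch
        if prefix in ("S", "E"):
            words.append((buffer, pos))
            buffer = ""
    return words


# Hypothetical example sentence, not taken from the corpus.
sentence = [("孔子", "nr"), ("曰", "v"), ("：", "w")]
labels = words_to_char_labels(sentence)
print(labels)
print(char_labels_to_words(labels) == sentence)
```

With labels of this kind, one tagger predicts word boundaries and parts of speech at the same time, which is the sense in which a model is "unified".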
Model | Parameters
GRU | Word vector dimension = 200; hidden layers = 512; learning rate = 0.001; batch size = 64; dropout = 1; gradient clipping = 5
BERT | Learning rate = 2.0e-5; max sequence length = 256; batch size = 64
Hyperparameter Settings
Fold (of 10) | GRU+CRF Precision | GRU+CRF Recall | GRU+CRF F-score | BERT Precision | BERT Recall | BERT F-score
1 | 94.97 | 95.76 | 95.36 | 95.76 | 96.06 | 95.91
2 | 95.20 | 95.31 | 95.26 | 95.78 | 96.14 | 95.96
3 | 97.48 | 97.07 | 97.27 | 95.81 | 96.23 | 96.02
4 | 95.32 | 95.18 | 95.25 | 95.74 | 96.15 | 95.94
5 | 95.18 | 95.75 | 95.47 | 95.80 | 96.17 | 95.98
6 | 96.85 | 96.70 | 96.77 | 95.86 | 96.18 | 96.02
7 | 95.21 | 95.42 | 95.32 | 95.89 | 96.21 | 96.05
8 | 95.00 | 95.74 | 95.37 | 95.83 | 96.22 | 96.02
9 | 95.21 | 95.76 | 95.48 | 95.84 | 96.27 | 96.05
10 | 95.04 | 95.69 | 95.36 | 95.69 | 96.03 | 95.86
Mean | 95.55 | 95.84 | 95.69 | 95.80 | 96.17 | 95.98
Word Segmentation Results (%)
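The F-scores in the word segmentation table are consistent with the usual harmonic mean of precision and recall, F = 2PR / (P + R). The short sketch below recomputes one fold and the BERT mean from the tabulated values. The function and variable names are illustrative, and the assumption that the mean row is a simple average over the ten folds is ours, not stated in the source.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


# BERT F-scores per fold, copied from the word segmentation table.
BERT_F_BY_FOLD = [95.91, 95.96, 96.02, 95.94, 95.98,
                  96.02, 96.05, 96.02, 96.05, 95.86]

# Fold 1, BERT: precision 95.76, recall 96.06.
print(round(f_score(95.76, 96.06), 2))  # 95.91, matching the table
# Average over the ten folds.
print(round(mean(BERT_F_BY_FOLD), 2))   # 95.98, matching the mean row
```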
Fold (of 10) | GRU+CRF Accuracy | GRU+CRF Precision | GRU+CRF Recall | GRU+CRF F-score | BERT Accuracy | BERT Precision | BERT Recall | BERT F-score
1 | 87.48 | 88.14 | 88.81 | 88.48 | 88.86 | 89.41 | 89.79 | 89.60
2 | 87.51 | 88.49 | 88.53 | 88.51 | 88.93 | 89.41 | 89.83 | 89.62
3 | 87.22 | 88.52 | 88.08 | 88.30 | 88.91 | 89.42 | 89.87 | 89.65
4 | 87.37 | 88.27 | 88.50 | 88.39 | 88.95 | 89.45 | 89.89 | 89.67
5 | 87.63 | 88.45 | 88.77 | 88.61 | 89.07 | 89.57 | 90.00 | 89.78
6 | 87.26 | 88.37 | 88.18 | 88.28 | 88.90 | 89.36 | 89.77 | 89.57
7 | 87.54 | 88.45 | 88.58 | 88.52 | 89.07 | 89.63 | 89.99 | 89.81
8 | 87.55 | 88.26 | 88.90 | 88.58 | 89.05 | 89.60 | 89.99 | 89.79
9 | 88.31 | 89.04 | 89.29 | 89.17 | 89.11 | 89.57 | 90.04 | 89.81
10 | 87.39 | 88.06 | 88.79 | 88.42 | 88.83 | 89.37 | 89.74 | 89.56
Mean | 87.53 | 88.41 | 88.64 | 88.53 | 88.97 | 89.48 | 89.89 | 89.69
Part-of-Speech Tagging Results (%)
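The source does not describe how precision and recall were computed. A common word-level convention, sketched below under that assumption, counts a predicted word as correct for segmentation when its character span matches a gold word, and as correct for POS tagging when both the span and the tag match. All names and the example sentences are hypothetical.

```python
from typing import List, Set, Tuple

Word = Tuple[str, str]  # (word, POS tag)


def to_spans(words: List[Word], with_pos: bool) -> Set[tuple]:
    """Map a segmented sentence to a set of character spans.

    Each span is (start, end) or, when with_pos is True, (start, end, tag).
    """
    spans = set()
    start = 0
    for text, pos in words:
        end = start + len(text)
        spans.add((start, end, pos) if with_pos else (start, end))
        start = end
    return spans


def prf(gold: List[Word], pred: List[Word], with_pos: bool):
    """Return (precision, recall, F-score) over matching spans."""
    gold_spans = to_spans(gold, with_pos)
    pred_spans = to_spans(pred, with_pos)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


# Hypothetical example: segmentation is right, one tag is wrong.
gold = [("孔子", "nr"), ("曰", "v")]
pred = [("孔子", "n"), ("曰", "v")]
print(prf(gold, pred, with_pos=False))  # segmentation only
print(prf(gold, pred, with_pos=True))   # segmentation plus POS tag
```

Under this convention a POS score can never exceed the segmentation score on the same output, because a tag only counts when the word boundary is also right. That is in line with the gap between the two result tables.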
POS tag | GRU+CRF Precision | GRU+CRF Recall | GRU+CRF F-score | BERT Precision | BERT Recall | BERT F-score
n | 79.08 | 80.36 | 79.71 | 81.35 | 82.47 | 81.91
nr | 86.56 | 84.68 | 85.60 | 87.73 | 87.34 | 87.53
ns | 81.01 | 81.20 | 81.07 | 82.79 | 83.97 | 83.36
t | 85.68 | 86.75 | 86.16 | 87.13 | 87.17 | 87.14
v | 86.98 | 88.16 | 87.56 | 88.75 | 89.15 | 88.95
gv | 34.16 | 6.07 | 9.62 | 37.98 | 17.85 | 23.80
a | 58.47 | 51.48 | 54.58 | 60.71 | 59.44 | 60.02
c | 91.03 | 89.58 | 90.30 | 90.56 | 91.45 | 91.00
d | 87.67 | 86.63 | 87.12 | 88.60 | 88.33 | 88.46
i | 37.28 | 23.44 | 27.46 | 41.93 | 26.65 | 31.90
j | 71.03 | 66.24 | 66.91 | 73.61 | 68.32 | 70.84
m | 86.49 | 89.90 | 88.14 | 88.28 | 91.09 | 89.66
p | 87.97 | 89.27 | 88.60 | 89.59 | 88.97 | 89.28
q | 72.02 | 69.28 | 69.75 | 73.08 | 75.12 | 74.01
r | 91.03 | 94.06 | 92.52 | 91.87 | 94.31 | 93.07
s | 53.51 | 26.30 | 33.79 | 41.01 | 25.44 | 30.11
u | 93.51 | 88.97 | 91.17 | 93.65 | 90.18 | 91.88
w | 99.96 | 99.98 | 99.97 | 99.91 | 99.95 | 99.93
y | 95.66 | 95.94 | 95.79 | 95.99 | 96.46 | 96.22
Tagging Results of Each Part of Speech (%)
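Model | Segmentation Precision | Segmentation Recall | Segmentation F-score | POS Tagging Precision | POS Tagging Recall | POS Tagging F-score
Shi Min et al. [6] | 94.23 | 94.91 | 94.57 | 89.35 | 89.95 | 89.65
This paper | 95.84 | 96.27 | 96.05 | 89.57 | 90.04 | 89.81
Comparison with the Existing Unified Word Segmentation and POS Tagging Model (%)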
Confusion Heat Map of POS Tagging
Unified Word Segmentation and POS Tagging Platform for Pre-Qin Literature
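No. | Person: word | Person: frequency | Place: word | Place: frequency | Verb: word | Verb: frequency | Time: word | Time: frequency
1 | 孔子 | 394 |  | 2,190 |  | 6,553 | 元年 | 1,227
2 | 漢王 | 221 |  | 1,554 |  | 5,266 |  | 1,073
3 | 趙王 | 208 |  | 1,453 |  | 2,696 | 五年 | 221
4 | 秦王 | 180 |  | 915 | 使 | 1,871 | 六年 | 221
5 | 齊王 | 179 |  | 826 |  | 1,750 | 四年 | 221
6 | 高祖 | 166 |  | 748 |  | 1,734 | 二年 | 221
7 |  | 159 |  | 731 |  | 1,680 | 四月 | 221
8 | 太史公 | 149 |  | 524 |  | 1,323 |  | 221
9 | 單于 | 146 |  | 488 |  | 1,294 | 三年 | 221
10 | 張儀 | 145 |  | 458 |  | 1,266 | 三月 | 221
Four Basic Components of Events in The Records of the Grand Historian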
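A frequency table of this kind can be derived from tagged output by counting words under the four tags that correspond to event components (nr, ns, v, t). The sketch below assumes a hypothetical "word/tag" text format with whitespace between tokens; the actual output format of the authors' platform is not given in the source, and the sample lines are invented for illustration.

```python
from collections import Counter, defaultdict

# Tags for the four event components, taken from the POS tag table.
EVENT_TAGS = {"nr": "person", "ns": "place", "v": "verb", "t": "time"}


def parse_tagged_line(line: str):
    """Split a line of 'word/tag' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition("/")
        if sep and word:  # skip tokens that carry no tag
            pairs.append((word, tag))
    return pairs


def top_words_by_tag(lines, k: int = 10):
    """Return the k most frequent words for each event-related tag."""
    counters = defaultdict(Counter)
    for line in lines:
        for word, tag in parse_tagged_line(line):
            if tag in EVENT_TAGS:
                counters[tag][word] += 1
    return {tag: counters[tag].most_common(k) for tag in EVENT_TAGS}


# Invented sample lines in the assumed format.
sample = [
    "孔子/nr 曰/v ：/w",
    "元年/t 孔子/nr 至/v 齊/ns",
]
for tag, ranking in top_words_by_tag(sample).items():
    print(EVENT_TAGS[tag], ranking)
```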
A Sample of the Relation of Person Entity and Official Position Entity
[1] Huang Shuiqing, Wang Dongbo. Review and Trend of Researches on Ancient Chinese Character Information Processing[J]. Library and Information Service, 2017, 61(12): 43-49. (in Chinese)
[2] Wang Xiaoguang. The Emergence, Development and Frontier of "Digital Humanities"[C]// Proceedings of the Ministry of Education Humanities and Social Sciences Research Method Innovation Forum 2009. Wuhan: Wuhan University Press, 2010. (in Chinese)
[3] Li Wenzhong. Corpus Markup and Annotation: China English Corpus as an Example[J]. Foreign Language Teaching and Research, 2012(3): 336-345. (in Chinese)
[4] Liu Jinteng, Song Yan, Xia Fei. The Construction of a Segmented and Part-of-Speech Tagged Archaic Chinese Corpus: A Case Study on Huainanzi[J]. Journal of Chinese Information Processing, 2013, 27(6): 6-15. (in Chinese)
[5] Wang Dongbo, Huang Shuiqing, He Lin. Researches of Automatic Part-of-Speech Tagging for Pre-Qin Literature Based on Multi-Feature Knowledge[J]. Library and Information Service, 2017, 61(12): 64-70. (in Chinese)
[6] Shi Min, Li Bin, Chen Xiaohe. CRF Based Research on a Unified Approach to Word Segmentation and POS Tagging for Pre-Qin Chinese[J]. Journal of Chinese Information Processing, 2010, 24(2): 39-45. (in Chinese)
[7] Xi Xuefeng, Zhou Guodong. A Survey on Deep Learning for Natural Language Processing[J]. Journal of Automatica Sinica, 2016, 42(10): 1445-1465. (in Chinese)
[8] Collobert R, Weston J, Bottou L, et al. Natural Language Processing (Almost) from Scratch[J]. The Journal of Machine Learning Research, 2011, 12: 2493-2537.
[9] Zheng X, Chen H, Xu T. Deep Learning for Chinese Word Segmentation and POS Tagging[C]// Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 2013: 647-657.
[10] Bosco A, Laganà D, Musmanno R, et al. Modeling and Solving the Mixed Capacitated General Routing Problem[J]. Optimization Letters, 2013, 7(7): 1451-1469.
[11] Huang Z, Xu W, Yu K. Bidirectional LSTM-CRF Models for Sequence Tagging[OL]. arXiv Preprint, arXiv: 1508.01991, 2015.
[12] Plank B, Søgaard A, Goldberg Y. Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss[OL]. arXiv Preprint, arXiv: 1604.05529, 2016.
[13] Yang J, Teng Z Y, Zhang M S, et al. Combining Discrete and Neural Features for Sequence Labeling[C]// Proceedings of International Conference on Intelligent Text Processing and Computational Linguistics. Springer, Cham, 2016: 140-154.
[14] Guo J J, Wang S P, Yu C H, et al. Chinese POS Tagging Method Based on Bi-GRU+CRF Hybrid Model[C]// Proceedings of International Conference on Intelligent Networking and Collaborative Systems. Springer, Cham, 2018: 453-460.
[15] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[OL]. arXiv Preprint, arXiv: 1810.04805, 2018.
[16] Chen Xiaohe. Pre-Qin Literatures Information Processing[M]. Beijing: Beijing World Publishing Corporation, 2013. (in Chinese)
[17] Yuan Yue, Wang Dongbo, Huang Shuiqing, et al. The Comparative Study of Different Tagging Sets on Entity Extraction of Classical Books[J]. Data Analysis and Knowledge Discovery, 2019, 3(3): 57-65. (in Chinese)
[18] Duan Guochao, Ding Deke. Figures Dictionary of Historical Records[M]. Beijing: The Commercial Press, 2017. (in Chinese)