[Objective] This paper examines how word segmentation consistency and corpus type affect the automatic word segmentation of Middle Ancient Chinese (MAC). It aims to improve the accuracy and efficiency of automatic word segmentation, a basic step in processing ancient Chinese, using a CRFs model. [Methods] First, we refined the segmentation principles for MAC historical records, Buddhist scriptures, and novels. Then, we combined the CRFs model with a dictionary to reduce the inconsistencies introduced by manual segmentation. Finally, we added two features to the CRFs model (character classification and dictionary information) and identified the best-performing word segmentation template through comparative experiments. [Results] The F-score exceeded 99% in the closed test and ranged from 89% to 95% in the open test. [Limitations] Segmentation consistency improved for two-character words; the segmentation of words with more than three characters needs further study. [Conclusions] The proposed method effectively improves the accuracy of automatic word segmentation for MAC corpora.
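The two added features named in the Methods can be pictured as extra columns attached to every character before CRF training. The following is a minimal sketch under stated assumptions, not the authors' implementation: the class inventory in `char_class`, the forward-maximum-matching encoding in `dict_tags`, the BMES label set, and the toy lexicon are all hypothetical, since the abstract does not specify them.

```python
# Illustrative sketch only: the character classes, the dictionary encoding,
# and the toy lexicon below are assumptions, not the paper's actual settings.

def char_class(ch):
    """Map a character to a coarse class (hypothetical class inventory)."""
    if ch.isdigit() or ch in "一二三四五六七八九十百千萬":
        return "NUM"
    if ch in "，。、；：？！「」『』":
        return "PUNC"
    if "\u4e00" <= ch <= "\u9fff":
        return "HAN"
    return "OTHER"


def dict_tags(sentence, lexicon, max_len=4):
    """Tag each character by forward maximum matching against a lexicon.

    B/M/E mark the begin/middle/end of a matched multi-character entry;
    S marks a character not covered by any multi-character match.
    """
    tags = []
    i = 0
    while i < len(sentence):
        match_len = 1
        # Try the longest candidate first, down to two characters.
        for n in range(min(max_len, len(sentence) - i), 1, -1):
            if sentence[i:i + n] in lexicon:
                match_len = n
                break
        if match_len == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (match_len - 2) + ["E"])
        i += match_len
    return tags


def bmes_labels(words):
    """Convert a gold segmentation (list of words) to per-character BMES labels."""
    labels = []
    for w in words:
        if len(w) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return labels


def training_rows(words, lexicon):
    """Build (char, char_class, dict_tag, label) rows, one per character."""
    sentence = "".join(words)
    return list(zip(sentence,
                    map(char_class, sentence),
                    dict_tags(sentence, lexicon),
                    bmes_labels(words)))


if __name__ == "__main__":
    toy_lexicon = {"天下", "太守"}   # hypothetical dictionary entries
    gold = ["天下", "大", "亂"]      # hypothetical gold segmentation
    for row in training_rows(gold, toy_lexicon):
        print("\t".join(row))
```

Each row holds the character, its class, its dictionary tag, and the gold label in the last column. This is the column layout that CRF toolkits such as CRF++ read, so a feature template could then refer to the class and dictionary columns alongside the character itself.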
Wang Xiaoyu, Li Bin. Automatically Segmenting Middle Ancient Chinese Words with CRFs[J]. Data Analysis and Knowledge Discovery, 2017, 1(5): 62-70. (Original Chinese title: 基于CRFs和词典信息的中古汉语自动分词, "Automatic Word Segmentation of Middle Ancient Chinese Based on CRFs and Dictionary Information".)