Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (5): 62-70    DOI: 10.11925/infotech.2096-3467.2017.05.08
Automatically Segmenting Middle Ancient Chinese Words with CRFs
Xiaoyu Wang, Bin Li
School of Chinese Language and Literature, Nanjing Normal University, Nanjing 210097, China
[Objective] This paper explores how segmentation consistency and corpus type affect automatic word segmentation of Middle Ancient Chinese (MAC). It aims to improve the accuracy and efficiency of automatic word segmentation, a basic step in processing ancient Chinese, using the CRFs model. [Methods] First, we refined the segmentation principles for MAC historical records, Buddhist scriptures and novels. Then, we combined the CRFs model with a dictionary to reduce the inconsistencies introduced by manual segmentation. Finally, we added two features to the CRFs model (character classification and dictionary information) and identified the best word segmentation template through comparison experiments. [Results] The F-score exceeded 99% in the closed test and ranged from 89% to 95% in the open test. [Limitations] Segmentation consistency improved for two-character words; more study is needed on the segmentation of words with more than three characters. [Conclusions] The proposed method effectively improves the accuracy of automatic word segmentation for Middle Ancient Chinese corpora.
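
The two added features can be illustrated with a minimal Python sketch. It is not the authors' implementation: the B/M/E/S label scheme, the character classes, the forward-maximum-matching dictionary lookup, the sample lexicon and all function names are assumptions made for illustration, and the feature templates actually compared in the paper are not reproduced here. The sketch only shows how each character could be given a class feature and a dictionary feature before being passed to a CRF toolkit.

```python
# Hypothetical sketch: per-character features for CRF-based segmentation.
# Label scheme, character classes and lexicon are illustrative assumptions.

def words_to_tags(words):
    """Map a segmented sentence to per-character B/M/E/S labels."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def char_class(ch):
    """Assign a coarse character class (assumed classes, not the paper's)."""
    if ch in "一二三四五六七八九十百千萬":
        return "NUM"
    if ch in "，。、；：？！":
        return "PUNC"
    return "HAN"


def dict_tags(chars, lexicon, max_len=4):
    """Mark dictionary words found by forward maximum matching.

    Characters inside a matched multi-character word get B/M/E;
    all other characters get O.
    """
    tags = ["O"] * len(chars)
    i = 0
    while i < len(chars):
        # Try the longest candidate first, down to two characters.
        for n in range(min(max_len, len(chars) - i), 1, -1):
            if "".join(chars[i:i + n]) in lexicon:
                tags[i:i + n] = ["B"] + ["M"] * (n - 2) + ["E"]
                i += n
                break
        else:
            i += 1  # no dictionary word starts here
    return tags


def featurize(sentence, lexicon):
    """Build one feature dict per character of an unsegmented sentence."""
    chars = list(sentence)
    d_tags = dict_tags(chars, lexicon)
    return [
        {"char": c, "class": char_class(c), "dict": t}
        for c, t in zip(chars, d_tags)
    ]


if __name__ == "__main__":
    lexicon = {"世尊", "阿難"}            # toy lexicon, assumed
    gold = ["世尊", "告", "阿難"]          # toy gold segmentation, assumed
    for feats, tag in zip(featurize("".join(gold), lexicon), words_to_tags(gold)):
        print(feats, tag)
```

Rows of this kind (features plus a gold label per character) are the usual training input for a sequence labeller; the choice of which neighbouring characters and feature combinations to include is what a feature template specifies.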

Key words: Conditional Random Fields Model; Segmentation Consistency; Middle Ancient Chinese; Word Segmentation
Received: 14 March 2017      Published: 06 June 2017

Xiaoyu Wang, Bin Li. Automatically Segmenting Middle Ancient Chinese Words with CRFs. Data Analysis and Knowledge Discovery, 2017, 1(5): 62-70.

