Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (5): 62-70    DOI: 10.11925/infotech.2096-3467.2017.05.08
Original Article
Automatically Segmenting Middle Ancient Chinese Words with CRFs
Xiaoyu Wang, Bin Li
School of Chinese Language and Literature, Nanjing Normal University, Nanjing 210097, China
Abstract  

[Objective] This paper examines how segmentation consistency and corpus type affect automatic word segmentation of Middle Ancient Chinese (MAC). It aims to improve the accuracy and efficiency of this basic step in processing ancient Chinese using a conditional random fields (CRFs) model. [Methods] First, we refined the segmentation principles for three types of MAC text: historical records, Buddhist scriptures, and novels. Then, we combined the CRFs model with a dictionary to reduce the inconsistency introduced by manual segmentation. Finally, we added two features to the CRFs model (character classification and dictionary information) and identified the best segmentation feature template through comparative experiments. [Results] The F-score exceeded 99% in the closed test and ranged from 89% to 95% in the open test. [Limitations] Segmentation consistency improved for two-character words; further study is needed on words with more than three characters. [Conclusions] The proposed method effectively improves the accuracy of automatic word segmentation for MAC corpora.
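The abstract does not give the feature template itself, so the following is only a minimal sketch of how character-based CRF input with the two added features (character classification and dictionary information) might be prepared, and how a word-level F-score could be computed. The B/M/E/S tag set, the character classes in `CHAR_CLASSES`, the forward-maximum-matching lookup, and every function name are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List, Set


def words_to_tags(words: List[str]) -> List[str]:
    """Convert a segmented sentence into per-character B/M/E/S tags."""
    tags: List[str] = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


# Hypothetical character classes; the paper's actual classification is not given.
CHAR_CLASSES: Dict[str, Set[str]] = {
    "numeral": set("一二三四五六七八九十百千萬"),
    "function": set("之乎者也矣焉哉"),
}


def char_class(ch: str) -> str:
    """Return the (assumed) class label of a character."""
    for name, members in CHAR_CLASSES.items():
        if ch in members:
            return name
    return "other"


def dict_tags(chars: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Mark dictionary words found by forward maximum matching.

    Characters inside a matched multi-character word get B/M/E;
    all other characters get "O".
    """
    tags = ["O"] * len(chars)
    i = 0
    while i < len(chars):
        for n in range(min(max_len, len(chars) - i), 1, -1):
            if chars[i:i + n] in lexicon:
                tags[i] = "B"
                for j in range(i + 1, i + n - 1):
                    tags[j] = "M"
                tags[i + n - 1] = "E"
                i += n
                break
        else:
            i += 1  # no dictionary word starts here
    return tags


def char_features(chars: str, lexicon: Set[str]) -> List[Dict[str, str]]:
    """Build one feature dict per character for a CRF toolkit."""
    dtags = dict_tags(chars, lexicon)
    feats = []
    for i, ch in enumerate(chars):
        feats.append({
            "c0": ch,
            "c-1": chars[i - 1] if i > 0 else "<BOS>",
            "c+1": chars[i + 1] if i < len(chars) - 1 else "<EOS>",
            "class0": char_class(ch),
            "dict0": dtags[i],
        })
    return feats


def f_score(gold: List[str], pred: List[str]) -> float:
    """Word-level F-score: a word is correct if its character span matches."""
    def spans(words: List[str]) -> Set[tuple]:
        out, start = set(), 0
        for w in words:
            out.add((start, start + len(w)))
            start += len(w)
        return out

    g, p = spans(gold), spans(pred)
    correct = len(g & p)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(p), correct / len(g)
    return 2 * precision * recall / (precision + recall)
```

The feature dicts and tag sequences could then be passed to any CRF toolkit that accepts per-token features. Which toolkit and which context window the authors used is not stated in the abstract.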

Key words: Conditional Random Fields Model; Segmentation Consistency; Middle Ancient Chinese; Word Segmentation
Received: 14 March 2017      Published: 06 June 2017

Cite this article:

Xiaoyu Wang, Bin Li. Automatically Segmenting Middle Ancient Chinese Words with CRFs. Data Analysis and Knowledge Discovery, 2017, 1(5): 62-70.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.05.08     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2017/V1/I5/62
