Data Analysis and Knowledge Discovery (数据分析与知识发现), 2017, Vol. 1, Issue 5: 62-70. https://doi.org/10.11925/infotech.2096-3467.2017.05.08
Research Paper
Automatically Segmenting Middle Ancient Chinese Words with CRFs*
(Chinese title: 基于CRFs和词典信息的中古汉语自动分词, "Automatic Word Segmentation of Middle Ancient Chinese Based on CRFs and Dictionary Information")
Wang Xiaoyu (王晓玉), Li Bin (李斌)
School of Chinese Language and Literature, Nanjing Normal University, Nanjing 210097, China
Abstract

[Objective] This paper examines how segmentation consistency and corpus type affect CRFs-based word segmentation of Middle Ancient Chinese (MAC). On that basis it seeks to improve the accuracy and efficiency of automatic segmentation, a basic step in processing ancient Chinese, and to reduce the workload of manual proofreading. [Methods] Taking MAC historical records, Buddhist scriptures and novels as examples, we refined the segmentation principles and combined the CRFs model with dictionaries to eliminate the inconsistencies that commonly arise in manually segmented MAC text. We also added two features to the CRFs model, character classification and dictionary information, and selected the most suitable feature template for each through comparison experiments. [Results] The overall F-score exceeded 99% in the closed tests and reached 89%-95% in the combined open tests. [Limitations] The study of segmentation inconsistency focused mainly on two-character words, so the recognition of words with three or more characters (multi-character words) remains somewhat weaker. [Conclusions] Provided that segmentation consistency is effectively improved, the character classification and dictionary tag features effectively raise the precision of CRFs segmentation of MAC. The proposed segmentation system can serve MAC corpora of several genres.

Keywords: Conditional Random Fields Model; Segmentation Consistency; Middle Ancient Chinese; Word Segmentation
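Received: 2017-03-14      Published: 2017-06-06
Chinese Library Classification (ZTFLH): TP391
Funding: *This work is one of the research outcomes of the Major Program of the National Social Science Fund of China "Research on the Construction of a Corpus for the Study of the History of Chinese" (Grant No. 10&ZD117), the Youth Project of Humanities and Social Sciences of the Ministry of Education "Construction and Quantitative Study of a Diachronic Chinese Lexical Database" (Grant No. 16YJC740034), and the Major Program of the National Social Science Fund of China "Construction of a Knowledge Base of Classical Texts and Humanities Computing Research Based on the Sinological Index Series (《汉学引得丛刊》)" (Grant No. 15ZDB127).
Cite this article:
Wang Xiaoyu, Li Bin. Automatically Segmenting Middle Ancient Chinese Words with CRFs[J]. Data Analysis and Knowledge Discovery, 2017, 1(5): 62-70.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2017.05.08      or      http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2017/V1/I5/62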
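| Corpus category | Training source | Characters | Total characters | Test source | Characters | Total characters |
|---|---|---|---|---|---|---|
| Historical records | Hou Han Shu (后汉书; juan 1, 34, 74; juan 2, 75, 38 incomplete) | 70 344 | 145 292 | Bei Qi Shu (北齐书; juan 1-4, open test) | 27 189 | 44 979 |
| | San Guo Zhi (三国志; Wei Shu juan 1-3; juan 4 incomplete; Wu Shu juan 46, 49) | 62 093 | | San Guo Zhi (三国志; Wei Shu juan 1-2, closed test) | 17 790 | |
| | Chen Shu (陈书; juan 1-16; juan 27-36) | 12 855 | | | | |
| Buddhist scriptures | Zhuanji Baiyuan Jing (撰集百缘经) | 80 588 | 99 157 | Bai Yu Jing (百喻经; open test) | 21 552 | 35 209 |
| | Za Piyu Jing, two versions (杂譬喻经二种) | 18 569 | | Za Piyu Jing, translator unknown (杂譬喻经, 失译; closed test) | 13 657 | |
| Novels | You Ming Lu (幽明录) | 36 718 | 36 718 | | | |
| Total | | | 281 167 | | | 80 188 |

  Table: Description of the corpora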
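| Character | Character class | Dictionary tag | Gold-standard tag |
|---|---|---|---|
| | HZ | B | S |
| | HZ | T | B |
| | HZ | T | E |
| : | Punc | W | W |
| | CNum | B | B |
| | HZ | E | E |
| | HZ | S | S |
| | HZ | S | S |
| | HZ | T | B |
| | HZ | T | M |
| | HZ | E | E |
| , | SenPunc | W | W |

  Table: Example of corpus tagging for CRFs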
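
The character-class and gold-standard columns above can be produced mechanically from a manually segmented sentence. The following is a minimal sketch, not the authors' implementation. The tag letters (B, M, E, S, W) and class names (HZ, Punc, SenPunc, CNum) come from the table; the membership of each character class, the function names, and the sample sentence are assumptions made for illustration. The dictionary-tag column is omitted because its tagging scheme is not specified here.

```python
# Assumed class membership; the actual character lists are not given in the source.
CNUM = set("一二三四五六七八九十百千万")   # assumption: Chinese numerals
SEN_PUNC = set(",,。.!?;")                # assumption: sentence-level punctuation
PUNC = set("::、「」『』《》")              # assumption: other punctuation


def char_class(ch: str) -> str:
    """Return the character class label for a single character."""
    if ch in CNUM:
        return "CNum"
    if ch in SEN_PUNC:
        return "SenPunc"
    if ch in PUNC:
        return "Punc"
    return "HZ"


def gold_tags(word: str) -> list:
    """Tag one segmented word: W for punctuation, S for a single character,
    B/M/E for the beginning, middle and end of a longer word."""
    if all(char_class(c) in ("Punc", "SenPunc") for c in word):
        return ["W"] * len(word)
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]


def to_rows(words: list) -> list:
    """Turn a segmented sentence into (character, class, gold tag) rows."""
    rows = []
    for word in words:
        for ch, tag in zip(word, gold_tags(word)):
            rows.append((ch, char_class(ch), tag))
    return rows


# Hypothetical segmented sentence, used only to show the row format.
for row in to_rows(["大将军", "曰", ":", "三", "人", "行", ","]):
    print("\t".join(row))
```

Each printed line is one training row in the column order character, character class, gold tag; a dictionary-tag column would be added between the last two once its scheme is fixed.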
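Column groups: Lit = character-level (literal) features only; Lit+Class = character-level features (1W+2C) plus character classification; Lit+Dict = character-level features (1W+2C) plus dictionary tags.

| Statistic | Lit 1W | Lit 2W | Lit 1W+2C | Lit 2W+2C | Lit+Class 0W | Lit+Class 1W | Lit+Class 2C | Lit+Class 1W+2C | Lit+Dict 0W | Lit+Dict 1W | Lit+Dict 2C | Lit+Dict 1W+2C | Template-all |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Single-character words | 1 710 | 568 | 1 918 | 648 | 1 300 | 1 403 | 1 814 | 1 384 | 1 532 | 1 597 | 1 866 | 1 790 | 1 747 |
| Two-character words | 970 | 1 541 | 866 | 1 501 | 1 094 | 1 042 | 819 | 1 045 | 923 | 906 | 803 | 837 | 833 |
| Multi-character words | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 4 | 16 | 16 | 20 | 20 | 22 |
| Correctly segmented words | 1 127 | 849 | 1 223 | 885 | 1 120 | 1 135 | 1 354 | 1 147 | 1 910 | 1 975 | 2 033 | 2 164 | 2 229 |
| Overall P | 42.05% | 40.26% | 43.93% | 41.18% | 46.78% | 46.42% | 51.35% | 47.14% | 77.30% | 78.40% | 75.60% | 81.75% | 85.66% |
| Overall R | 43.02% | 32.40% | 46.68% | 33.78% | 42.75% | 43.32% | 51.68% | 43.78% | 72.90% | 75.38% | 77.60% | 82.60% | 85.08% |
| Overall F | 42.53% | 35.91% | 45.26% | 37.11% | 44.67% | 44.82% | 51.51% | 45.40% | 75.03% | 76.86% | 76.59% | 82.17% | 85.37% |
| Correct two-character words | 361 | 535 | 334 | 522 | 446 | 414 | 347 | 423 | 592 | 640 | 620 | 634 | 662 |
| Two-character P | 37.22% | 34.72% | 38.57% | 34.78% | 40.77% | 39.73% | 42.37% | 40.48% | 64.14% | 70.64% | 77.21% | 75.75% | 79.47% |
| Two-character R | 47.81% | 70.86% | 44.24% | 69.14% | 59.07% | 54.83% | 45.96% | 56.03% | 78.41% | 84.77% | 82.12% | 83.97% | 87.68% |
| Two-character F | 41.86% | 46.60% | 41.21% | 46.28% | 48.24% | 46.08% | 44.09% | 47.00% | 70.56% | 77.06% | 79.59% | 79.65% | 83.38% |

  Table: Comparison of segmentation results after adding the character classification and dictionary tag features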
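
The P, R and F rows in this table are consistent with the standard word-level definitions: P is the number of correctly segmented words divided by the number of words the system output, R divides the same count by the number of words in the gold standard, and F is the harmonic mean of the two. A minimal sketch follows. The function names and the toy sentence are illustrative assumptions, and matching words by character offsets is a common way to count correct words rather than a procedure confirmed by the source.

```python
def spans(words):
    """Map a segmented sentence to a set of (start, end) character offsets."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def prf(correct, predicted, gold):
    """Precision, recall and F from raw word counts."""
    p = correct / predicted if predicted else 0.0
    r = correct / gold if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def evaluate(pred_words, gold_words):
    """Score one predicted segmentation against the gold segmentation."""
    pred, gold = spans(pred_words), spans(gold_words)
    return prf(len(pred & gold), len(pred), len(gold))


# Toy example (hypothetical sentence): one of three predicted words is correct.
print(evaluate(["大", "将军", "曰"], ["大将军", "曰"]))

# First data column of the table (Lit 1W): 1 710 + 970 + 0 output words, 1 127 correct.
print(f"{1127 / (1710 + 970 + 0):.2%}")  # 42.05%, the Overall P reported for that column
```

The same `prf` helper applies to the two-character rows if only two-character words are counted on both sides.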
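Segmentation results: word counts and P/R/F values of the CRFs output. Each F change applies to the pair of rows for one test corpus.

| Training corpus | Test corpus | Single-character words | Two-character words | Multi-character words | Overall P | Overall R | Overall F | Overall F change | Two-character P | Two-character R | Two-character F | Two-character F change | Multi-character P | Multi-character R | Multi-character F | Multi-character F change |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Original | Historical records | 7 764 | 3 263 | 80 | 82.05% | 85.62% | 83.79% | 15.70% | 81.67% | 80.86% | 81.26% | 18.15% | 70.00% | 20.59% | 31.82% | 66.71% |
| After consistency processing | Historical records | 7 058 | 3 309 | 270 | 99.53% | 99.46% | 99.50% | | 99.21% | 99.61% | 99.41% | | 98.89% | 98.16% | 98.52% | |
| Original | Buddhist scriptures | 5 333 | 2 690 | 70 | 88.08% | 85.67% | 86.86% | 12.38% | 78.55% | 89.95% | 83.87% | 15.33% | 50.00% | 26.12% | 34.31% | 58.28% |
| After consistency processing | Buddhist scriptures | 5 823 | 2 355 | 136 | 99.28% | 99.21% | 99.24% | | 99.07% | 99.32% | 99.19% | | 91.91% | 93.28% | 92.59% | |

  Table: Effect of segmentation consistency on CRFs segmentation results (closed test)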
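
Comparing the original corpus with the corpus after consistency processing presupposes some way of finding character sequences that were segmented differently in different places. One simple way to surface two-character candidates is sketched below. It is an illustrative assumption, not the procedure used in the paper, and the function name and sample data are hypothetical. It flags any character pair that occurs in the corpus both as one two-character word and as two adjacent single-character words.

```python
from collections import Counter


def inconsistent_bigrams(sentences):
    """Return {bigram: (times_joined, times_split)} for every character pair
    that appears both as one word and as two adjacent one-character words."""
    joined, split = Counter(), Counter()
    for words in sentences:
        for w in words:
            if len(w) == 2:
                joined[w] += 1
        for a, b in zip(words, words[1:]):
            if len(a) == 1 and len(b) == 1:
                split[a + b] += 1
    return {bg: (joined[bg], split[bg]) for bg in joined if bg in split}


# Hypothetical corpus of two segmented sentences.
corpus = [["太守", "曰"], ["太", "守", "至"]]
print(inconsistent_bigrams(corpus))
```

Flagged pairs are only candidates: some character pairs are legitimately one word in one context and two in another, so a dictionary lookup or manual review would still be needed before unifying them.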
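Segmentation results: word counts and P/R/F values of the CRFs output. Each F change applies to the pair of rows for one test corpus.

| Training corpus | Test corpus | Single-character words | Two-character words | Multi-character words | Overall P | Overall R | Overall F | Overall F change | Two-character P | Two-character R | Two-character F | Two-character F change | Multi-character P | Multi-character R | Multi-character F | Multi-character F change |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Historical records | Historical records | 7 764 | 3 263 | 80 | 99.73% | 99.71% | 99.72% | 0.22% | 99.61% | 99.79% | 99.70% | 0.29% | 99.26% | 98.90% | 99.08% | 0.56% |
| Combined | Historical records | 7 058 | 3 309 | 270 | 99.53% | 99.46% | 99.50% | | 99.21% | 99.61% | 99.41% | | 98.89% | 98.16% | 98.52% | |
| Buddhist scriptures | Buddhist scriptures | 5 333 | 2 690 | 70 | 99.44% | 99.45% | 99.44% | 0.20% | 99.53% | 99.32% | 99.42% | 0.23% | 93.43% | 95.52% | 94.46% | 1.87% |
| Combined | Buddhist scriptures | 5 823 | 2 355 | 136 | 99.28% | 99.21% | 99.24% | | 99.07% | 99.32% | 99.19% | | 91.91% | 93.28% | 92.59% | |

  Table: Effect of corpus heterogeneity on CRFs segmentation results (closed test)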
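
The F change values in this table are consistent with the absolute difference, in percentage points, between the two F-scores in each pair of rows. A minimal check under that assumption (the helper name is hypothetical) uses the two-character F values for historical records:

```python
def f_change(f_a: float, f_b: float) -> float:
    """Absolute difference between two F-scores, in percentage points."""
    return round(abs(f_a - f_b), 2)


# Two-character words, historical records, closed test:
# trained on historical records only (99.70%) vs. on the combined corpus (99.41%).
print(f_change(99.70, 99.41))  # 0.29
```

Because the difference is computed from rounded percentages, a result can differ from a published cell by 0.01 where the original was computed from unrounded scores.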
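Segmentation results: word counts and P/R/F values of the CRFs output. Each F change applies to the pair of rows for one test corpus.

| Training corpus | Test corpus | Single-character words | Two-character words | Multi-character words | Overall P | Overall R | Overall F | Overall F change | Two-character P | Two-character R | Two-character F | Two-character F change | Multi-character P | Multi-character R | Multi-character F | Multi-character F change |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Original | Historical records | 10 745 | 5 520 | 230 | 80.24% | 85.27% | 82.67% | 6.98% | 83.22% | 84.51% | 83.86% | 6.40% | 62.17% | 20.11% | 30.39% | 29.74% |
| After consistency processing | Historical records | 9 834 | 5 503 | 513 | 88.73% | 90.61% | 89.66% | | 89.71% | 90.82% | 90.26% | | 71.73% | 51.76% | 60.13% | |
| Original | Buddhist scriptures | 8 482 | 4 203 | 76 | 92.30% | 88.46% | 90.34% | 4.13% | 81.51% | 94.72% | 87.62% | 5.38% | 60.53% | 52.87% | 56.44% | 7.89% |
| After consistency processing | Buddhist scriptures | 9 113 | 3 875 | 84 | 95.35% | 93.61% | 94.47% | | 89.91% | 96.32% | 93.01% | | 65.48% | 63.22% | 64.33% | |

  Table: Effect of segmentation consistency on CRFs segmentation results (open test)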
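Segmentation results: word counts and P/R/F values of the CRFs output. Each F change applies to the pair of rows for one test corpus.

| Training corpus | Test corpus | Single-character words | Two-character words | Multi-character words | Overall P | Overall R | Overall F | Overall F change | Two-character P | Two-character R | Two-character F | Two-character F change | Multi-character P | Multi-character R | Multi-character F | Multi-character F change |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Historical records | Historical records | 9 668 | 5 562 | 526 | 88.61% | 89.94% | 89.27% | 0.39% | 88.76% | 90.82% | 89.78% | 0.48% | 68.82% | 50.91% | 58.53% | 1.60% |
| Combined | Historical records | 9 834 | 5 503 | 513 | 88.73% | 90.61% | 89.66% | | 89.71% | 90.82% | 90.26% | | 71.73% | 51.76% | 60.13% | |
| Buddhist scriptures | Buddhist scriptures | 9 085 | 3 902 | 76 | 94.82% | 93.02% | 93.91% | 0.56% | 89.06% | 96.07% | 92.43% | 0.57% | 65.79% | 57.47% | 61.35% | 2.98% |
| Combined | Buddhist scriptures | 9 113 | 3 875 | 84 | 95.35% | 93.61% | 94.47% | | 89.91% | 96.32% | 93.01% | | 65.48% | 63.22% | 64.33% | |

  Table: Effect of corpus heterogeneity on CRFs segmentation results (open test)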