Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (2): 96-104    DOI: 10.11925/infotech.2096-3467.2017.0990
Research Paper
An Unsupervised Approach to Optimize Chinese Word Segmentation on Domain Literature
Weijian Ni, Haohao Sun, Tong Liu, Qingtian Zeng
College of Computer Science and Technology, Shandong University of Science and Technology, Qingdao 266510, China
Abstract

[Objective] This paper aims to improve Chinese word segmentation on domain literature by refining the output of existing segmenters. [Methods] We first analyzed the errors that conventional segmenters make on domain literature. Based on this analysis, we proposed a new criterion, Term Frequency Deviation (TFD), that captures the word formation characteristics of domain literature, and then developed an unsupervised segmentation refining approach built on TFD. [Results] Experiments on an agricultural corpus showed that the approach improved the F1 measure of three popular Chinese word segmenters (ICTCLAS, THULAC and LTP) by 2% to 3%. The approach is simple to implement and robust to parameter settings. [Limitations] The approach does little to improve recall. [Conclusions] The TFD-based refining algorithm improves the accuracy of existing segmentation results on domain literature. Because it requires neither a domain-specific vocabulary nor a manually annotated corpus, it adapts well to other domains.

Key words: Domain Literature; Chinese Word Segmentation; Segmentation Refining; Term Frequency Deviation
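Received: 2017-09-28
Funding: *This work is one of the research outcomes of the National Natural Science Foundation of China projects "Structured Recommendation Techniques for User Groups and Their Applications" (Grant No. 61602278), "Automatic Modeling of Emergency Plan Process Graphs and Its Application in Scenario-Based Diagnosis" (Grant No. 71704096), and "Multi-Granularity Knowledge Fusion in the Agricultural Big Data Environment" (Grant No. 31671588).
Cite this article:
Weijian Ni, Haohao Sun, Tong Liu, Qingtian Zeng. An Unsupervised Approach to Optimize Chinese Word Segmentation on Domain Literature[J]. Data Analysis and Knowledge Discovery, 2018, 2(2): 96-104. DOI: 10.11925/infotech.2096-3467.2017.0990.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2017.0990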
Figure 1  Example of ICTCLAS segmentation
Figure 2  Workflow of the segmentation refining method
Figure 3  Curve of sorted log-TFD values
Segmentation       Documents   Sentences   Characters   Words
ICTCLAS            10,697      852,051     6,351,134    4,275,300
THULAC                                                  4,078,779
LTP                                                     4,008,380
Manual annotation  500         40,173      298,434      223,011
Table 1  Statistics of the experimental corpus
Figure 4  Sample of the experimental corpus
Figure 5  Correlation between TFD and MI
Figure 6  Error rates of the segmentation indicators
Approach   ICTCLAS                       THULAC                        LTP
           Precision  Recall   F1        Precision  Recall   F1        Precision  Recall   F1
BASE       74.60%     97.72%   84.61%    76.62%     94.08%   84.46%    77.84%     92.57%   84.57%
TFD        83.49%     89.18%   86.24%    82.09%     90.66%   86.16%    82.52%     90.28%   86.22%
rTFD       82.98%     92.12%   87.31%    83.29%     90.27%   86.65%    83.66%     89.71%   86.58%
MI         83.31%     84.98%   84.13%    83.26%     83.06%   83.16%    83.12%     85.20%   84.14%
rMI        84.47%     87.70%   86.05%    79.03%     91.85%   84.96%    81.08%     89.98%   85.30%
Table 2  Segmentation refining results
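The precision, recall and F1 columns in Table 2 can be illustrated with a minimal sketch. This page does not say how matches were counted, so the sketch assumes boundary-level counting (a predicted cut position is correct if the reference segmentation cuts at the same character offset). That choice, the function names `boundaries`, `boundary_prf1` and `f1_score`, and the sample sentence are illustrative assumptions, not the authors' evaluation code. F1 is the usual harmonic mean of precision and recall.

```python
from typing import List, Set, Tuple


def boundaries(words: List[str]) -> Set[int]:
    """Return the internal cut positions (character offsets) of a segmentation."""
    cuts = set()
    offset = 0
    for word in words[:-1]:  # the end of the sentence is not an internal cut
        offset += len(word)
        cuts.add(offset)
    return cuts


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def boundary_prf1(predicted: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """Boundary-level precision, recall and F1 for one sentence (assumed scheme)."""
    if "".join(predicted) != "".join(gold):
        raise ValueError("both segmentations must cover the same characters")
    pred_cuts, gold_cuts = boundaries(predicted), boundaries(gold)
    correct = len(pred_cuts & gold_cuts)
    # Guard against empty sets (single-word sentences) by returning 0.0.
    precision = correct / len(pred_cuts) if pred_cuts else 0.0
    recall = correct / len(gold_cuts) if gold_cuts else 0.0
    return precision, recall, f1_score(precision, recall)


if __name__ == "__main__":
    # Hypothetical over-segmented domain term versus a reference segmentation.
    predicted = ["小麦", "赤", "霉", "病", "防治"]
    gold = ["小麦", "赤霉病", "防治"]
    print(boundary_prf1(predicted, gold))
    # Consistency check against the BASE / ICTCLAS cell of Table 2.
    print(round(100 * f1_score(0.7460, 0.9772), 2))
```

In the hypothetical sentence, the over-segmented prediction keeps every reference cut but adds two extra ones, so under this counting recall stays at 1.0 while precision drops to 0.5. The last line recomputes F1 from the reported precision and recall of the BASE row for ICTCLAS and agrees with the 84.61% shown in the table.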
Figure 7  Sensitivity to the indicator computation parameters
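Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing    Postal code: 100190
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn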