The Research of Character-Position-Based Chinese Word Segmentation*

doi:10.11925/infotech.1003-3513.2008.05.07

New Technology of Library and Information Service

2008, Vol. 24, Issue 5: 39-43 https://doi.org/10.11925/infotech.1003-3513.2008.05.07

Knowledge Organization and Knowledge Management


Zhang Jinzhu, Zhang Dong, Wang Huilin

(Institute of Scientific and Technical Information of China, Beijing 100038, China)


Abstract

This paper reviews the current state of automatic Chinese word segmentation, describes several representative segmentation approaches, and proposes a character-position-based segmentation method. The method takes the Chinese character as the smallest unit and derives the probability distribution of words from the probability distribution of characters, so it performs better than other approaches at unknown word recognition. The method is implemented with maximum entropy, a machine-learning technique, and two experiments are used to compare and analyze the results.

Keywords: Chinese word segmentation, character position, maximum entropy, unknown word recognition
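To make the character-position idea concrete, the following is a minimal sketch of how a segmented sentence can be turned into one position tag per character and back again. The four-tag scheme (B, M, E, S) and the function names `words_to_tags` and `tags_to_words` are assumptions chosen for illustration; the abstract does not state which tag set or code the authors used.

```python
from typing import List

# Hypothetical four-tag scheme (an assumption, not taken from the paper):
# B = first character of a word, M = middle character,
# E = last character of a word, S = single-character word.


def words_to_tags(words: List[str]) -> List[str]:
    """Convert a segmented sentence into one position tag per character."""
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # skip empty strings so they produce no tags
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their position tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # these tags close a word
            words.append(current)
            current = ""
    if current:  # keep any trailing characters left open by the tags
        words.append(current)
    return words


segmented = ["中文", "分词", "方法", "研究"]
tags = words_to_tags(segmented)
restored = tags_to_words("".join(segmented), tags)
```

Under this framing, segmentation becomes a per-character classification task: a classifier such as a maximum entropy model would predict a tag for each character, and `tags_to_words` would turn the predicted tags into words. Because the labels attach to characters rather than dictionary entries, a word that never appeared in training can still be assembled from its characters.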

Received: 2007-12-28 Published: 2008-05-25

CLC number: TP311, TP18

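Funding:

*This paper is one of the research outcomes of the Institute of Scientific and Technical Information of China discipline development project "Language Technology and Knowledge Technology" (No. 2007DP01-8) and the National Key Technology R&D Program project "Research and Application of Key Technologies for a Multilingual Information Service Environment" (No. 2006BAH03B02).

Corresponding author: Zhang Jinzhu E-mail: zhjzh1016@163.com

Authors: Zhang Jinzhu, Zhang Dong, Wang Huilin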

Cite this article:

Zhang Jinzhu, Zhang Dong, Wang Huilin. The Research of Character-Position-Based Chinese Word Segmentation*[J]. New Technology of Library and Information Service, 2008, 24(5): 39-43.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2008.05.07 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2008/V24/I5/39


