The Research of Character-Position-Based Chinese Word Segmentation
Zhang Jinzhu, Zhang Dong, Wang Huilin
(Institute of Scientific and Technical Information of China, Beijing 100038, China)
Abstract This paper reviews the current state of Chinese word segmentation and introduces several representative approaches, then proposes a character-position-based segmentation method that takes the Chinese character as the smallest unit. The method derives the probability distribution of a word from the probability distributions of its characters, so it performs much better than other approaches at unknown word recognition. The method is implemented with a machine-learning technique, maximum entropy, and two experiments are carried out to compare and analyze the results.
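To make the idea of character positions concrete, the following is a minimal sketch of how a segmented sentence can be expressed as one position tag per character and then recovered from those tags. The abstract does not state which tag set the paper uses; the four tags B (word-initial), M (word-internal), E (word-final) and S (single-character word), as well as the function names `words_to_tags` and `tags_to_words`, are assumptions made here for illustration. In a full system a classifier such as a maximum entropy model would predict the tags; that step is not shown.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Map every character of a segmented sentence to a position tag.

    Assumed tag set (not taken from the paper): B, M, E, S.
    """
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            # A one-character word is tagged as a single unit.
            tags.append("S")
        else:
            # First character B, last character E, everything between M.
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild the word sequence from characters and their position tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            # E and S both close the current word.
            words.append(current)
            current = ""
    if current:
        # Keep any trailing characters if the tag sequence ends mid-word.
        words.append(current)
    return words


if __name__ == "__main__":
    segmented = ["中文", "分词", "很", "有趣"]
    tags = words_to_tags(segmented)
    print(tags)
    print(tags_to_words("".join(segmented), tags))
```

Because the tags are assigned to characters rather than looked up in a word list, a word never seen in training can still be recovered as long as its characters receive plausible position tags, which is the intuition behind the unknown-word claim in the abstract.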
Received: 28 December 2007
Published: 25 May 2008
Corresponding Authors:
Zhang Jinzhu
E-mail: zhjzh1016@163.com
About authors: Zhang Jinzhu, Zhang Dong, Wang Huilin