Sense Disambiguation of Chinese Segmentation Based on Bi-direction Matching Method and HMM
Mai Fanjin¹, Wang Ting²
¹(Modern Education Technology Center, Guilin University of Technology, Guilin 541004, China)
²(Department of Electronic and Computer Science, Guilin University of Technology, Guilin 541004, China)
This paper proposes a model for resolving ambiguity in Chinese word segmentation. The model first segments the text with both forward maximum matching (MM) and reverse maximum matching (RMM). It then compares the two segmentation results with each other and outputs a more accurate segmentation. The process is divided into three stages: discovery, extraction, and disambiguation. Test results show that the model reduces the segmentation error rate caused by word segmentation ambiguity.
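The discovery stage described above can be illustrated with a minimal sketch, assuming a plain set-based lexicon and a fixed maximum word length. The function names, the `max_len` parameter, and the toy lexicon are hypothetical and are not taken from the paper. The sketch covers only the MM/RMM comparison; the extraction and HMM-based disambiguation stages are not shown, since the abstract gives no details on them.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation (MM): at each position take the
    longest lexicon word, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, lexicon, max_len=4):
    """Greedy right-to-left segmentation (RMM): same idea, scanning from
    the end of the text toward the start."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    words.reverse()  # restore left-to-right order
    return words


def discover_ambiguity(text, lexicon, max_len=4):
    """Return both segmentations and a flag that is True when they differ,
    i.e. when the text would be passed on for disambiguation."""
    fwd = forward_max_match(text, lexicon, max_len)
    rev = reverse_max_match(text, lexicon, max_len)
    return fwd, rev, fwd != rev


# Toy lexicon (an assumption for illustration only).
LEXICON = {"研究", "研究生", "生命", "的", "起源"}

if __name__ == "__main__":
    fwd, rev, ambiguous = discover_ambiguity("研究生命的起源", LEXICON)
    print(fwd)        # ['研究生', '命', '的', '起源']
    print(rev)        # ['研究', '生命', '的', '起源']
    print(ambiguous)  # True
```

In this toy case the two directions disagree on "研究生命", which is the kind of span a later stage would extract and resolve; when both directions agree, the shared result can be output directly.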
Mai Fanjin, Wang Ting. Sense Disambiguation of Chinese Segmentation Based on Bi-direction Matching Method and HMM[J]. New Technology of Library and Information Service, 2008, 24(8): 37-41.