New Technology of Library and Information Service  2008, Vol. 24 Issue (5): 39-43    DOI: 10.11925/infotech.1003-3513.2008.05.07
The Research of Character-Position-Based Chinese Word Segmentation
Zhang Jinzhu   Zhang Dong   Wang Huilin
(Institute of Scientific and Technical Information of China, Beijing 100038, China)
Abstract  

This paper reviews the current state of Chinese word segmentation and introduces several representative approaches, then proposes a character-position-based segmentation method that takes the Chinese character as the smallest unit. The method expresses the probability distribution of a word through the probability distributions of its characters, so it performs much better than the other approaches at unknown word recognition. The method is implemented with maximum entropy, a machine-learning technique, and two experiments are used to compare and analyze the results.
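The idea of treating segmentation as labeling each character by its position in a word can be illustrated with a minimal Python sketch. The four-tag set used here (B = word-initial, M = word-internal, E = word-final, S = single-character word) is an assumption chosen for illustration; the abstract does not say which tag set the authors use. The function names and the sample sentence are also hypothetical. In a full system, a classifier such as the maximum entropy model mentioned above would predict the tags for unsegmented text; this sketch covers only the conversion between words and tags.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Label every character with its position in its word (assumed B/M/E/S scheme)."""
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # ignore empty strings rather than emit bogus tags
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild a word sequence from characters and their position tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at E or S
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that stops mid-word
        words.append(current)
    return words


# Hypothetical training sentence, already segmented by hand.
segmented = ["中文", "分词", "很", "有趣"]
tags = words_to_tags(segmented)
restored = tags_to_words("".join(segmented), tags)
```

Because a word boundary is recovered purely from character-level tags, a word that never appeared in training can still be segmented correctly if its characters receive the right tags, which is the intuition behind the unknown-word claim in the abstract.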

Key words: Chinese word segmentation; Character-position; Maximum entropy; Unknown word recognition
Received: 28 December 2007      Published: 25 May 2008
CLC Number: TP311; TP18

Corresponding Authors: Zhang Jinzhu     E-mail: zhjzh1016@163.com
About author: Zhang Jinzhu, Zhang Dong, Wang Huilin

Cite this article:

Zhang Jinzhu, Zhang Dong, Wang Huilin. The Research of Character-Position-Based Chinese Word Segmentation. New Technology of Library and Information Service, 2008, 24(5): 39-43.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2008.05.07     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2008/V24/I5/39

[1] Yao Min. Research on Automatic Chinese Word Segmentation and Chinese Personal Name Recognition[D]. Zhejiang: Zhejiang University, 2006. (in Chinese)
[2] Liu Wu. Research on a Chinese Word Segmentation System Based on Statistical Machine Learning Algorithms[D]. Beijing: Beijing University of Posts and Telecommunications, 2006. (in Chinese)
[3] Qi Zhenghua. Research on Dictionary-Free Chinese Word Segmentation Methods[D]. Nanjing: Nanjing Institute of Posts and Telecommunications, 2005. (in Chinese)
[4] Gan K W. Integrating Word Boundary Disambiguation with Sentence Understanding[D]. Singapore: National University of Singapore, 1995.
[5] Xue N, Shen L. Chinese Word Segmentation as LMR Tagging[C]. Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, 2003: 176-179.
[6] Xue N. Chinese Word Segmentation as Character Tagging[J]. International Journal of Computational Linguistics and Chinese Language Processing, 2003: 29-48.
[7] Sproat R, Shih C L. A Statistical Method for Finding Word Boundaries in Chinese Text[J]. Computer Processing of Chinese and Oriental Languages, 1990, 4(4): 336-351.
[8] Berger A L, Della Pietra V J, Della Pietra S A. A Maximum Entropy Approach to Natural Language Processing[J]. Computational Linguistics, 1996, 22(1): 8-15.
[9] Darroch J N, Ratcliff D. Generalized Iterative Scaling for Log-Linear Models[J]. Annals of Mathematical Statistics, 1972, 43(5): 1470-1480.
[10] Della Pietra S, Della Pietra V, Lafferty J. Inducing Features of Random Fields[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997, 19(4): 380-393.
[11] Ratnaparkhi A. A Maximum Entropy Part-of-speech Tagger[C]. In Proceedings of the Empirical Methods in Natural Language Processing Conference, University of Pennsylvania, 1996.
[12] Nakagawa T. Chinese and Japanese Word Segmentation Using Word-Level and Character-Level Information[C]. In Proceedings of COLING, 2004.

[1] Guoming Feng, Xiaodong Zhang, Suhui Liu. DBLC Model for Word Segmentation Based on Autonomous Learning[J]. Data Analysis and Knowledge Discovery, 2018, 2(5): 40-47.
[2] Weijian Ni, Haohao Sun, Tong Liu, Qingtian Zeng. An Unsupervised Approach to Optimize Chinese Word Segmentation on Domain Literature[J]. Data Analysis and Knowledge Discovery, 2018, 2(2): 96-104.
[3] Yue Zhang, Dongbo Wang, Danhao Zhu. Segmenting Chinese Words from Food Safety Emergencies[J]. Data Analysis and Knowledge Discovery, 2017, 1(2): 64-72.
[4] Yu Xincong, Li Honglian, Lv Xueqiang. Research on the Application of Hyponymy in the Enrollment Robot[J]. New Technology of Library and Information Service, 2015, 31(12): 65-71.
[5] Zhang Jie, Zhang Haichao, Zhai Dongsheng. Research of the Word Segmentation for Chinese Patent Claims[J]. New Technology of Library and Information Service, 2014, 30(9): 91-98.
[6] Li Wenjiang, Chen Shiqin. Application of AIMLBot Intelligent Robot in Real-time Virtual Reference Service[J]. New Technology of Library and Information Service, 2012, 28(7): 127-132.
[7] Jiang Hua, Su Xiaoguang. Chinese High-frequency Words Extraction Algorithm Without Thesaurus[J]. New Technology of Library and Information Service, 2012, 28(6): 50-53.
[8] Shi Chongde, Wang Huilin. Research on Chinese Word Segmentation Optimization in Statistical Machine Translation[J]. New Technology of Library and Information Service, 2012, 28(4): 29-34.
[9] Yu Chuanming, Huang Jianqiu, Guo Fei. Recognizing Named Entity from Free-text Customer Reviews: A Maximum Entropy Model-based Approach[J]. New Technology of Library and Information Service, 2011, 27(5): 77-82.
[10] Gu Jun, Wang Hao. Study on Term Extraction on the Basis of Chinese Domain Texts[J]. New Technology of Library and Information Service, 2011, 27(4): 29-34.
[11] Xie Hui, Qin Jie, Hu Shuangshuang. The Study on the Duplicated Web Pages Detection Algorithm Based on the Keyword from User’s Submission[J]. New Technology of Library and Information Service, 2008, 24(7): 43-46.
[12] Yao Xingshan. The Improvement in a Chinese Word Segmentation Based on Hash Algorism[J]. New Technology of Library and Information Service, 2008, 24(3): 78-81.
[13] Wu Shaogen. Study of Scheme Automaton for Chinese Word Automatic Segmentation[J]. New Technology of Library and Information Service, 2006, 1(5): 47-49.