New Technology of Library and Information Service  2012, Vol. 28 Issue (6): 50-53    DOI: 10.11925/infotech.1003-3513.2012.06.08
Chinese High-frequency Words Extraction Algorithm Without Thesaurus
Jiang Hua, Su Xiaoguang
Department of Equipment Economics and Management, Naval University of Engineering, Wuhan 430033, China
Abstract  Building on the PAT array, this paper introduces an LCP array that records the lengths of the common prefixes of adjacent text suffixes, and presents a new thesaurus-free algorithm that extracts high-frequency words from Chinese text by scanning the LCP array. The algorithm does not depend on a segmentation dictionary and can extract any repeated string, in particular new words and combined words. Experimental results show that the high-frequency words extracted by the algorithm achieve a high acceptance rate, and that the algorithm is more effective than ICTCLAS at extracting combined words.
Key words  Chinese information processing; High-frequency word extraction; PAT array; Chinese word segmentation; Keyword detection
Received: 27 March 2012      Published: 30 August 2012
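
The idea summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the scanning procedure, candidate filtering, or thresholds, so everything below is an assumption for illustration and not the authors' implementation. The function names (`suffix_array`, `lcp_array`, `repeated_strings`) and parameters (`min_len`, `max_len`, `min_freq`) are hypothetical. The suffix array is built naively by sorting suffixes, which is adequate only for short texts. A run of adjacent suffixes whose LCP values are all at least `length` shares one prefix of that length, and the size of the run is the number of occurrences of that prefix.

```python
def suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix itself.
    # Fine for a sketch; real PAT-array builders use faster methods.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    # lcp[k] = length of the common prefix of suffixes sa[k-1] and sa[k];
    # lcp[0] is 0 because the first suffix has no predecessor.
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        lcp[k] = n
    return lcp


def repeated_strings(text, min_len=2, max_len=4, min_freq=2):
    # Assumes min_freq >= 2, so every reported string is a true repeat.
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    counts = {}
    for length in range(min_len, max_len + 1):
        run = 1  # number of adjacent suffixes sharing a prefix of this length
        for k in range(1, len(sa) + 1):
            if k < len(sa) and lcp[k] >= length:
                run += 1
            else:
                # The run ended at position k-1 of the suffix array.
                if run >= min_freq:
                    start = sa[k - 1]
                    counts[text[start:start + length]] = run
                run = 1
    return counts


if __name__ == "__main__":
    sample = "信息处理和信息检索都是信息技术"
    print(repeated_strings(sample))  # {'信息': 3}
```

No dictionary is consulted at any point: candidates come purely from repetition in the text. A practical system would still need to filter the candidates, for example by discarding substrings that never occur independently of a longer repeated string, and the abstract does not say how the paper handles this.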



Cite this article:

Jiang Hua, Su Xiaoguang. Chinese High-frequency Words Extraction Algorithm Without Thesaurus. New Technology of Library and Information Service, 2012, 28(6): 50-53.

