New Technology of Library and Information Service  2013, Vol. Issue (4): 54-61    DOI: 10.11925/infotech.1003-3513.2013.04.09
Study on Text Language Recognition Based on N-Gram
Wang Hao, Li Sishu, Deng Sanhong
School of Information Management, Nanjing University, Nanjing 210093, China
Abstract  This paper implements a language recognition program, based on the N-Gram language model, that automatically identifies texts written in the most popular languages on the Internet: Simplified Chinese, Traditional Chinese, English, French, German, Russian and Korean. The language recognition experiments are divided into two stages, training on a multilingual corpus and testing of language recognition; both the training and the testing texts come from the Open Directory Project. The program is evaluated in the language recognition test and is also compared against TextCat, another N-Gram-based language recognition program. The experimental results show that the program performs well on Simplified Chinese, Traditional Chinese and German, that its accuracy on Russian, French and English is somewhat lower, and that Korean is consistently confused with Chinese.
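The abstract does not spell out the scoring method, so the following is only a minimal sketch of the two stages it names (training language profiles, then testing a text against them). It assumes the rank-order "out-of-place" comparison of character N-gram profiles that TextCat is known for; the function names (`char_ngrams`, `build_profile`, `out_of_place`, `identify`) and the parameters `n_max` and `top_k` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def char_ngrams(text, n_max=3):
    """Yield character n-grams of length 1..n_max, padding each token with '_'."""
    for word in text.lower().split():
        padded = f"_{word}_"
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]


def build_profile(text, top_k=300, n_max=3):
    """Training stage: map the top_k most frequent n-grams to their rank (0 = most frequent)."""
    counts = Counter(char_ngrams(text, n_max))
    # Sort by descending frequency; break ties alphabetically so ranks are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile get a fixed penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, lang_profiles, top_k=300, n_max=3):
    """Testing stage: return the language whose profile is closest to the text's profile."""
    doc_profile = build_profile(text, top_k, n_max)
    return min(lang_profiles, key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]))


# Toy training corpus (assumption: real profiles would be built from much larger texts).
training = {
    "en": "the quick brown fox jumps over the lazy dog and the cat is in the house",
    "ru": "быстрая коричневая лиса прыгает через ленивую собаку и кошка в доме",
}
profiles = {lang: build_profile(text) for lang, text in training.items()}
print(identify("the cat and the dog", profiles))
```

Because Chinese text is not whitespace-delimited, a sketch like this treats a whole Chinese passage as one token, which is one plausible reason character N-gram profiles of languages sharing Han characters can interfere with each other.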
Key words: N-Gram, Language recognition, Corpus, Text classification
Received: 21 March 2013      Published: 17 June 2013
CLC number: TP391

Cite this article:

Wang Hao, Li Sishu, Deng Sanhong. Study on Text Language Recognition Based on N-Gram. New Technology of Library and Information Service, 2013, (4): 54-61.

