New Technology of Library and Information Service  2013, Vol. Issue (4): 54-61    DOI: 10.11925/infotech.1003-3513.2013.04.09
Study on Text Language Recognition Based on N-Gram
Wang Hao, Li Sishu, Deng Sanhong
School of Information Management, Nanjing University, Nanjing 210093, China
Abstract  This paper implements a language recognition program, based on the N-Gram language model, that automatically identifies the language of texts written in the most popular languages on the Internet: Simplified Chinese, Traditional Chinese, English, French, German, Russian and Korean. The language recognition experiments are divided into two stages, training on a multilingual corpus and testing of language recognition; the training and test texts both come from the Open Directory Project. The program is evaluated in the language recognition test and compared against TextCat, another N-Gram-based language recognition program. The results show that the program performs well on Simplified Chinese, Traditional Chinese and German, that its accuracy on Russian, French and English ranks next, and that Korean is consistently confused with Chinese in these experiments.
Key words  N-Gram      Language recognition      Corpus      Text classification
Received: 21 March 2013      Published: 17 June 2013
CLC number:  TP391
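
The two stages named in the abstract (training on a multilingual corpus, then testing on unseen text) can be illustrated with a minimal sketch. The abstract does not specify the paper's model details, so the sketch below substitutes the rank-order character N-gram profile method that TextCat is known for; it is not the authors' program. The function names, the toy corpus, and the parameters `n_max` and `top_k` are all assumptions made for illustration.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k most frequent character n-grams (lengths 1..n_max),
    most frequent first. n_max and top_k are illustrative defaults."""
    # Normalize whitespace and pad with spaces so word boundaries form n-grams.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties by the n-gram itself for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between two profiles; n-grams missing from
    the language profile receive a fixed maximum penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def train(corpus):
    """Training stage: build one n-gram profile per language."""
    return {lang: ngram_profile(" ".join(texts)) for lang, texts in corpus.items()}


def identify(text, profiles):
    """Testing stage: return the language whose profile is closest to the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Hypothetical toy corpus; the paper trained on Open Directory Project texts.
toy_corpus = {
    "en": ["the quick brown fox jumps over the lazy dog and runs away"],
    "de": ["der schnelle braune fuchs springt über den faulen hund"],
    "fr": ["le renard brun rapide saute par-dessus le chien paresseux"],
    "ru": ["быстрая коричневая лиса прыгает через ленивую собаку"],
}
profiles = train(toy_corpus)
print(identify("the dog runs over the fox", profiles))
```

A toy corpus of this size is far too small for reliable identification; it only shows the data flow. Scripts that share characters, as Korean and Chinese texts may on mixed-script pages, are where a profile-based distance of this kind would be expected to struggle.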

Cite this article:

Wang Hao, Li Sishu, Deng Sanhong. Study on Text Language Recognition Based on N-Gram. New Technology of Library and Information Service, 2013, (4): 54-61.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2013.04.09     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2013/V/I4/54
