Abstract: This paper implements a language recognition program, based on the N-Gram language model, that automatically identifies the language of texts written in the most popular languages on the Internet: Simplified Chinese, Traditional Chinese, English, French, German, Russian, and Korean. The language recognition experiments are divided into two stages, training on a multilingual corpus and testing of language recognition; the training and test texts come from the Open Directory Project. The program is evaluated in a language recognition test and compared against TextCat, another N-Gram-based language recognition program. The experimental results show that the program performs well on Simplified Chinese, Traditional Chinese, and German, with somewhat lower recognition accuracy on Russian, French, and English; recognition of Korean is consistently disturbed by interference from Chinese in these experiments.
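The two stages named in the abstract (training on a multilingual corpus, then testing) can be illustrated with a minimal sketch of character N-gram language identification. The abstract does not describe the paper's own scoring method, so this sketch assumes the rank-order "out-of-place" distance commonly associated with TextCat. The function names, the toy corpus, and the parameters `n_max` and `top_k` are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Map the top_k most frequent character n-grams (n = 1..n_max) to their rank."""
    counts = Counter()
    # Normalize whitespace and pad so word boundaries show up in the n-grams.
    padded = " " + " ".join(text.lower().split()) + " "
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties by the n-gram itself for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ranked[:top_k])}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from lang_profile get a fixed penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def train(corpus):
    """Training stage: build one n-gram profile per language."""
    return {lang: ngram_profile(text) for lang, text in corpus.items()}


def identify(text, profiles):
    """Testing stage: return the language whose profile is closest to the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training data (an assumption for illustration; the paper uses ODP texts).
TOY_CORPUS = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus",
    "fr": "le renard brun rapide saute par-dessus le chien paresseux et le chat dort dans la maison",
}

profiles = train(TOY_CORPUS)
print(identify("the dog sleeps in the house", profiles))
```

Because the profiles are built from characters rather than words, the same code applies unchanged to Chinese, Korean, or Russian text. Profiles trained on such a small corpus are unreliable, so a realistic run would need far more training text per language.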
王昊, 李思舒, 邓三鸿. 基于N-Gram的文本语种识别研究[J]. 现代图书情报技术, 2013, (4): 54-61.
Wang Hao, Li Sishu, Deng Sanhong. Study on Text Language Recognition Based on N-Gram. New Technology of Library and Information Service, 2013, (4): 54-61.