New Technology of Library and Information Service  2013, Vol. Issue (4): 54-61    DOI: 10.11925/infotech.1003-3513.2013.04.09
Study on Text Language Recognition Based on N-Gram
Wang Hao, Li Sishu, Deng Sanhong
School of Information Management, Nanjing University, Nanjing 210093, China
Abstract  This paper implements a language recognition program, based on the N-Gram language model, that automatically identifies the language of texts written in several of the most popular languages on the Internet: Simplified Chinese, Traditional Chinese, English, French, German, Russian and Korean. The language recognition experiments are divided into two stages, training on a multilingual corpus and testing of language recognition; both the training texts and the testing texts come from the Open Directory Project. The program is evaluated in the language recognition test and is also compared with TextCat, another N-Gram-based language recognition program. The experimental results show that the program performs well on Simplified Chinese, Traditional Chinese and German, with somewhat lower accuracy on Russian, French and English, while Korean is consistently confused with Chinese.
Key words: N-Gram; Language recognition; Corpus; Text classification
Received: 21 March 2013      Published: 17 June 2013
CLC Number: TP391
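
The two-stage procedure summarized in the abstract (training on a multilingual corpus, then testing on unseen text) can be illustrated with a minimal sketch. The abstract does not give the authors' implementation, so the code below is an assumption: it uses the rank-ordered character N-Gram profile and "out-of-place" distance popularized by TextCat. The function names, the `top_k` and `n_max` parameters, and the toy training sentences are all hypothetical and stand in for the Open Directory Project corpus.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Map each of the top_k most frequent character n-grams (n = 1..n_max)
    to its rank, where rank 0 is the most frequent n-gram."""
    # Normalize whitespace and pad with spaces so word boundaries become n-grams.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ranked[:top_k])}


def out_of_place(doc_profile, lang_profile, max_penalty):
    """Sum of rank differences; n-grams missing from the language profile
    are charged max_penalty."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def train(corpus, top_k=300):
    """Training stage: build one profile per language from labelled text."""
    return {lang: ngram_profile(text, top_k=top_k) for lang, text in corpus.items()}


def identify(text, models, top_k=300):
    """Testing stage: return the language whose profile is closest to the text."""
    doc = ngram_profile(text, top_k=top_k)
    return min(models, key=lambda lang: out_of_place(doc, models[lang], top_k))


# Toy training data (hypothetical; a real system would use a large corpus).
corpus = {
    "English": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "Russian": "быстрая коричневая лиса прыгает через ленивую собаку и кошка сидит на коврике",
    "Chinese-simple": "这是一个用于语言识别的简单例子",
}
models = train(corpus)

for sample in ["the dog sat on the mat", "кошка сидит", "这是语言识别"]:
    print(sample, "->", identify(sample, models))
```

A character-level profile of this kind separates languages with different scripts easily. Languages that share many characters need far more training text than this toy corpus provides.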

Cite this article:

Wang Hao, Li Sishu, Deng Sanhong. Study on Text Language Recognition Based on N-Gram. New Technology of Library and Information Service, 2013, (4): 54-61.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2013.04.09     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2013/V/I4/54
