Chinese and Bengali Proper Noun Recognition Based on String Frequency Statistics Model
Kishore Biswas1, Wang Huilin2, Yu Wei2
1. Department of Information Management, Peking University, Beijing 100871, China;
2. Institute of Science & Technology Information of China, Beijing 100038, China
Abstract: This paper implements the String Frequency Statistics algorithm proposed by Nagao to build a Proper Noun Recognition (PNR) system for the Chinese and Bengali languages. First, n-grams are extracted from an untagged input corpus; they are then filtered with the Statistical Substring Reduction (SSR) algorithm to remove redundant substrings. Finally, the multilingual PNR system assigns each n-gram a probability of being a proper noun, based on information about its neighboring words, and outputs the results according to their probability scores. The test results show that the system can effectively recognize the names of people, places, organizations, and institutions in the input text.
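The first two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `count_ngrams` and `reduce_substrings`, the parameters `max_n` and `min_freq`, and the exact-equality reduction rule are all assumptions made for illustration. It counts n-grams by brute force rather than with the sorted-suffix technique of Nagao and Mori, and it uses a quadratic scan rather than a linear-time reduction. The neighbor-based probability scoring is not sketched, because the abstract does not state its formula.

```python
from collections import Counter


def count_ngrams(text, max_n=4, min_freq=2):
    """Count every character n-gram of length 1..max_n (brute force).

    Only n-grams seen at least min_freq times are returned.
    Both parameters are illustrative assumptions, not values from the paper.
    """
    counts = Counter(
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    )
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}


def reduce_substrings(freqs):
    """Drop an n-gram if a longer n-gram contains it with the same frequency.

    Assumed rule: equal frequency means the shorter string never occurs
    outside the longer one, so it carries no extra information.
    This naive version compares all pairs and is quadratic in len(freqs).
    """
    kept = {}
    for gram, freq in freqs.items():
        redundant = any(
            len(other) > len(gram) and gram in other and other_freq == freq
            for other, other_freq in freqs.items()
        )
        if not redundant:
            kept[gram] = freq
    return kept
```

On the toy string `abcXabcYab` with `max_n=3` and `min_freq=2`, the counting step yields six n-grams (`a`, `b`, `c`, `ab`, `bc`, `abc`), and the reduction step keeps only `ab` (frequency 3) and `abc` (frequency 2). The string `ab` survives because it occurs once outside `abc`, which is the kind of evidence a later scoring step could use.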
Received: 03 November 2011
Published: 02 February 2012