New Technology of Library and Information Service  2011, Vol. 27 Issue (12): 31-38    DOI: 10.11925/infotech.1003-3513.2011.12.05
Chinese and Bengali Proper Noun Recognition Based on String Frequency Statistics Model
Kishore Biswas¹, Wang Huilin², Yu Wei²
1. Department of Information Management, Peking University, Beijing 100871, China;
2. Institute of Science & Technology Information of China, Beijing 100038, China
Abstract  This paper implements the string frequency statistics algorithm proposed by Nagao to build a Proper Noun Recognition (PNR) system for the Chinese and Bengali languages. First, n-grams are extracted from an untagged input corpus; they are then filtered with the SSR (statistical substring reduction) algorithm to remove redundant substrings. Finally, the multilingual PNR system assigns each n-gram a probability of being a proper noun, based on information about its neighboring words, and outputs the results according to that probability score. Test results show that the system can effectively recognize the names of people, places, and organizations or institutions in the input text.
Key words: Proper noun recognition; String statistics; Nagao algorithm; SSR algorithm
Received: 03 November 2011      Published: 02 February 2012
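
The first two steps named in the abstract (n-gram frequency counting, then substring reduction) can be illustrated with a minimal sketch. This is not the authors' implementation: it counts n-grams with a plain hash map rather than Nagao's sorted-suffix technique, and it applies the substring-reduction rule with a naive quadratic scan rather than a linear-time algorithm. The function names `count_ngrams` and `reduce_substrings` and the parameters `max_n` and `min_freq` are hypothetical, chosen for illustration only.

```python
from collections import Counter


def count_ngrams(text, max_n=4, min_freq=2):
    """Count every character n-gram (1 <= n <= max_n); keep those seen at least min_freq times."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}


def reduce_substrings(counts):
    """Drop an n-gram if a longer n-gram contains it and has the same frequency.

    Assumption: this equal-frequency rule is the basic form of substring
    reduction; the scan below is quadratic and only suitable for small inputs.
    """
    kept = {}
    for gram, freq in counts.items():
        redundant = any(
            len(other) > len(gram) and gram in other and other_freq == freq
            for other, other_freq in counts.items()
        )
        if not redundant:
            kept[gram] = freq
    return kept


if __name__ == "__main__":
    sample = "北京大学和北京大学"
    print(reduce_substrings(count_ngrams(sample)))  # {'北京大学': 2}
```

In the sample, every shorter fragment such as 北京 or 大学 occurs exactly as often as 北京大学, so only the longest string survives. A fragment that also occurs on its own, and therefore has a higher count than any string containing it, would be kept as a separate candidate.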
CLC number: TP391

Cite this article:

Kishore Biswas, Wang Huilin, Yu Wei. Chinese and Bengali Proper Noun Recognition Based on String Frequency Statistics Model. New Technology of Library and Information Service, 2011, 27(12): 31-38.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2011.12.05     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2011/V27/I12/31

[1] 候志霞, 曹军.自然语言处理的发展概况和前景展望[J].山东外语教学, 2003,18(5):53-55.

[2] 郭艳华, 周昌乐.自然语言理解研究综述[J].杭州电子工业大学学报, 2000,20(1):58-64.

[3] 赵铁军.机器翻译原理[M].哈尔滨:哈尔滨工业大学出版社,2001.

[4] 刘开瑛.中文文本自动分词和标注[M].北京:商务印书馆,2000.

[5] 姚天顺,朱靖波,张舜,等.自然语言理解[M].北京:清华大学出版社,2002.

[6] 吕雅娟,赵铁军,杨沐昀,等.基于分解与动态规划策略的汉语未登录词识别[J].中文信息学报,2001,15(1):28-33.

[7] 罗智勇,宋柔.现代汉语自动分词中专名的一体化、快速识别方法[C].见:国际中文电脑学术会议,新加坡.2001:323-328.

[8] 郑家恒,刘开瑛.自动分词系统中姓氏人名的处理策略探讨[C].见:全国第二届计算机语言学联合学术会议,厦门.北京:北京语言学院出版社,1993.

[9] Tan H Y, Zheng J H, Liu K Y. Research on Method of Automatic Recognition of Chinese Place Name Based on Transformation[J]. Journal of Software, 2001, 12(1): 1608-1613.

[10] Park B R, Hwang Y S, Rim H C. Recognizing Korean Unknown Proper Nouns by Using Automatically Extracted Lexical Clues[C]. In: Proceedings of the 10th Research on Computational Linguistics International Conference. 1997.

[11] Barcala M, Vilares J, Alonso M A, et al. Tokenization and Proper Noun Recognition for Information Retrieval[C]. In: Proceedings of the 3rd International Workshop on Natural Language and Information Systems (NLIS 2002). 2002.

[12] Aone C, Maloney J. Reuse of a Proper Noun Recognition System in Commercial and Operational NLP Application[C]. In: Proceedings of the 5th Workshop on Very Large Corpora. 1997.

[13] 张华平,刘群.基于角色标注的中国人名自动识别研究 [J].计算机学报, 2004,27(1):85-91.

[14] Zhou G D, Su J. Named Entity Recognition Using an HMM-based Chunk Tagger[C]. In: Proceedings of the 40th Annual Meeting of the ACL, Philadelphia. 2002: 473-480.

[15] Borthwick A. A Maximum Entropy Approach to Named Entity Recognition[D]. New York: New York University, 1999.

[16] 秦文,苑春法.基于决策树的汉语未登录词识别[J].中文信息学报, 2004,18(1):14-19.

[17] Takeuchi K, Collier N. Use of Support Vector Machines in Extended Named Entity Recognition[C]. In: Proceedings of the 6th Conference on Natural Language Learning, Taipei, China. 2002: 119-125.

[18] 李丽双,黄德根,陈春容.用支持向量机进行中文地名识别的研究[J].小型微型计算机系统, 2004,26(8):1416-1419.

[19] 郑荣廷, 李楠, 吉久明,等.中文化学物质名称识别研究[J].现代图书情报技术, 2010(6):48-52.

[20] 宋柔,朱宏.基于语料库和规则库的人名识别法[C].见:全国第二届计算机语言学联合学术会议,厦门.北京:北京语言学院出版社,1993:150,154.

[21] 黄德根,马玉霞,杨元生. 基于互信息的中文姓名识别方法[J].大连理工大学学报,2004,44(5):744-748.

[22] 孙茂松, 黄昌宁, 高海燕, 等. 中文姓名的自动辨识[J].中文信息学报, 1994,9(2):16-27.

[23] 王振华, 孙祥龙, 陆汝占, 等. 结合决策树方法的中文姓名识别[J].中文信息学报,2004,18(6): 10-15.

[24] 谭红叶,郑家恒,刘开瑛.中国地名自动识别系统的设计与实现[J].计算机工程, 2002,28(8):128-129.

[25] 张辉, 徐健.中国组织结构名自动识别系统的设计与实现[J].电脑开发与应用,2002,15(1):5-6.

[26] 郑家恒, 张辉.基于HMM的中国组织机构名自动识别[J].计算机应用,2002,22(1):1-2.

[27] 胡乃全, 朱巧明, 周国栋.混合的汉语基本名词短语识别方法[J].计算机工程,2009, 35(20):199-201.

[28] Chaudhuri Bidyut Baran, Bhattacharya Suvankar. An Experiment on Automatic Detection of Named Entities in Bangla[C]. In: Proceedings of the IJCNLP-08 Workshop on NER for South and South East Asian Languages, Hyderabad, India. 2008: 1-58.

[29] Ekbal A, Naskar S, Bandopadhyay S. Named Entity Recognition and Transliteration in Bengali[J]. Special Issue of Linguistics Investigation Journal, 2007, 30(1): 95-114.

[30] Nagao M, Mori S. A New Method of N-gram Statistics for Large Number of N and Automatic Extraction of Words and Phrases from Large Text Data of Japanese[C]. In: Proceedings of the 15th Conference on Computational Linguistics.1994.

[31] Lv X Q, Zhang L, Hu J F. Statistical Substring Reduction in Linear Time[C]. In: Proceedings of the Natural Language Processing-IJCNLP.2005:320-327.