New Technology of Library and Information Service  2011, Vol. 27 Issue (12): 31-38    DOI: 10.11925/infotech.1003-3513.2011.12.05
Chinese and Bengali Proper Noun Recognition Based on String Frequency Statistics Model
Kishore Biswas1, Wang Huilin2, Yu Wei2
1. Department of Information Management, Peking University, Beijing 100871, China;
2. Institute of Science & Technology Information of China, Beijing 100038, China
Abstract  This paper implements the String Frequency Statistics Algorithm proposed by Nagao to build a Proper Noun Recognition (PNR) system for Chinese and Bengali. First, n-grams are extracted from an untagged input corpus; they are then filtered with the SSR algorithm to remove redundant substrings. Finally, the multilingual PNR system assigns each n-gram a probability of being a proper noun, based on information about its neighboring words, and outputs the results according to their probability scores. The test results show that the system can effectively recognize the names of people, places, and organizations or institutions in the input text.
Key words: Proper noun recognition; String statistics; Nagao algorithm; SSR algorithm
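
The first two steps named in the abstract, n-gram extraction and redundant-substring filtering, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `count_ngrams` and `reduce_substrings`, the length bounds, the frequency threshold, and the sample sentence are all assumptions made for illustration. Counting uses a plain hash map rather than the sorted-suffix structure of Nagao's method, and the reduction is a naive quadratic pass that drops an n-gram when a longer n-gram containing it has the same frequency. The neighbor-based probability score is left out because the abstract does not give its form.

```python
from collections import Counter


def count_ngrams(text, min_n=2, max_n=4):
    """Count every character n-gram with min_n <= n <= max_n (illustrative bounds)."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def reduce_substrings(counts, min_freq=2):
    """Drop rare n-grams, then drop any n-gram contained in a longer
    n-gram that has exactly the same frequency (naive O(k^2) pass)."""
    frequent = {g: c for g, c in counts.items() if c >= min_freq}
    kept = {}
    for gram, freq in frequent.items():
        redundant = any(
            len(other) > len(gram) and gram in other and other_freq == freq
            for other, other_freq in frequent.items()
        )
        if not redundant:
            kept[gram] = freq
    return kept


# Hypothetical sample text, chosen only to show the filtering effect.
text = "北京大学在北京，北京大学很大"
candidates = reduce_substrings(count_ngrams(text))
print(candidates)  # {'北京': 3, '北京大学': 2}
```

In this toy input, 京大, 大学, 北京大, and 京大学 each occur twice, exactly as often as 北京大学, so they are discarded as redundant fragments. 北京 survives because it occurs three times, which suggests it is also used independently of the longer string.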
Received: 03 November 2011      Published: 02 February 2012
CLC number: TP391

Cite this article:

Kishore Biswas, Wang Huilin, Yu Wei. Chinese and Bengali Proper Noun Recognition Based on String Frequency Statistics Model. New Technology of Library and Information Service, 2011, 27(12): 31-38.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2011.12.05     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2011/V27/I12/31
