Chinese and Bengali Proper Noun Recognition Based on String Frequency Statistics Model

New Technology of Library and Information Service (现代图书情报技术)

2011, Vol. 27, Issue (12): 31-38

https://doi.org/10.11925/infotech.1003-3513.2011.12.05

Knowledge Organization and Knowledge Management

Kishore Biswas (柯修)¹, Wang Huilin (王惠临)², Yu Wei (于薇)²

1. Department of Information Management, Peking University, Beijing 100871, China;
2. Institute of Science & Technology Information of China, Beijing 100038, China

Abstract: This paper applies the string frequency statistics algorithm proposed by Nagao to proper noun recognition (PNR) for Chinese and Bengali. First, n-grams are extracted from Chinese and Bengali corpora that have not been part-of-speech tagged. Redundant substrings are then filtered out with an improved SSR (statistical substring reduction) algorithm. Next, the characters adjacent to each string are used to compute the probability that each n-gram is a proper noun, and candidates are selected according to this score. Finally, a cross-language PNR system based on string frequency statistics is implemented. Experiments show that the system can effectively recognize names of people, places, and organizations or institutions in raw, unannotated input text.

Key words: proper noun recognition; string frequency statistics; Nagao algorithm; SSR algorithm
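The pipeline summarized in the abstract (n-gram counting, substring reduction, scoring by adjacent characters) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions: the function names `count_ngrams`, `reduce_substrings`, `neighbor_entropy`, and `rank_candidates` are hypothetical; brute-force counting stands in for the sorted suffix table used by Nagao's method; the equal-frequency rule is the basic form of substring reduction rather than the improved variant the paper mentions; and the entropy of neighboring characters is only a stand-in for the paper's probability formula, which the abstract does not state.

```python
from collections import Counter
import math


def count_ngrams(text, max_n=4):
    """Count every character n-gram of length 1..max_n.

    Brute-force stand-in (assumption): Nagao's method derives the same
    counts from a sorted table of suffix pointers, which scales better.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def reduce_substrings(counts, min_freq=2):
    """Drop rare strings, then drop every string whose frequency equals
    that of a kept string one character longer that contains it."""
    kept = {s: f for s, f in counts.items() if f >= min_freq}
    redundant = set()
    for s, f in kept.items():
        if len(s) < 2:
            continue
        for sub in (s[:-1], s[1:]):  # both substrings one char shorter
            if kept.get(sub) == f:
                redundant.add(sub)
    return {s: f for s, f in kept.items() if s not in redundant}


def neighbor_entropy(text, candidate):
    """Score a candidate by the variety of characters adjacent to it.

    Assumed scoring rule: the smaller of the left and right neighbor
    entropies. A string that always continues the same way scores 0.
    """
    left, right = Counter(), Counter()
    start = text.find(candidate)
    while start != -1:
        end = start + len(candidate)
        left[text[start - 1] if start > 0 else "<s>"] += 1
        right[text[end] if end < len(text) else "</s>"] += 1
        start = text.find(candidate, start + 1)

    def entropy(neighbors):
        total = sum(neighbors.values())
        return -sum(v / total * math.log2(v / total)
                    for v in neighbors.values())

    return min(entropy(left), entropy(right))


def rank_candidates(text, max_n=4, min_freq=2):
    """Return (string, frequency, score) tuples, best score first."""
    kept = reduce_substrings(count_ngrams(text, max_n), min_freq)
    scored = [(s, f, neighbor_entropy(text, s)) for s, f in kept.items()]
    return sorted(scored, key=lambda item: (-item[2], -item[1], item[0]))


if __name__ == "__main__":
    sample = "王惠临到北京。王惠临在上海。于薇见王惠临!"
    for string, freq, score in rank_candidates(sample):
        print(string, freq, round(score, 3))
```

Because the sketch works on raw characters, it needs no word segmentation or part-of-speech tags, which is what lets the same code run on Chinese and Bengali text. A real system would also have to filter punctuation and frequent function words, which this sketch does not attempt.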

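Received: 2011-11-03; Published: 2012-02-02

CLC number: TP391

Funding:

This work is one of the research outcomes of the discipline construction project "Natural Language Processing" (project No. XK2011-6) of the Institute of Science & Technology Information of China.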

Cite this article:

Kishore Biswas (柯修), Wang Huilin (王惠临), Yu Wei (于薇). Chinese and Bengali Proper Noun Recognition Based on String Frequency Statistics Model[J]. New Technology of Library and Information Service (现代图书情报技术), 2011, 27(12): 31-38.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2011.12.05 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2011/V27/I12/31
