New Technology of Library and Information Service  2012, Vol. 28, Issue 3: 27-34    DOI: 10.11925/infotech.1003-3513.2012.03.05
Knowledge Organization and Knowledge Management
Contrast Analysis of Methods and Tools for Lemmatization
Wu Sizhu, Qian Qing, Hu Tiejun, Li Danya, Li Junlian, Hong Na
Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100020, China
Abstract: Combining theory and experiment, this paper compares the lemmatization methods and tools used for word-form normalization. It summarizes the main categories of existing lemmatization methods and analyzes the characteristics and shortcomings of each category. It then introduces seven lemmatization tools and compares them in terms of implementation principle, POS tagger, lexicon, programming language, languages handled, and whether a spell checker is provided. Five of these tools are selected for a lemmatization experiment using the standard data of WordSmith Tools. Analysis of the experimental results shows the strengths and weaknesses of each tool and finds that the lemmatizer in the Specialist NLP Tools performs better than the others. The paper offers a reference for researchers choosing an appropriate lemmatization method and tool.
Keywords: Word normalization; Stemming; Lemmatization; Lemma
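The abstract compares tools by their implementation principle and lexicon and then scores them against standard data. As a rough illustration of that kind of evaluation, the sketch below pairs a toy lexicon lookup with a few suffix rules and measures accuracy on a small gold list. Everything in it (the `IRREGULAR` and `KNOWN_LEMMAS` tables, `SUFFIX_RULES`, the `lemmatize` and `accuracy` functions, and the `GOLD` pairs) is hypothetical and invented for this example; it is not taken from the paper, from WordSmith Tools, or from any of the tools it evaluates.

```python
# Toy illustration only: lexicon, rules, and gold pairs are invented for this sketch.

# Irregular forms that no suffix rule can recover.
IRREGULAR = {"went": "go", "mice": "mouse", "better": "good", "was": "be"}

# Lemmas the toy lexicon knows about; rule output is accepted only if it lands here.
KNOWN_LEMMAS = {"study", "walk", "box", "cat", "go", "mouse", "good", "be"}

# (suffix, replacement) pairs, tried in order with longer suffixes first.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("es", ""), ("s", "")]


def lemmatize(word):
    """Look the word up in the lexicon, then fall back to suffix rules."""
    token = word.lower()
    if token in IRREGULAR:
        return IRREGULAR[token]
    if token in KNOWN_LEMMAS:
        return token
    for suffix, replacement in SUFFIX_RULES:
        if token.endswith(suffix):
            candidate = token[: len(token) - len(suffix)] + replacement
            if candidate in KNOWN_LEMMAS:
                return candidate
    return token  # out-of-vocabulary word: left unchanged


def accuracy(gold_pairs):
    """Fraction of (word form, expected lemma) pairs that lemmatize() gets right."""
    correct = sum(1 for form, lemma in gold_pairs if lemmatize(form) == lemma)
    return correct / len(gold_pairs)


# Hypothetical gold standard: (word form, expected lemma).
GOLD = [
    ("studies", "study"),
    ("walking", "walk"),
    ("boxes", "box"),
    ("went", "go"),
    ("mice", "mouse"),
    ("running", "run"),  # missed: "run" is not in the toy lexicon
]

if __name__ == "__main__":
    for form, lemma in GOLD:
        print(form, "->", lemmatize(form), "(expected", lemma + ")")
    print("accuracy:", accuracy(GOLD))
```

The deliberate miss on "running" shows why lexicon coverage and the handling of out-of-vocabulary words are among the aspects on which such tools can differ.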
Received: 2012-01-12
CLC number: G350
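Funding: This work is one of the research outcomes of the National Science and Technology Support Program of the 12th Five-Year Plan project "Development of Collaborative Working Systems and Auxiliary Tools for Scientific and Technical Knowledge Organization Systems" (Grant No. 2011BAH10B02) and of the Fundamental Research Funds for Central Public Welfare Research Institutes project of the Institute of Medical Information, Chinese Academy of Medical Sciences, "Research on Methods for Constructing a Medical Text Representation Model Based on Language Networks" (Grant No. 11R0209).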

Cite this article:
Wu Sizhu, Qian Qing, Hu Tiejun, Li Danya, Li Junlian, Hong Na. Contrast Analysis of Methods and Tools for Lemmatization[J]. New Technology of Library and Information Service, 2012, 28(3): 27-34. DOI: 10.11925/infotech.1003-3513.2012.03.05.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2012.03.05