New Technology of Library and Information Service  2012, Vol. 28 Issue (2): 41-47    DOI: 10.11925/infotech.1003-3513.2012.02.07
Knowledge Organization and Knowledge Management
Research on Chinese New Word Recognition in Specialized Field Based on N-Gram (基于N-Gram的专业领域中文新词识别研究)
Duan Yufeng (段宇锋), Ju Fei (鞠菲)
Business School, East China Normal University, Shanghai 200241, China
Abstract: This paper explores automatic new word recognition in specialized fields, taking botany as the sample field. A set of 200 plant description documents randomly drawn from "Flora of China" (《中国植物志》) serves as the sample set. First, new word candidates are extracted with N-Gram statistics over the word segmentation produced by ICTCLAS. The candidates are then ranked separately by term frequency (TF), document frequency (D), and average term frequency (TF/D), and the candidates falling within a certain range of each ranking are taken as the recognized new words. The experiments show that filtering candidates by TF gives the best recognition performance, with an F value of 0.65. The method can automatically generate a user dictionary for a specialized field and is highly portable.
Key words: N-Gram; New word recognition; Term frequency
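The pipeline summarized in the abstract (N-Gram candidates over segmented text, ranking by TF, D, or TF/D, evaluation by F value) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the choice of n, the tie-breaking rule, and the toy tokens are all assumptions, and the input is assumed to be already segmented (the paper uses ICTCLAS for that step, which is not called here).

```python
from collections import Counter

def ngram_candidates(docs, max_n=2):
    """Count n-grams (2 <= n <= max_n) of adjacent segmented tokens.

    docs: list of documents, each a list of tokens from a word segmenter.
    Returns (tf, df): total occurrences of each candidate string, and the
    number of documents in which it appears.
    """
    tf, df = Counter(), Counter()
    for tokens in docs:
        seen = set()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                cand = "".join(tokens[i:i + n])
                tf[cand] += 1
                seen.add(cand)
        df.update(seen)  # each document counts once per candidate
    return tf, df

def rank(tf, df, key="tf"):
    """Sort candidates by TF, D, or TF/D (descending); ties break by string."""
    score = {
        "tf": lambda c: tf[c],
        "df": lambda c: df[c],
        "avg_tf": lambda c: tf[c] / df[c],
    }[key]
    return sorted(tf, key=lambda c: (-score(c), c))

def f_measure(predicted, gold):
    """Harmonic mean of precision and recall over two sets of words."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy input: three tiny "documents" of pre-segmented tokens.
docs = [
    ["叶", "片", "卵", "形", "叶", "片"],
    ["叶", "片", "披针", "形"],
    ["花", "冠", "白色"],
]
tf, df = ngram_candidates(docs)
top_by_tf = rank(tf, df, "tf")[:1]
```

Taking the top k candidates of a ranking as the predicted new words and comparing them with a manually built gold list via `f_measure` mirrors the evaluation described in the abstract; the cutoff k and the gold list would have to come from the actual experiment.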
Received: 2011-12-12
Chinese Library Classification (CLC) number: G350
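Funding:

This paper is one of the research outcomes of the Ministry of Education Humanities and Social Sciences Research Youth Fund project "Research on Web-based Chinese Academic Information Extraction Based on Deep Semantic Annotation: A Case Study of Biodiversity Description" (Project No. 10YJC870004).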

Cite this article:
Duan Yufeng, Ju Fei. Research on Chinese New Word Recognition in Specialized Field Based on N-Gram[J]. New Technology of Library and Information Service, 2012, 28(2): 41-47. DOI:10.11925/infotech.1003-3513.2012.02.07.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2012.02.07