New Technology of Library and Information Service  2013, Vol. 29 Issue (1): 15-21    DOI: 10.11925/infotech.1003-3513.2013.01.03
The Study on Out-of-vocabulary Identification of Chinese Biomedical Field Based on Hybrid Method
Sun Haixia1, Li Junlian1, Wu Yingjie1, Wu Suhui2
1. Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100020, China;
2. Department of Information Management, Nanjing University, Nanjing 210093, China
Abstract  This paper first briefly reviews research on automatic out-of-vocabulary (OOV) word identification. Then, drawing on the word-length distribution and morphological characteristics of Chinese biomedical terms, it proposes a hybrid method for identifying OOV words in the Chinese biomedical field. The method is based on N-grams and integrates domain-dictionary-based, filtered-corpus-based, and rule-based techniques. Finally, the authors evaluate the hybrid method on a sample of pharmaceutical journal records from the Chinese BioMedical Literature Database; the experimental results show good performance.
Key words: Out-of-vocabulary, N-gram, Hybrid method, Biomedical
Received: 17 December 2012      Published: 29 March 2013
CLC Number: TP393
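
The abstract names the components of the hybrid method (N-gram candidates, a domain dictionary, a filtered corpus, and rules) but gives no implementation details. The following is only a minimal sketch of how such a pipeline might be assembled, not the authors' algorithm. The function name `extract_oov_candidates`, the parameters `min_n`, `max_n`, and `min_freq`, the stop-word filter, and the substring-pruning rule are all assumptions made for illustration.

```python
import re
from collections import Counter


def extract_oov_candidates(texts, dictionary, stop_words,
                           min_n=2, max_n=4, min_freq=2):
    """Illustrative OOV candidate extraction (hypothetical, not the paper's method).

    texts      : iterable of raw Chinese strings
    dictionary : set of known domain terms (assumed input)
    stop_words : set of stop words or characters used as a filter (assumed input)
    """
    counts = Counter()
    for text in texts:
        # Keep only runs of CJK characters so that no n-gram spans
        # punctuation, digits, or Latin text.
        for segment in re.findall(r"[\u4e00-\u9fff]+", text):
            for n in range(min_n, max_n + 1):
                for i in range(len(segment) - n + 1):
                    counts[segment[i:i + n]] += 1

    candidates = {}
    for gram, freq in counts.items():
        if freq < min_freq:
            continue  # statistical filter: too rare to be a stable word
        if gram in dictionary:
            continue  # dictionary filter: already a known term
        if any(sw in gram for sw in stop_words):
            continue  # corpus filter: contains a stop word
        candidates[gram] = freq

    # Assumed rule: drop a candidate that is a substring of a longer
    # candidate with the same frequency, since it never occurs on its own.
    return {
        g: f for g, f in candidates.items()
        if not any(g != h and g in h and f == candidates[h] for h in candidates)
    }


if __name__ == "__main__":
    sample = ["阿司匹林的作用", "服用阿司匹林"]
    print(extract_oov_candidates(sample, {"作用", "服用"}, {"的"}))
```

In this sketch `max_n` stands in for the word-length distribution mentioned in the abstract: it would be set from the observed lengths of terms in the target domain. The morphological rules the paper refers to would replace or extend the single pruning rule shown here.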

Cite this article:

Sun Haixia, Li Junlian, Wu Yingjie, Wu Suhui. The Study on Out-of-vocabulary Identification of Chinese Biomedical Field Based on Hybrid Method. New Technology of Library and Information Service, 2013, 29(1): 15-21.

