The Study on Out-of-vocabulary Identification of Chinese Biomedical Field Based on Hybrid Method
Sun Haixia1, Li Junlian1, Wu Yingjie1, Wu Suhui2
1. Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100020, China; 2. Department of Information Management, Nanjing University, Nanjing 210093, China
Abstract This paper first briefly reviews the state of research on automatic out-of-vocabulary (OOV) word identification. Then, drawing on the word length distribution and morphological characteristics of the Chinese biomedical field, it presents a hybrid OOV identification method for that field: the method is based on N-grams and integrates domain dictionary-based, filtered corpus-based, and rule-based techniques. Finally, the authors test the proposed hybrid method on a sample set of pharmaceutical journal records from the Chinese BioMedical Literature Database and report good performance.
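The abstract does not spell out how the four components are combined, so the following is only a minimal sketch of one plausible arrangement, not the authors' implementation. It assumes character N-grams of length 2 to 4, a frequency threshold, a domain dictionary used to discard already-known words, a stop-character set standing in for corpus filtering, and a single illustrative rule that drops a candidate nested inside a longer candidate of equal frequency. The function name, parameters, thresholds, and the nesting rule are all hypothetical.

```python
from collections import Counter


def candidate_oov_terms(sentences, dictionary, stop_chars,
                        min_n=2, max_n=4, min_freq=2):
    """Return {ngram: frequency} for likely out-of-vocabulary terms.

    All parameters and thresholds are illustrative assumptions.
    """
    # Step 1 (N-gram): count every character n-gram of length min_n..max_n.
    counts = Counter()
    for sentence in sentences:
        for n in range(min_n, max_n + 1):
            for i in range(len(sentence) - n + 1):
                counts[sentence[i:i + n]] += 1

    candidates = {}
    for gram, freq in counts.items():
        # Keep only strings that recur in the corpus.
        if freq < min_freq:
            continue
        # Step 2 (dictionary): a word already in the dictionary is not OOV.
        if gram in dictionary:
            continue
        # Step 3 (filtering): discard strings containing stop characters.
        if any(ch in stop_chars for ch in gram):
            continue
        candidates[gram] = freq

    # Step 4 (rule, assumed): drop a candidate that is nested inside a
    # longer candidate with the same frequency, since it is probably a
    # fragment of that longer term.
    kept = {}
    for gram, freq in candidates.items():
        nested = any(
            gram != other and gram in other and freq == other_freq
            for other, other_freq in candidates.items()
        )
        if not nested:
            kept[gram] = freq
    return kept
```

A call such as `candidate_oov_terms(corpus_sentences, dictionary=known_words, stop_chars=stop_list)` would return the surviving candidates with their counts; in practice the length range and frequency threshold would need tuning to the word length distribution of the target corpus.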
Received: 17 December 2012
Published: 29 March 2013