Study on Semantic Markup of Species Description Text in Chinese Based on Auto-learning Rules
Duan Yufeng1, Hei Zhenzhen1, Ju Fei1, Cui Hong2
1. Business School, East China Normal University, Shanghai 200241, China;
2. School of Information Resource & Library Science, University of Arizona, Tucson 85719, USA
Abstract: This paper applies an algorithm that automatically learns rules and combines them with leading words to semantically mark up species description text in Chinese, using a data set of 1,000 documents randomly selected from Flora of China. Experimental results indicate that the overall markup performance (F value) of the rule-based algorithm designed in this study reaches 0.930, with the F values of most elements falling in the range 0.724-0.964. The algorithm outperforms the Naive Bayesian categorization algorithm, and the results also show that leading words help optimize the algorithm.
Duan Yufeng, Hei Zhenzhen, Ju Fei, Cui Hong. Study on Semantic Markup of Species Description Text in Chinese Based on Auto-learning Rules. New Technology of Library and Information Service, 2012, 28(5): 41-47.