[Objective] To extract information from Chinese-language description text on plant species diversity. [Methods] Using the plant species diversity domain ontology as the foundation, the study adopts a strategy of stepwise selection and annotation at the paragraph, sentence and concept levels. [Results] The scheme was tested on a sample containing 4 734 information points. Extraction precision, recall and F-measure reached 0.86, 0.85 and 0.85, respectively. [Limitations] The rule set needs further improvement to resolve the remaining problems in extracting information from description text. [Conclusions] The proposed scheme can effectively extract information from Chinese plant species diversity description text.
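The three reported scores can be cross-checked against each other. The sketch below assumes that the first reported value is precision and that the F-measure is the balanced (beta = 1) harmonic mean of precision and recall. The abstract does not state the weighting, so the `beta` parameter and the function name `f_measure` are illustrative assumptions, not the authors' implementation.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F-beta).

    beta = 1.0 gives the balanced F1 score; this weighting is an
    assumption, since the abstract does not specify it.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both scores are zero: define the F-measure as zero.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Values reported in the abstract.
reported_precision = 0.86
reported_recall = 0.85

f1 = f_measure(reported_precision, reported_recall)
print(round(f1, 2))  # 0.85, matching the reported F-measure
```

Under these assumptions the computed value rounds to the reported 0.85, so the three figures are mutually consistent.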
Yufeng Duan, Sisi Huang. Information Extraction from Chinese Plant Species Diversity Description Text[J]. New Technology of Library and Information Service, 2016, 32(1): 87-96.