Semantic Annotation of Species Description Text in Chinese by Combining Naïve Bayes Algorithm with Bootstrapping Method
Duan Yufeng1, Zhu Wenjing2, Chen Qiao1, Cui Hong3
1 Business School, East China Normal University, Shanghai 200241, China;
2 Institute of Scientific and Technical Information of Shanghai, Shanghai Library, Shanghai 200031, China;
3 School of Information Resources and Library Science, University of Arizona, Tucson, AZ 85719, USA
[Objective] To reduce the cost of machine learning in the semantic annotation of Chinese species description text by reducing the size of the training dataset. [Methods] Building on the Bootstrapping method, the study designs a weakly supervised learning method that starts from a small amount of data and performs learning and tagging iteratively. Each iteration expands the knowledge base, which continuously improves annotation ability. [Results] The average F-value reaches 0.9112 on a dataset of 15,041 sentences. [Limitations] Annotation efficiency may be relatively low on sparse data. [Conclusions] The experimental data show that the algorithm not only dramatically reduces the amount of training data that machine learning requires, but also increases annotation efficiency.
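The iterative learn-and-tag loop described under [Methods] can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no features, confidence threshold, or stopping rule, so the `threshold` and `max_rounds` parameters, the function names, the English tokens, and the two labels below are all assumptions made for illustration. The sketch trains a small multinomial Naïve Bayes model on the seed sentences, tags the remaining sentences, moves confident predictions into the knowledge base, and repeats until no further sentence can be tagged.

```python
import math
from collections import Counter, defaultdict


def train(labeled):
    """Fit a multinomial Naive Bayes model from (tokens, label) pairs."""
    label_counts = Counter(label for _, label in labeled)
    token_counts = defaultdict(Counter)
    for tokens, label in labeled:
        token_counts[label].update(tokens)
    vocab = {t for counts in token_counts.values() for t in counts}
    return label_counts, token_counts, vocab


def predict(model, tokens):
    """Return (best_label, posterior) using add-one smoothing."""
    label_counts, token_counts, vocab = model
    total = sum(label_counts.values())
    log_scores = {}
    for label, n in label_counts.items():
        denom = sum(token_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for t in tokens:
            if t in vocab:  # tokens unseen in training are ignored
                score += math.log((token_counts[label][t] + 1) / denom)
        log_scores[label] = score
    # Turn log scores into normalised posterior probabilities.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


def bootstrap(seed, unlabeled, threshold=0.7, max_rounds=10):
    """Alternate learning and tagging; threshold/max_rounds are assumed."""
    labeled = list(seed)        # the growing knowledge base
    pending = list(unlabeled)   # sentences not yet tagged
    for _ in range(max_rounds):
        if not pending:
            break
        model = train(labeled)  # learn from everything tagged so far
        still_pending = []
        for tokens in pending:
            label, confidence = predict(model, tokens)
            if confidence >= threshold:
                labeled.append((tokens, label))
            else:
                still_pending.append(tokens)
        if len(still_pending) == len(pending):
            break               # no progress in this round
        pending = still_pending
    return labeled, pending


# Hypothetical seed data and untagged sentences (already tokenised).
seed = [(["leaf", "blade", "ovate"], "leaf"),
        (["petal", "white"], "flower")]
unlabeled = [["leaf", "ovate", "margin"],
             ["petal", "white", "margin"],
             ["margin", "entire"]]
labeled, pending = bootstrap(seed, unlabeled)
```

In this toy run the sentence containing only "margin" and "entire" stays untagged because neither label reaches the assumed threshold, which mirrors the sparse-data limitation noted in the abstract.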
段宇锋, 朱雯晶, 陈巧, 崔红. 朴素贝叶斯算法与Bootstrapping方法相结合的中文物种描述文本语义标注研究[J]. 现代图书情报技术, 2014, 30(5): 83-89.
Duan Yufeng, Zhu Wenjing, Chen Qiao, Cui Hong. Semantic Annotation of Species Description Text in Chinese by Combining Naïve Bayes Algorithm with Bootstrapping Method. New Technology of Library and Information Service, 2014, 30(5): 83-89.