Analysis of the Factors Affecting the Performance of CRF-based Keywords Extraction Model
Zhang Chengmin1,2 Xu Xin3 Zhang Chengzhi4,5
1(Department of Information Management, Nanjing University, Nanjing 210093, China) 2(Library of China Pharmaceutical University, Nanjing 210009, China) 3(Department of Informatics, East China Normal University, Shanghai 200241, China) 4(Department of Information Management, Nanjing University of Science & Technology, Nanjing 210094, China) 5(Institute of Scientific & Technical Information of China, Beijing 100038, China)
The conditional random field (CRF) model can exploit the features of documents more fully and effectively. A CRF-based keyword extraction method is proposed and implemented, and the factors affecting the performance of the CRF-based keyword extraction model are analyzed. These factors are the performance of text segmentation, the scale of the training corpus, the number of features, and the parameter settings of the CRF model.
Zhang Chengmin, Xu Xin, Zhang Chengzhi. Analysis of the Factors Affecting the Performance of CRF-based Keywords Extraction Model[J]. New Technology of Library and Information Service, 2008, 24(6): 34-40.