New Technology of Library and Information Service  2008, Vol. 24 Issue (6): 34-40    DOI: 10.11925/infotech.1003-3513.2008.06.07
Analysis of the Factors Affecting the Performance of CRF-based Keywords Extraction Model
Zhang Chengmin 1,2   Xu Xin 3   Zhang Chengzhi 4,5
1 (Department of Information Management, Nanjing University, Nanjing 210093, China)
2 (Library of China Pharmaceutical University, Nanjing 210009, China)
3 (Department of Informatics, East China Normal University, Shanghai 200241, China)
4 (Department of Information Management, Nanjing University of Science & Technology, Nanjing 210094, China)
5 (Institute of Scientific & Technical Information of China, Beijing 100038, China)
Abstract  

Conditional random field (CRF) models can make fuller and more effective use of document features. A CRF-based keyword extraction method is proposed and implemented, and the factors affecting its performance are analyzed: the quality of text segmentation, the size of the training corpus, the number of features, and the parameter settings of the CRF model.
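
To illustrate how keyword extraction can be cast as the kind of sequence labeling a CRF handles, here is a minimal sketch in Python. The abstract does not state the tag set or the features actually used, so the B/I/O tags, the toy token-length feature, and the helper names `label_tokens` and `to_columns` are all assumptions made for illustration. The output uses a one-token-per-line column layout with the label in the last column, a layout commonly accepted by CRF toolkits.

```python
from typing import List, Set, Tuple


def label_tokens(tokens: List[str], keywords: Set[Tuple[str, ...]]) -> List[str]:
    """Tag each token as B (keyword start), I (inside keyword), or O (outside).

    The B/I/O scheme is an assumption; the abstract does not give the tag set.
    """
    labels = ["O"] * len(tokens)
    # Match longer keywords first so a long keyword is not split by a shorter one.
    for kw in sorted(keywords, key=len, reverse=True):
        n = len(kw)
        if n == 0:
            continue  # ignore empty keywords
        for i in range(len(tokens) - n + 1):
            window_free = all(lab == "O" for lab in labels[i:i + n])
            if window_free and tuple(tokens[i:i + n]) == kw:
                labels[i] = "B"
                for j in range(i + 1, i + n):
                    labels[j] = "I"
    return labels


def to_columns(tokens: List[str], labels: List[str]) -> str:
    """Emit one line per token: token, toy feature (token length), label."""
    return "\n".join(f"{t}\t{len(t)}\t{lab}" for t, lab in zip(tokens, labels))


# Hypothetical, already-segmented sentence and its author-assigned keywords.
tokens = ["conditional", "random", "fields", "label", "sequence", "data"]
keywords = {("conditional", "random", "fields"), ("sequence",)}

labels = label_tokens(tokens, keywords)
print(to_columns(tokens, labels))
```

In this framing, each factor named in the abstract maps onto one part of the pipeline: segmentation quality determines the `tokens`, corpus size determines how many labeled sequences are available, the number of features determines how many columns sit between the token and the label, and the CRF parameters are set at training time.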

Key words: Automatic indexing; Keywords extraction; Conditional random fields; Machine learning
Received: 31 January 2008      Published: 25 June 2008
CLC Number: TP391; G252
Corresponding Author: Zhang Chengmin     E-mail: zhangchengmin@gmail.com
About authors: Zhang Chengmin, Xu Xin, Zhang Chengzhi

Cite this article:

Zhang Chengmin, Xu Xin, Zhang Chengzhi. Analysis of the Factors Affecting the Performance of CRF-based Keywords Extraction Model. New Technology of Library and Information Service, 2008, 24(6): 34-40.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2008.06.07     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2008/V24/I6/34

