Analysis of the Factors Affecting the Performance of CRF-based Keywords Extraction Model

New Technology of Library and Information Service (现代图书情报技术)

2008, Vol. 24, Issue 6: 34-40

https://doi.org/10.11925/infotech.1003-3513.2008.06.07

Knowledge Organization and Knowledge Management


Zhang Chengmin (章成敏)¹,², Xu Xin (许鑫)³, Zhang Chengzhi (章成志)⁴,⁵

¹(Department of Information Management, Nanjing University, Nanjing 210093, China)
²(Library of China Pharmaceutical University, Nanjing 210009, China)
³(Department of Informatics, East China Normal University, Shanghai 200241, China)
⁴(Department of Information Management, Nanjing University of Science & Technology, Nanjing 210094, China)
⁵(Institute of Scientific & Technical Information of China, Beijing 100038, China)

Abstract

The conditional random field (CRF) model can use the features of documents more fully and effectively. A CRF-based keyword extraction (automatic indexing) method is proposed and implemented, and the factors that affect the model's indexing performance are examined through experiments and analysis. These factors are the performance of text segmentation, the size of the training corpus, the number of features, and the parameter settings of the CRF model itself.

Keywords: automatic indexing, keyword extraction, conditional random fields, machine learning
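To make the setup in the abstract concrete, the following is a minimal sketch of how segmented text might be turned into a labeled sequence for CRF training. It is an illustration under stated assumptions, not the authors' procedure: the B/I/O label set, the part-of-speech column, and the function names `label_tokens` and `to_crf_rows` are all hypothetical, since this page does not give the paper's label scheme or feature set. The only format detail relied on is the common column layout for sequence-labeling toolkits (one token per line, the label in the last column).

```python
from typing import List, Sequence


def label_tokens(tokens: Sequence[str], keywords: Sequence[Sequence[str]]) -> List[str]:
    """Tag each token as B (begins a keyword), I (inside one), or O (outside).

    The B/I/O scheme is an assumption made for illustration only.
    """
    labels = ["O"] * len(tokens)
    # Match longer keywords first so they take priority over shorter ones.
    for kw in sorted(keywords, key=len, reverse=True):
        n = len(kw)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            window_free = all(lab == "O" for lab in labels[i:i + n])
            if window_free and list(tokens[i:i + n]) == list(kw):
                labels[i] = "B"
                for j in range(i + 1, i + n):
                    labels[j] = "I"
    return labels


def to_crf_rows(tokens: Sequence[str], pos_tags: Sequence[str], labels: Sequence[str]) -> List[str]:
    """Build tab-separated rows: token, POS tag (a hypothetical feature), label last."""
    return ["\t".join(row) for row in zip(tokens, pos_tags, labels)]


if __name__ == "__main__":
    tokens = ["conditional", "random", "fields", "improve", "keyword", "extraction"]
    pos_tags = ["a", "a", "n", "v", "n", "n"]  # toy tags, not from the paper
    keywords = [["conditional", "random", "fields"], ["keyword", "extraction"]]
    for row in to_crf_rows(tokens, pos_tags, label_tokens(tokens, keywords)):
        print(row)
```

In a sketch like this, the factors listed in the abstract map onto visible knobs: segmentation quality determines `tokens`, the training corpus size is the number of labeled documents, and the number of features corresponds to how many columns precede the label.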

Received: 2008-01-31    Published: 2008-06-25

CLC number: TP391; G252

Corresponding author: Zhang Chengmin    E-mail: zhangchengmin@gmail.com

Cite this article:

Zhang Chengmin, Xu Xin, Zhang Chengzhi. Analysis of the Factors Affecting the Performance of CRF-based Keywords Extraction Model[J]. New Technology of Library and Information Service, 2008, 24(6): 34-40.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2008.06.07 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2008/V24/I6/34
