A Rule-Based Method for Fast Automatic Identification of Titles of Web Text Resources

doi:10.11925/infotech.1003-3513.2011.06.05

New Technology of Library and Information Service (现代图书情报技术)

2011, Vol. 27

Issue (6): 27-31 https://doi.org/10.11925/infotech.1003-3513.2011.06.05

Selected Papers from DLIB & OSS 2011



Automatic Identify Title of Web Text Resource Based on Rules

Liu Jianhua¹, Zhang Zhixiong¹, Xie Jing¹, Zou Yimin¹,²

1. National Science Library, Chinese Academy of Sciences, Beijing 100190, China;
2. Graduate University of Chinese Academy of Sciences, Beijing 100049, China


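Abstract: Taking title identification for Web text resources as its entry point, this paper considers not only the features that most studies focus on, such as text formatting information (e.g., font) and position information, but also the relevance between a candidate title and the body content of the Web page. Using a large volume of historical data collected by a science and technology monitoring project as the basis for statistical analysis, it builds a rule-based method for fast title identification of Web text resources, organized around the possible sources and the features of candidate titles, and reports evaluation results for the method's time efficiency and identification accuracy.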


English abstract: Titles of Web resources play an important role in information retrieval, text clustering, and related tasks. This paper proposes a method for identifying titles automatically and quickly, based on the style information (such as font) and location information of text that many other researchers use. In addition, it considers the relevance between the title candidates and the text content. Finally, the paper implements a title identification component and reports experiments that show the effectiveness of the method.

Keywords: Web text resources, title identification, title source, title feature
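The abstract names three kinds of evidence: where a candidate title comes from, its format and position, and its relevance to the body text. The following is a minimal sketch of how such evidence could be combined into one score. It is not the authors' rule set: the source names, the weights in `SOURCE_WEIGHT`, the `Candidate` type, and the token-overlap relevance measure are all hypothetical, and the simple `\w+` tokenizer assumes space-delimited text (Chinese text would need a word segmenter instead).

```python
import re
from dataclasses import dataclass

# Hypothetical weights per candidate source; the paper's actual rules
# and statistics are not reproduced here.
SOURCE_WEIGHT = {"title_tag": 0.5, "heading": 0.8, "emphasized": 0.6}


@dataclass
class Candidate:
    text: str      # candidate title string
    source: str    # where it came from (a key of SOURCE_WEIGHT)
    position: int  # index of the text block on the page, 0 = top


def tokens(text):
    """Lowercased word tokens (assumes space-delimited text)."""
    return set(re.findall(r"\w+", text.lower()))


def relevance(candidate_text, body):
    """Fraction of the candidate's tokens that also occur in the body."""
    cand = tokens(candidate_text)
    if not cand:
        return 0.0
    return len(cand & tokens(body)) / len(cand)


def score(candidate, body, total_blocks):
    """Sum of source weight, position score, and body relevance."""
    position_score = 1.0 - candidate.position / max(total_blocks, 1)
    source_score = SOURCE_WEIGHT.get(candidate.source, 0.0)
    return source_score + position_score + relevance(candidate.text, body)


def pick_title(candidates, body, total_blocks):
    """Return the text of the highest-scoring candidate, or None."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: score(c, body, total_blocks))
    return best.text
```

With a generic `<title>` such as "Example Site - Home" and a heading whose words all appear in the body, the relevance term lets the heading outrank the `<title>` text even though the latter sits higher on the page.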

Received: 2011-05-05; Published: 2011-08-15

G203

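Funding:

This work is one of the research outcomes of the project "Integrated Registration System for Comprehensive Science and Technology Resources" (Grant No. y000021002), funded by the Knowledge Innovation Program of the Chinese Academy of Sciences and the National Science Library, Chinese Academy of Sciences, and of the Chinese Academy of Sciences "West Light" 2009 Foundation project "Construction of a Demonstration Registration System for Comprehensive Science and Technology Resources in Gansu Province".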

Cite this article:

Liu Jianhua, Zhang Zhixiong, Xie Jing, Zou Yimin. Automatic Identify Title of Web Text Resource Based on Rules. New Technology of Library and Information Service, 2011, 27(6): 27-31.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2011.06.05 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2011/V27/I6/27

[1] Changuel S, Labroche N, Bouchon-Meunier B. A General Learning Method for Automatic Title Extraction from HTML Pages[C]. In: Proceedings of the 6th International Conference on Machine Learning and Data Mining in Pattern Recognition. 2009: 704-718.

[2] Giuffrida G, Shek E C, Yang J. Knowledge-based Metadata Extraction from PostScript Files[C]. In: Proceedings of the 5th ACM Conference on Digital Libraries. 2000: 77-84.

[3] Peng F, McCallum A. Accurate Information Extraction from Research Papers Using Conditional Random Fields[C]. In: Proceedings of the Human Language Technology Conference/North American Chapter of the Association for Computational Linguistics Annual Meeting. 2004: 329-336.

[4] Zhu Haijun, Zhang Guiping, Cai Dongfeng, et al. Title Identification for Scientific Papers[C]. In: Proceedings of the 9th China National Conference on Computational Linguistics. 2007. (in Chinese)

[5] Hu Y, Li H, Cao Y, et al. Automatic Extraction of Titles from General Documents Using Machine Learning[J]. Information Processing and Management, 2006, 42(5): 1276-1293.

[6] Xue Y, Hu Y, Xin G, et al. Web Page Title Extraction and Its Application[J]. Information Processing and Management, 2007, 43(5): 1332-1347.

[7] Zhu Qing, Lü Xiaoxu. HTML Title Extraction Based on Machine Learning[J]. Microcomputer Information (微计算机信息), 2010, 26(3): 15-16, 11. (in Chinese)

[8] Li Guohua, Zan Hongying. Web Page Title Extraction Based on Sentence Similarity[J]. Journal of Chinese Information Processing (中文信息学报), 2011, 25(2): 32-37. (in Chinese)

[9] Open Document Format for Office Applications (OpenDocument) v1.0 [EB/OL]. [2011-02-10]. http://docs.oasis-open.or~oficdv1.0.

[10] HTML Parser[EB/OL]. [2011-03-10]. http://htmlparser.sourceforge.net/.

[11] Jericho HTML Parser[EB/OL]. [2011-03-10]. http://jericho.htmlparser.net/docs/index.html.

[12] Apache PDFBox-Java PDF Library[EB/OL]. [2011-03-20]. http://pdfbox.apache.org/.

[13] Apache POI-Text Extraction[EB/OL]. [2011-03-20]. http://poi.apache.org/.
