New Technology of Library and Information Service (现代图书情报技术), 2016, Vol. 32, Issue 5: 91-98. https://doi.org/10.11925/infotech.1003-3513.2016.05.11
Application Paper
A Full-text Indexing System for WARC Files
Hu Jiying, Wu Zhenxin, Xie Jing, Zhang Zhixiong
National Science Library, Chinese Academy of Sciences, Beijing 100190, China
Abstract

[Objective] To develop a parsing and indexing system for the WARC files produced by Web archiving, so that the value of archived scientific and technical websites can be fully exploited. [Context] The WARC file format is widely used in Web harvesting and archiving. As Web content grows more diverse, existing WARC indexing tools increasingly struggle to meet users' varied query needs. [Methods] We parse WARC files with a modular scheme. After comparing commonly used indexing tools, we chose the Solr platform to build the full-text indexing system. [Results] The system provides content-based search and access to WARC files. It also adds facets such as subject category, resource type, and archive time to the WARC index, exposing the content of WARC files along multiple dimensions. [Conclusions] The system gives users rich archived data from scientific and technical websites and improves the efficiency of search and access.

Key words: Web archive; WARC file; Modular parsing; Solr index
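The workflow summarized in the abstract (parse WARC records, then index them in Solr with facet fields) can be illustrated with a minimal Python sketch. It is not the authors' implementation, whose modules are not described on this page. It assumes an uncompressed WARC stream (real archives are often gzip-compressed per record), and the field names `subject`, `resource_type`, `archive_time`, and `content`, as well as the helper names, are hypothetical. The subject label is passed in by the caller because the classification method is not given here. Only standard Solr query parameters (`q`, `facet`, `facet.field`) are used.

```python
import io
from urllib.parse import urlencode


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        yield headers, stream.read(int(headers["Content-Length"]))


def to_index_doc(headers, payload, subject):
    """Build a dict for indexing; all field names are hypothetical."""
    http_head, _, body = payload.partition(b"\r\n\r\n")
    content_type = "unknown"
    for line in http_head.split(b"\r\n")[1:]:  # skip the HTTP status line
        name, _, value = line.decode("latin-1").partition(":")
        if name.strip().lower() == "content-type":
            content_type = value.split(";")[0].strip()
    return {
        "id": headers["WARC-Record-ID"],
        "url": headers["WARC-Target-URI"],
        "archive_time": headers["WARC-Date"],
        "resource_type": content_type,
        "subject": subject,  # assumed to come from an external classifier
        "content": body.decode("utf-8", errors="replace"),
    }


def facet_query(text):
    """Return a Solr query string that requests two field facets."""
    return urlencode([
        ("q", f"content:{text}"),
        ("facet", "true"),
        ("facet.field", "subject"),
        ("facet.field", "resource_type"),
    ])


def make_sample_record():
    """Build one synthetic 'response' record for demonstration."""
    payload = (b"HTTP/1.1 200 OK\r\n"
               b"Content-Type: text/html; charset=utf-8\r\n\r\n"
               b"<p>hello</p>")
    head = ("WARC/1.0\r\n"
            "WARC-Type: response\r\n"
            "WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
            "WARC-Date: 2015-12-25T00:00:00Z\r\n"
            "WARC-Target-URI: http://example.org/\r\n"
            f"Content-Length: {len(payload)}\r\n\r\n").encode("utf-8")
    return head + payload + b"\r\n\r\n"


if __name__ == "__main__":
    for headers, payload in iter_warc_records(io.BytesIO(make_sample_record())):
        if headers.get("WARC-Type") == "response":
            print(to_index_doc(headers, payload, subject="physics"))
    print(facet_query("hello"))
```

Keeping record parsing, document building, and query construction in separate functions loosely mirrors the modular idea: each step can be replaced without touching the others. A facet on archive time would more likely use Solr's range faceting than a plain field facet, which this sketch leaves out.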
Received: 2016-02-25      Published: 2016-06-24
Cite this article:
Hu Jiying, Wu Zhenxin, Xie Jing, Zhang Zhixiong. A Full-text Indexing System for WARC Files[J]. New Technology of Library and Information Service, 2016, 32(5): 91-98.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.05.11      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2016/V32/I5/91