New Technology of Library and Information Service  2016, Vol. 32 Issue (5): 91-98    DOI: 10.11925/infotech.1003-3513.2016.05.11
Original Article
A Full-text Indexing System for WARC Files
Hu Jiying, Wu Zhenxin, Xie Jing, Zhang Zhixiong
National Science Library, Chinese Academy of Sciences, Beijing 100190, China
Abstract  

[Objective] This paper develops a parsing and indexing system for WARC files, in order to better exploit the value of Web archives from scientific institutions. [Context] WARC files are widely used in digital curation, but existing full-text indexing tools cannot satisfy the diverse needs of users who search WARC archives. [Methods] We employed a modular scheme to parse the WARC files. After analyzing popular indexing tools, we developed a new full-text indexing system based on the Solr platform. [Results] The new system effectively indexed the Web archives. Users could search from different perspectives, such as subject category, resource type, and archived time. [Conclusions] The new system indexes rich Web archives from international institutions and improves the efficiency of users' information retrieval.

Key words: Web archive; WARC file; Modular parse; Solr index
Received: 25 February 2016      Published: 24 June 2016
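
The parse-then-index flow outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an uncompressed WARC stream (real archives are often gzip-compressed), and the function names `parse_warc_records` and `to_index_doc`, as well as the output field names, are hypothetical. Only the WARC header names come from the WARC format itself. Sending the resulting documents to Solr is left out.

```python
import io


def parse_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        length = int(headers.get("Content-Length", 0))
        payload = stream.read(length)
        yield headers, payload


def to_index_doc(headers, payload):
    """Map one record to a flat dict; the field names are illustrative assumptions."""
    return {
        "id": headers.get("WARC-Record-ID"),
        "url": headers.get("WARC-Target-URI"),
        "archived_time": headers.get("WARC-Date"),
        "record_type": headers.get("WARC-Type"),
        "content": payload.decode("utf-8", errors="replace"),
    }


# A tiny hand-written record used only for demonstration.
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"WARC-Date: 2016-02-25T00:00:00Z\r\n"
    b"WARC-Record-ID: <urn:uuid:1234>\r\n"
    b"Content-Length: 11\r\n"
    b"\r\n"
    b"hello world"
    b"\r\n\r\n"
)

docs = [to_index_doc(h, p) for h, p in parse_warc_records(io.BytesIO(SAMPLE))]
```

Keeping the record parser separate from the document mapper mirrors the modular idea: the mapper could be swapped per resource type without touching the reader.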

Cite this article:

Hu Jiying, Wu Zhenxin, Xie Jing, Zhang Zhixiong. A Full-text Indexing System for WARC Files. New Technology of Library and Information Service, 2016, 32(5): 91-98.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2016.05.11     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2016/V32/I5/91
