A Full-text Indexing System for WARC Files
Hu Jiying, Wu Zhenxin, Xie Jing, Zhang Zhixiong
National Science Library, Chinese Academy of Sciences, Beijing 100190, China
Abstract [Objective] This paper develops a parsing and indexing system for WARC files to fully exploit the value of Web archives from scientific institutions. [Context] WARC files are widely used in digital curation, but existing full-text indexing tools cannot satisfy the diverse needs of users who search WARC archives. [Methods] We adopted a modular scheme to parse the WARC files and, after analyzing popular indexing tools, developed a new full-text indexing system based on the Solr platform. [Results] The new system effectively indexed the Web archives. Users can search from different perspectives, such as subject category, resource type, and archived time. [Conclusions] The new system indexes rich Web archives from international institutions and improves the efficiency of users’ information retrieval.
Received: 25 February 2016
Published: 24 June 2016