[Objective] This paper develops a parsing and indexing system for WARC files to make fuller use of the Web archives collected from scientific institutions. [Context] WARC files are widely used in digital curation. However, existing full-text indexing tools cannot satisfy the diverse needs of users who search WARC archives. [Methods] We employed a modular scheme to parse the WARC files. After analyzing popular indexing tools, we developed a new full-text indexing system based on the Solr platform. [Results] The new system effectively indexed the Web archives. Users could search from different perspectives, such as subject category, resource type, and archive time. [Conclusions] The new system indexes rich Web archives from international institutions and improves the efficiency of users’ information retrieval.
Hu Jiying, Wu Zhenxin, Xie Jing, Zhang Zhixiong. A Full-text Indexing System for WARC Files[J]. New Technology of Library and Information Service, 2016, 32(5): 91-98. DOI: 10.11925/infotech.1003-3513.2016.05.11.