%A Hu Jiying, Wu Zhenxin, Xie Jing, Zhang Zhixiong
%T A Full-text Indexing System for WARC Files
%0 Journal Article
%D 2016
%J Data Analysis and Knowledge Discovery
%R 10.11925/infotech.1003-3513.2016.05.11
%P 91-98
%V 32
%N 5
%U {https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/abstract/article_4229.shtml}
%8 2016-05-25
%X

[Objective] This paper develops a parsing and indexing system for WARC files to fully exploit the value of Web archives from scientific institutions. [Context] WARC files are widely used in digital curation, but existing full-text indexing tools cannot satisfy the diverse needs of users who search WARC archives. [Methods] We used a modular scheme to parse the WARC files and, after analyzing popular indexing tools, developed a new full-text indexing system based on the Solr platform. [Results] The new system effectively indexed the Web archives. Users could search from different perspectives, such as subject category, resource type, and archive time. [Conclusions] The new system indexes rich Web archives from international institutions and improves the efficiency of users’ information retrieval.