Developing Web Archive System of International Institutions Based on IIPC Open Source Software
Wu Zhenxin, Zhang Zhixiong, Xie Jing, Hu Jiying
National Science Library, Chinese Academy of Sciences, Beijing 100190, China
Abstract [Objective] To develop a Web archive system for international institutions. [Methods] Building on the IIPC open source software framework, this paper applies a three-layer expansion strategy at the acquisition terminal, adds automatic uploading and reporting functions to the acquisition client, develops a WARC parser that analyzes the content of WARC files, and uses Solr as the indexer. [Results] The system implements the acquisition expansion, raises the automation level of the workflow through the additional function modules in the acquisition client, extracts more information through the WARC parser modules, and uses Solr to enrich the index and retrieval services. [Limitations] The platform has not yet been verified against a large-scale Web archive. [Conclusions] The expanded Web archive framework is distributed, extensible, and fully automatic.
Received: 03 September 2014
Published: 21 May 2015