[Objective] To address the difficulty of collecting Web data and of integrating multiple types of digital resources when constructing characteristic databases. [Context] Information in characteristic digital resources has a short lifespan, and the heterogeneous database platforms in Shaanxi differ greatly from one another, offer limited RSS interface support, and contain complex data formats. [Methods] Web data collection technologies such as Drupal Feeds, XPath Parser, crawlers, and Image Grabber are combined with data cleaning and removal to make Web data collection specialized and systematic. [Results] The study explores RSS collection with Feeds, automatic HTML/XML acquisition, collection rules tailored to the characteristics of different resources, and Web streaming media collection. [Conclusions] This study can enrich the platform's data sources and offers partial solutions to problems such as difficult data collection, non-standardized data formats, and limited data source channels.
李丹, 闫晓弟, 魏青山. Drupal数据采集在构建特色数字资源中的实践[J]. 现代图书情报技术, 2015, 31(7-8): 148-154.
Li Dan, Yan Xiaodi, Wei Qingshan. Practice of Data Collection in Building Characteristic Digital Resources Based on Drupal. New Technology of Library and Information Service, 2015, 31(7-8): 148-154.