1National Science Library, Chinese Academy of Sciences, Beijing 100190, China 2Department of Library Information and Archives Management, University of Chinese Academy of Sciences, Beijing 100190, China 3Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
[Objective] This research addresses the issues facing the storage and online access of massive text-level documents, the governance of large-scale data, and the low service performance, aiming to build a big data platform for sci-tech literature. [Methods] First, we analyzed the characteristics of distributed big data services for science and technology. Then, we adopted a co-tenant deployment strategy based on the servers and networks. Finally, we designed a big data platform for sci-tech literature with a “5+2” overall architecture. [Results] We established a PB-level big data platform for sci-tech literature. It has data storage capacity of 200TB and collected 320 million document entities as well as 6 billion entity relationship. The metadata processing performance based on MapReduce was increased by 3 times, and then formed the knowledge service architecture based on new technology. [Limitations] We did not adequately process streaming data, thus the system cannot offer prompt response for new data. [Conclusions] The new platform supports the knowledge discovery services of National Science Library, Chinese Academy of Sciences, as well as the intelligent scientific research system. It has good online services and improves the processing and service capabilities of sci-tech literature.
常志军,钱力,谢靖,吴振新,张鹄,于倩倩,王颖,王永吉. 基于分布式技术的科技文献大数据平台的建设研究*[J]. 数据分析与知识发现, 2021, 5(3): 69-77.
Chang Zhijun,Qian Li,Xie Jing,Wu Zhenxin,Zhang Hu,Yu Qianqian,Wang Ying,Wang Yongji. Big Data Platform for Sci-Tech Literature Based on Distributed Technology. Data Analysis and Knowledge Discovery, 2021, 5(3): 69-77.