Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (3): 69-77    DOI: 10.11925/infotech.2096-3467.2018.1371
Big Data Platform for Sci-Tech Literature Based on Distributed Technology
Chang Zhijun1,2, Qian Li1,2, Xie Jing1,2, Wu Zhenxin1,2, Zhang Hu1, Yu Qianqian1, Wang Ying1, Wang Yongji3
1National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2Department of Library Information and Archives Management, University of Chinese Academy of Sciences, Beijing 100190, China
3Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
[Objective] This research addresses the problems of storing and providing online access to massive text-level documents, governing large-scale data, and low service performance, with the aim of building a big data platform for sci-tech literature. [Methods] First, we analyzed the characteristics of distributed big data services for science and technology. Then, we adopted a co-tenant deployment strategy based on the servers and networks. Finally, we designed a big data platform for sci-tech literature with a “5+2” overall architecture. [Results] We established a PB-level big data platform for sci-tech literature. It has a data storage capacity of 200TB and has collected 320 million document entities as well as 6 billion entity relationships. The metadata processing performance based on MapReduce was increased by 3 times, and a knowledge service architecture based on new technology was formed. [Limitations] We did not adequately process streaming data, so the system cannot respond promptly to new data. [Conclusions] The new platform supports the knowledge discovery services of the National Science Library, Chinese Academy of Sciences, as well as the intelligent scientific research system. It provides good online services and improves the processing and service capabilities for sci-tech literature.

Key words: Big Data Technology; Distributed Storage; Distributed Computing; Co-Tenant Deployment; Data Warehouse
Received: 04 December 2018      Published: 12 April 2021
ZTFLH:  TP311  
Architecture of Big Data Platform for Scientific Literature
Flowchart of Warning System Based on Logs
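| No. | Big data platform software system | Deployment mode | Server IDs | Cluster size (servers) |
|---|---|---|---|---|
| 1 | Distributed file system (HDFS) | Co-tenant deployment | S1…15 | 15 |
| 2 | Distributed small-file storage system (FastDFS) | Co-tenant deployment | S11…15 | 5 |
| 3 | Distributed data warehouse system | Co-tenant deployment | S1…15 | 15 |
| 4 | Distributed MapReduce | Co-tenant deployment | S1…15 | 15 |
| 5 | Spark | Co-tenant deployment | S16…24 | 9 |
| 6 | Distributed search engine system | Co-tenant deployment | S16…24 | 9 |
| 7 | Microservice system | Standalone deployment | S25, 26 | 2 |
| 8 | Distributed cache system | Co-tenant deployment | S25, 26 | 2 |
| 9 | Harvesting server group | Co-tenant deployment | S10, 11 | 2 |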
Software Co-Tenant Deployment Strategy of Big Data Platform for Scientific Literature
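The deployment table above can be read as a mapping from software systems to server ranges. The following is a minimal Python sketch, not the authors' tooling, that transcribes the table and inverts it to show which systems share each server. The names `DEPLOYMENT` and `tenants_by_server` are hypothetical and introduced only for illustration; the deployment modes and server ranges are copied from the table as listed.

```python
from collections import defaultdict

# (system, deployment mode, server numbers) transcribed from the table above.
# The variable and function names are illustrative assumptions.
DEPLOYMENT = [
    ("HDFS", "co-tenant", range(1, 16)),                        # S1…15
    ("FastDFS", "co-tenant", range(11, 16)),                    # S11…15
    ("Distributed data warehouse", "co-tenant", range(1, 16)),  # S1…15
    ("MapReduce", "co-tenant", range(1, 16)),                   # S1…15
    ("Spark", "co-tenant", range(16, 25)),                      # S16…24
    ("Distributed search engine", "co-tenant", range(16, 25)),  # S16…24
    ("Microservices", "standalone", (25, 26)),                  # S25, 26
    ("Distributed cache", "co-tenant", (25, 26)),               # S25, 26
    ("Harvesting servers", "co-tenant", (10, 11)),              # S10, 11
]


def tenants_by_server(deployment):
    """Invert the table: server label -> list of systems placed on it."""
    placement = defaultdict(list)
    for system, _mode, servers in deployment:
        for n in servers:
            placement[f"S{n}"].append(system)
    return dict(placement)


placement = tenants_by_server(DEPLOYMENT)
for server in sorted(placement, key=lambda s: int(s[1:])):
    print(server, placement[server])
```

Inverting the table this way makes the co-tenancy visible per machine: the storage and batch-processing systems share S1 to S15, while Spark and the search engine share S16 to S24.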
Network Topology Diagram of Big Data Platform for Scientific Literature
Subsystem Structure Diagram of Big Data Platform for Scientific Literature
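| No. | Document type | Data volume |
|---|---|---|
| 1 | Papers | 230 million+ |
| 2 | Patents | 90 million+ |
| 3 | Reports | 700,000+ |
| 4 | Standards | 300,000+ |
| 5 | Courseware | 50,000+ |
| 6 | Books | 1 million+ |
| 7 | Policies | 600,000+ |
| 8 | Featured data | 2 million+ |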
Aggregated Data Volume Statistics for Main Entities
Schematic Diagram of Intelligent Knowledge Service Products
