1 National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2 Department of Library Information and Archives Management, University of Chinese Academy of Sciences, Beijing 100190, China
3 Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
[Objective] This research addresses three problems in building a big data platform for sci-tech literature: storing and providing online access to massive text-level documents, governing large-scale data, and low service performance. [Methods] First, we analyzed the characteristics of distributed big data services for science and technology. Then, we adopted a co-tenant deployment strategy based on the servers and networks. Finally, we designed a big data platform for sci-tech literature with a “5+2” overall architecture. [Results] We established a PB-level big data platform for sci-tech literature. It has a data storage capacity of 200TB and holds 320 million document entities and 6 billion entity relationships. MapReduce-based metadata processing performance increased by 3 times, and a knowledge service architecture based on new technology was formed. [Limitations] We did not adequately process streaming data, so the system cannot respond promptly to new data. [Conclusions] The new platform supports the knowledge discovery services of the National Science Library, Chinese Academy of Sciences, as well as the intelligent scientific research system. It provides good online services and improves the processing and service capabilities for sci-tech literature.
Chang Zhijun, Qian Li, Xie Jing, Wu Zhenxin, Zhang Hu, Yu Qianqian, Wang Ying, Wang Yongji. Big Data Platform for Sci-Tech Literature Based on Distributed Technology[J]. Data Analysis and Knowledge Discovery, 2021, 5(3): 69-77.