Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (3): 69-77    DOI: 10.11925/infotech.2096-3467.2018.1371
Big Data Platform for Sci-Tech Literature Based on Distributed Technology
Chang Zhijun 1,2, Qian Li 1,2, Xie Jing 1,2, Wu Zhenxin 1,2, Zhang Hu 1, Yu Qianqian 1, Wang Ying 1, Wang Yongji 3
1 National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2 Department of Library Information and Archives Management, University of Chinese Academy of Sciences, Beijing 100190, China
3 Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
Abstract  

[Objective] This research addresses the storage and online access of massive text-level documents, the governance of large-scale data, and low service performance, with the aim of building a big data platform for sci-tech literature. [Methods] First, we analyzed the characteristics of distributed big data services for science and technology. Then, we adopted a co-tenant deployment strategy based on the available servers and networks. Finally, we designed a big data platform for sci-tech literature with a “5+2” overall architecture. [Results] We established a PB-level big data platform for sci-tech literature. It has a data storage capacity of 200 TB and has collected 320 million document entities and 6 billion entity relationships. MapReduce-based metadata processing performance increased by 3 times, and the platform forms a knowledge service architecture based on new technology. [Limitations] We did not adequately process streaming data, so the system cannot respond promptly to new data. [Conclusions] The new platform supports the knowledge discovery services of the National Science Library, Chinese Academy of Sciences, as well as the intelligent scientific research system. It provides good online services and improves the processing and service capabilities for sci-tech literature.

Key words: Big Data Technology; Distributed Storage; Distributed Computing; Co-Tenant Deployment; Data Warehouse
Received: 04 December 2018      Published: 12 April 2021
ZTFLH: TP311; G250
Corresponding Authors: Chang Zhijun     E-mail: changzj@mail.las.ac.cn

Cite this article:

Chang Zhijun, Qian Li, Xie Jing, Wu Zhenxin, Zhang Hu, Yu Qianqian, Wang Ying, Wang Yongji. Big Data Platform for Sci-Tech Literature Based on Distributed Technology. Data Analysis and Knowledge Discovery, 2021, 5(3): 69-77.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.1371     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I3/69

Architecture of Big Data Platform for Scientific Literature
Flowchart of Warning System Based on Logs
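No.  Software system of the big data platform          Deployment mode  Server IDs  Cluster size (servers)
1    Distributed file system (HDFS)                    Co-tenant        S1…15       15
2    Distributed small-file storage system (FastDFS)   Co-tenant        S11…15      5
3    Distributed data warehouse system                 Co-tenant        S1…15       15
4    Distributed computing engine: MapReduce           Co-tenant        S1…15       15
5    Distributed computing engine: Spark               Co-tenant        S16…24      9
6    Distributed search engine system                  Co-tenant        S16…24      9
7    Microservice system                               Standalone       S25, 26     2
8    Distributed cache system                          Co-tenant        S25, 26     2
9    Harvesting server cluster                         Co-tenant        S10, 11     2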
Software Co-leasing Strategy of Big Data Platform for Scientific Literature
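The co-tenant layout in the table can also be read as a mapping from each server to the systems it hosts. The following Python sketch is illustrative only and is not part of the published platform. It assumes that a range such as S1…15 means servers S1 through S15 inclusive (which matches the listed cluster sizes), and the dictionary keys are shortened English labels chosen for this example.

```python
from collections import defaultdict

# Deployment table transcribed from the co-tenant strategy above.
# ASSUMPTION: "S1…15" denotes servers S1..S15 inclusive; names are informal labels.
DEPLOYMENT = {
    "HDFS": ("co-tenant", range(1, 16)),
    "FastDFS": ("co-tenant", range(11, 16)),
    "Data warehouse": ("co-tenant", range(1, 16)),
    "MapReduce": ("co-tenant", range(1, 16)),
    "Spark": ("co-tenant", range(16, 25)),
    "Search engine": ("co-tenant", range(16, 25)),
    "Microservices": ("standalone", range(25, 27)),
    "Cache": ("co-tenant", range(25, 27)),
    "Harvesters": ("co-tenant", range(10, 12)),
}


def systems_per_server(deployment):
    """Invert the table: server id -> alphabetically sorted list of hosted systems."""
    hosted = defaultdict(list)
    for system, (_mode, servers) in deployment.items():
        for n in servers:
            hosted[f"S{n}"].append(system)
    return {server: sorted(names) for server, names in hosted.items()}


hosted = systems_per_server(DEPLOYMENT)

# Cluster size per system, to compare against the last column of the table.
cluster_sizes = {name: len(servers) for name, (_mode, servers) in DEPLOYMENT.items()}

if __name__ == "__main__":
    for server in sorted(hosted, key=lambda s: int(s[1:])):
        print(server, hosted[server])
```

Under this reading the platform spans 26 servers, and the storage and batch-computing systems share S1 to S15 while Spark and the search engine share S16 to S24.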
Network Topology Diagram of Big Data Platform for Scientific Literature
Structure Diagram of Big Data Platform Subsystem of Scientific Literature
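No.  Document type             Data volume
1    Papers                    230 million+
2    Patents                   90 million+
3    Reports                   700,000+
4    Standards                 300,000+
5    Courseware                50,000+
6    Books                     1 million+
7    Policies                  600,000+
8    Special-collection data   2 million+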
Main Entities Aggregated Data Volume Statistics
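As a quick consistency check, the figures in the table can be summed and compared with the 320 million document entities reported in the abstract. This is a minimal sketch: the category keys are informal labels chosen here, and every number is the "+" lower bound from the table, not an exact count.

```python
# Lower bounds from the entity statistics table ("+" in the source means "at least").
ENTITY_COUNTS = {
    "papers": 230_000_000,
    "patents": 90_000_000,
    "reports": 700_000,
    "standards": 300_000,
    "courseware": 50_000,
    "books": 1_000_000,
    "policies": 600_000,
    "special collections": 2_000_000,
}

# Total of the lower bounds across all document types.
total = sum(ENTITY_COUNTS.values())

# Fraction of the total contributed by each document type.
share = {name: count / total for name, count in ENTITY_COUNTS.items()}

if __name__ == "__main__":
    print(f"total lower bound: {total:,}")
    for name, fraction in sorted(share.items(), key=lambda kv: -kv[1]):
        print(f"{name:20s} {fraction:.3%}")
```

Under these lower bounds the total is about 324.65 million, which is consistent with the 320 million reported in the abstract, and papers and patents together account for roughly 98.6% of it.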
Schematic Diagram of Intelligent Knowledge Service Products
  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938   E-mail:jishu@mail.las.ac.cn