[Objective] This paper proposes a new system that automatically tracks, acquires, stores, and manages scientific information to support research in related fields. [Methods] We built the system on CSpace and resolved a number of technical issues, then tested it with marine information. [Results] The proposed system automatically retrieved multi-source, heterogeneous scientific information, which supported the construction of a science and technology platform. [Limitations] The system's information acquisition procedure was complex, and it could not retrieve documents from password-protected sites. [Conclusions] The proposed method extends CSpace's data acquisition and integration functions and might be transferred to other fields.