New Technology of Library and Information Service  2007, Vol. 2 Issue (1): 49-52    DOI: 10.11925/infotech.1003-3513.2007.01.12
Design and Implementation of Focused Crawler Based on OSS
Li Chunwang
(Library of Chinese Academy of Sciences, Beijing 100080, China)
Abstract  

After analyzing the architecture of a focused crawler and the strategies for implementing it with open source software (OSS), this paper focuses on subject modeling and the related algorithms, and explains in detail the integration technologies used, which include common Java standards, Web services, and the Java Native Interface (JNI).

Key words: Focused crawler; Search engine; OSS; System design and implementation
Received: 10 November 2006      Published: 25 January 2007
CLC Number: TP39
Corresponding Authors: Li Chunwang     E-mail: licw@mail.las.ac.cn
About author: Li Chunwang

Cite this article:

Li Chunwang. Design and Implementation of Focused Crawler Based on OSS. New Technology of Library and Information Service, 2007, 2(1): 49-52.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2007.01.12     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2007/V2/I1/49

