New Technology of Library and Information Service  2010, Vol. 26 Issue (3): 52-57    DOI: 10.11925/infotech.1003-3513.2010.03.09
Overview of Research on Data Collection from Ajax Sites
Xia Tian
(School of Information Resource Management, Renmin University of China, Beijing 100872, China)
Abstract  

This paper reviews recent advances in collecting data from Ajax sites along five lines: identifying Ajax link elements, identifying page states, transforming between page states in a controllable way, extracting content, and detecting duplicate states. It summarizes the overall processing flow and the supporting technologies, and discusses emerging research trends. The review is intended to support further research on Ajax data collection.
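
The five aspects named in the abstract can be read as stages of one crawl loop. The following is a minimal sketch under stated assumptions, not the method of this paper or of any work it surveys: the `TOY_SITE` table stands in for a real browser or HTML renderer, and the names `state_id`, `extract`, and `crawl` are hypothetical. Clickable elements are assumed to be known in advance, a page state is identified by hashing its normalized markup, and duplicate states are detected by looking that hash up in a table of visited states.

```python
import hashlib
import re
from collections import deque

# Hypothetical toy site: state name -> (markup, {clickable element: next state}).
# A real crawler would obtain both by driving a browser or renderer.
TOY_SITE = {
    "home":  ("<div id='c'>Home</div>",  {"tab1": "news", "tab2": "about"}),
    "news":  ("<div id='c'>News</div>",  {"tab2": "about", "home": "home"}),
    "about": ("<div id='c'>About</div>", {"tab1": "news", "refresh": "about"}),
}


def state_id(html: str) -> str:
    """Page state identification: hash the whitespace-normalized markup."""
    normalized = " ".join(html.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def extract(html: str) -> str:
    """Content extraction: strip tags and keep the visible text (toy version)."""
    return re.sub(r"<[^>]+>", "", html).strip()


def crawl(start: str):
    """Breadth-first walk over page states, skipping duplicates."""
    seen = {}          # state id -> extracted content
    transitions = []   # (from state id, clicked element, to state id)
    queue = deque([start])
    while queue:
        name = queue.popleft()
        html, clickables = TOY_SITE[name]
        sid = state_id(html)
        if sid in seen:            # duplicate state detection
            continue
        seen[sid] = extract(html)
        # State transformation: "click" each known element in a fixed order.
        for element, target in sorted(clickables.items()):
            target_html, _ = TOY_SITE[target]
            transitions.append((sid, element, state_id(target_html)))
            queue.append(target)
    return seen, transitions


if __name__ == "__main__":
    states, edges = crawl("home")
    print(len(states), "distinct states,", len(edges), "transitions")
```

Hashing the whole markup is the simplest possible state test; it treats any change in the page, however small, as a new state, so a practical crawler would likely need a more tolerant comparison.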

Key words: Data collection; Ajax crawler; HTML renderer; Web2.0
Received: 06 March 2010      Published: 25 March 2010
Classification code: G350
Corresponding Authors: Xia Tian     E-mail: xiat@ruc.edu.cn
About author: Xia Tian

Cite this article:

Xia Tian. Overview of Research on Data Collection from Ajax Sites. New Technology of Library and Information Service, 2010, 26(3): 52-57.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2010.03.09     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2010/V26/I3/52


 
