Automatic Web Information Extraction Based on DOM
Wu Wei, Liu Youhua
(Department of Information Management, Nanjing University, Nanjing 210093, China)
Abstract: More and more Web sites are built on a database-driven architecture, and their pages are generated dynamically. This paper proposes an approach to the automatic extraction of information from such Web pages and implements it using the WebBrowser control and DOM techniques. The implementation locates elements on a page, fills in and submits forms automatically, retrieves the query results, and extracts the required information from them. The paper also gives implementation details and sample code through a prototype.
Key words: Dynamic Web; Automatic information extraction; DOM; WebBrowser
Received: 2003-09-15
Published: 2004-01-06
Corresponding author: Wu Wei
E-mail: wuweibox@hotmail.com
About the authors: Wu Wei, Liu Youhua