Extracting the Content of Google Web Page with Regular Expressions
Zhang Jian1, Ou Hong2
1 (Library of Changsha University of Science and Technology, Changsha 410076, China)
2 (Hunan Library, Changsha 410011, China)
Abstract: Properly and completely extracting the content of search result Web pages is a basic precondition for processing the retrieved information. This paper analyses the structural characteristics of Google Web pages, presents a group of regular expressions for matching the content of these pages, and implements a content extractor in Visual C#. Results from practical application to many Google Web pages show that matching with regular expressions can extract the entire main content of these pages.
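The general idea of matching result entries with regular expressions can be illustrated with a minimal sketch. It is written in Python rather than the Visual C# used in the paper, and it does not reproduce the paper's expressions. The sample markup, the pattern, and the names `SAMPLE_PAGE`, `RESULT_RE`, `clean`, and `extract_results` are all hypothetical, chosen only to show how one pattern with named groups might pull a URL, title, and snippet out of each result block.

```python
import re
from html import unescape

# Hypothetical, simplified result-page markup. This is NOT Google's real HTML;
# it only stands in for a page where every result repeats the same structure.
SAMPLE_PAGE = """
<p class="g"><a href="http://example.com/a">First <b>result</b></a><br>
<font size="-1">Snippet for the &amp; first result.</font></p>
<p class="g"><a href="http://example.org/b">Second result</a><br>
<font size="-1">Snippet for the second result.</font></p>
"""

# One pattern per result block (assumed structure): link, then snippet.
# Lazy quantifiers keep each match inside a single result.
RESULT_RE = re.compile(
    r'<p class="g">\s*'
    r'<a href="(?P<url>[^"]+)">(?P<title>.*?)</a>'
    r'.*?<font size="-1">(?P<snippet>.*?)</font>',
    re.DOTALL | re.IGNORECASE,
)

# Matches any remaining inline tag such as <b> or <br>.
TAG_RE = re.compile(r"<[^>]+>")


def clean(fragment):
    """Strip inline tags, decode HTML entities, and collapse whitespace."""
    text = unescape(TAG_RE.sub("", fragment))
    return " ".join(text.split())


def extract_results(page):
    """Return one dict (url, title, snippet) per result block found in page."""
    return [
        {
            "url": m.group("url"),
            "title": clean(m.group("title")),
            "snippet": clean(m.group("snippet")),
        }
        for m in RESULT_RE.finditer(page)
    ]


if __name__ == "__main__":
    for item in extract_results(SAMPLE_PAGE):
        print(item["title"], "|", item["url"], "|", item["snippet"])
```

A pattern of this kind is tied to the page layout it was written for, so it would need to be revised whenever the markup of the result page changes.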
Received: 30 May 2005
Published: 25 September 2005
Corresponding author: Zhang Jian
E-mail: ehulh@163.com
About authors: Zhang Jian, Ou Hong