Extracting the content of search-result Web pages correctly and completely is a basic precondition for processing the retrieved information. This paper analyzes the structural characteristics of Google Web pages, presents a group of regular expressions for matching the content of these pages, and implements a content extractor in Visual C#. Results from applying the extractor to many Google Web pages show that the regular-expression matching method can extract the entire main content of these pages.
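As a rough illustration of the regular-expression approach summarized above, the following minimal sketch pulls the URL, title, and snippet out of each entry on a result page. It is written in Python rather than the paper's Visual C#, and the markup in `SAMPLE_PAGE`, the patterns `RESULT_RE` and `TAG_RE`, and the helpers `clean` and `extract_results` are all hypothetical: they are neither Google's actual page structure nor the authors' expressions.

```python
import re
from html import unescape

# Hypothetical markup, for illustration only; this is NOT Google's real page
# structure and NOT the pattern set described in the paper.
SAMPLE_PAGE = """
<div class="result">
  <a class="title" href="http://example.com/a">First <b>result</b></a>
  <span class="snippet">Summary of the &quot;first&quot; page.</span>
</div>
<div class="result">
  <a class="title" href="http://example.com/b">Second result</a>
  <span class="snippet">Summary of the second page.</span>
</div>
"""

# One pattern per result entry: a title link followed by a snippet span.
# Named groups keep the extraction code readable.
RESULT_RE = re.compile(
    r'<a class="title" href="(?P<url>[^"]+)">(?P<title>.*?)</a>\s*'
    r'<span class="snippet">(?P<snippet>.*?)</span>',
    re.DOTALL,
)

# Matches any remaining inline tag (for example <b>) inside a captured field.
TAG_RE = re.compile(r"<[^>]+>")


def clean(fragment):
    """Strip inline tags and decode HTML entities in a captured fragment."""
    return unescape(TAG_RE.sub("", fragment)).strip()


def extract_results(page):
    """Return one dict per result entry matched in the page source."""
    return [
        {
            "url": match.group("url"),
            "title": clean(match.group("title")),
            "snippet": clean(match.group("snippet")),
        }
        for match in RESULT_RE.finditer(page)
    ]


if __name__ == "__main__":
    for result in extract_results(SAMPLE_PAGE):
        print(result["title"], result["url"], result["snippet"], sep=" | ")
```

A pattern of this kind depends entirely on the page's markup staying stable, so any change to the layout would require the expressions to be revised.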
Zhang Jian, Ou Hong. Extracting the Content of Google Web Page with Regular Expressions [J]. New Technology of Library and Information Service (现代图书情报技术), 2005, 21(9): 50-53.