New Technology of Library and Information Service  2005, Vol. 21 Issue (9): 50-53    DOI: 10.11925/infotech.1003-3513.2005.09.12
Extracting the Content of Google Web Page with Regular Expressions
Zhang Jian (1), Ou Hong (2)
(1) Library of Changsha University of Science and Technology, Changsha 410076, China
(2) Hunan Library, Changsha 410011, China
Abstract  

Properly and completely extracting the content of search result Web pages is a basic precondition for processing the retrieved information. This paper analyzes the structural characteristics of Google Web pages, presents a group of regular expressions for matching the content of these pages, and implements a content extractor in Visual C#. Results from practical application to many Google Web pages show that the regular-expression matching method can extract the whole main content of these pages.
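
The approach summarized above can be illustrated with a minimal sketch. It is written in Python rather than the authors' Visual C#, and it does not reproduce the paper's actual expressions, which are not given on this page. The sample markup (`SAMPLE_HTML`), the pattern `RESULT_RE`, and the helpers `strip_tags` and `extract_results` are all hypothetical and assume that each result is a `<p class="g">` block holding a linked title followed by a snippet.

```python
import re
from html import unescape

# Hypothetical result-page markup, assumed for illustration only.
SAMPLE_HTML = """
<p class="g"><a href="http://example.com/a">First <b>result</b></a><br>
<font size="-1">Snippet for the first &amp; best page.</font></p>
<p class="g"><a href="http://example.org/b">Second result</a><br>
<font size="-1">Another <b>snippet</b> here.</font></p>
"""

# One expression per result block: URL, title, and snippet as named groups.
# The lazy quantifiers (.*?) stop at the first closing tag, so adjacent
# results are not merged into a single match.
RESULT_RE = re.compile(
    r'<p class="g">\s*<a href="(?P<url>[^"]+)">(?P<title>.*?)</a>'
    r'.*?<font size="-1">(?P<snippet>.*?)</font>',
    re.IGNORECASE | re.DOTALL,
)

# Matches any remaining inline tag such as <b> or <br>.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(fragment):
    """Remove inline tags and decode HTML entities in a matched fragment."""
    return unescape(TAG_RE.sub("", fragment)).strip()


def extract_results(html):
    """Return one dict per matched result block, in page order."""
    return [
        {
            "url": m.group("url"),
            "title": strip_tags(m.group("title")),
            "snippet": strip_tags(m.group("snippet")),
        }
        for m in RESULT_RE.finditer(html)
    ]


results = extract_results(SAMPLE_HTML)
for r in results:
    print(r["url"], "|", r["title"], "|", r["snippet"])
```

A pattern of this kind depends entirely on the page layout it was written for, so the expressions would need to be revised whenever the markup of the result pages changes.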

Key words: Regular expressions; Extraction; Web page; Google
Received: 30 May 2005      Published: 25 September 2005
CLC Number: G354.4; TP391.3
Corresponding Authors: Zhang Jian     E-mail: ehulh@163.com
About the authors: Zhang Jian, Ou Hong

Cite this article:

Zhang Jian, Ou Hong. Extracting the Content of Google Web Page with Regular Expressions. New Technology of Library and Information Service, 2005, 21(9): 50-53.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2005.09.12     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2005/V21/I9/50

