New Technology of Library and Information Service  2005, Vol. 21 Issue (9): 50-53    DOI: 10.11925/infotech.1003-3513.2005.09.12
Extracting the Content of Google Web Page with Regular Expressions
Zhang Jian1, Ou Hong2
1 (Library of Changsha University of Science and Technology, Changsha 410076, China)
2 (Hunan Library, Changsha 410011, China)
Abstract  

Extracting the content of search result pages correctly and completely is a basic precondition for processing the retrieved information. This paper analyses the structural characteristics of Google Web pages, presents a group of regular expressions for matching the content of these pages, and implements a content extractor in Visual C#. Results from practical application to many Google Web pages show that the regular-expression matching method can extract the whole main content of these pages.
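
As a rough illustration of the approach described in the abstract, the following minimal sketch applies a regular expression with named groups to a search result page and collects the URL, title, and snippet of each hit. It is written in Python rather than the Visual C# used in the paper, and the markup in `SAMPLE_PAGE`, the pattern `RESULT_RE`, and the helpers `clean` and `extract_results` are all hypothetical; the actual Google page structure and the paper's own expressions are not reproduced here.

```python
import re
from html import unescape

# Hypothetical markup standing in for a search result page; this is an
# assumption for illustration, not the structure analysed in the paper.
SAMPLE_PAGE = """
<div class="result"><a href="http://example.com/a">First &amp; <b>best</b> page</a>
<span class="snippet">Snippet for the <b>first</b> result.</span></div>
<div class="result"><a href="http://example.org/b">Second page</a>
<span class="snippet">Snippet for the second result.</span></div>
"""

# One pattern per result block, with named groups for each field.
# Non-greedy quantifiers keep each match inside a single result.
RESULT_RE = re.compile(
    r'<div class="result">\s*'
    r'<a href="(?P<url>[^"]+)">(?P<title>.*?)</a>\s*'
    r'<span class="snippet">(?P<snippet>.*?)</span>',
    re.DOTALL | re.IGNORECASE,
)

# Matches any remaining inline tag such as <b> or </b>.
TAG_RE = re.compile(r"<[^>]+>")


def clean(fragment):
    """Strip inline tags, then decode HTML entities."""
    return unescape(TAG_RE.sub("", fragment)).strip()


def extract_results(page):
    """Return one dict per matched result block, in page order."""
    return [
        {
            "url": m.group("url"),
            "title": clean(m.group("title")),
            "snippet": clean(m.group("snippet")),
        }
        for m in RESULT_RE.finditer(page)
    ]


results = extract_results(SAMPLE_PAGE)
for item in results:
    print(item["url"], "|", item["title"], "|", item["snippet"])
```

A pattern of this kind depends entirely on the page layout it was written for, so any change to the markup requires the expressions to be revised.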

Key words: Regular expressions; Extraction; Web page; Google
Received: 30 May 2005      Published: 25 September 2005
ZTFLH: G354.4; TP391.3
Corresponding Authors: Zhang Jian     E-mail: ehulh@163.com
About author: Zhang Jian, Ou Hong

Cite this article:

Zhang Jian, Ou Hong. Extracting the Content of Google Web Page with Regular Expressions. New Technology of Library and Information Service, 2005, 21(9): 50-53.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2005.09.12     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2005/V21/I9/50


[1] Dai Jianhua, Deng Yubin. Extracting Emotion-Cause Pairs Based on Emotional Dilation Gated CNN[J]. Data Analysis and Knowledge Discovery, 2020, 4(8): 98-106.
[2] Xia Tian. Extracting Key-phrases from Chinese Scholarly Papers[J]. Data Analysis and Knowledge Discovery, 2020, 4(7): 76-86.
[3] Li Chengliang, Zhao Zhongying, Li Chao, Qi Liang, Wen Yan. Extracting Product Properties with Dependency Relationship Embedding and Conditional Random Field[J]. Data Analysis and Knowledge Discovery, 2020, 4(5): 54-65.
[4] Hui Nie, Huan He. Identifying Implicit Features with Word Embedding[J]. Data Analysis and Knowledge Discovery, 2020, 4(1): 99-110.
[5] Gang Li, Huayang Zhou, Jin Mao, Sijing Chen. Classifying Social Media Users with Machine Learning[J]. Data Analysis and Knowledge Discovery, 2019, 3(8): 1-9.
[6] Mingzhu Sun, Jing Ma, Lingfei Qian. Extracting Keywords Based on Topic Structure and Word Diagram Iteration[J]. Data Analysis and Knowledge Discovery, 2019, 3(8): 68-76.
[7] Xiaofeng Li, Jing Ma, Chi Li, Hengmin Zhu. Identifying Commodity Names Based on XGBoost Model[J]. Data Analysis and Knowledge Discovery, 2019, 3(7): 34-41.
[8] Xiuxian Wen, Jian Xu. Research on Product Characteristics Extraction and Hedonic Price Based on User Comments[J]. Data Analysis and Knowledge Discovery, 2019, 3(7): 42-51.
[9] Qingtian Zeng, Xiaohui Hu, Chao Li. Extracting Keywords with Topic Embedding and Network Structure Analysis[J]. Data Analysis and Knowledge Discovery, 2019, 3(7): 52-60.
[10] Ruihua Qi, Junyi Zhou, Xu Guo, Caihong Liu. Extracting Book Review Topics with Knowledge Base[J]. Data Analysis and Knowledge Discovery, 2019, 3(6): 83-91.
[11] Jinzhu Zhang, Yiming Hu. Extracting Titles from Scientific References in Patents with Fusion of Representation Learning and Machine Learning[J]. Data Analysis and Knowledge Discovery, 2019, 3(5): 68-76.
[12] Yuemin Wu, Ganggui Ding, Bin Hu. Extracting Relationship of Agricultural Financial Texts with Attention Mechanism[J]. Data Analysis and Knowledge Discovery, 2019, 3(5): 86-92.
[13] Zhiqiang Liu, Yuncheng Du, Shuicai Shi. Extraction of Key Information in Web News Based on Improved Hidden Markov Model[J]. Data Analysis and Knowledge Discovery, 2019, 3(3): 120-128.
[14] Hongxia Xu, Chunwang Li. Review of Knowledge Extraction of Scientific Literature[J]. Data Analysis and Knowledge Discovery, 2019, 3(3): 14-24.
[15] Zhen Zhang, Jin Zeng. Extracting Keywords from User Comments: Case Study of Meituan[J]. Data Analysis and Knowledge Discovery, 2019, 3(3): 36-44.