New Technology of Library and Information Service  2008, Vol. 24 Issue (3): 51-54    DOI: 10.11925/infotech.1003-3513.2008.03.09
An Algorithm for Noise Reduction in Web Pages Based on a Group of Content-related Rules
Wang Jiandong1,2  Wang Jimin1  Tian Feijia1
1 (Department of Information Management, Peking University, Beijing 100871, China)
2 (Lianyungang Teacher's College Library, Lianyungang 222000, China)
Abstract  

This paper presents a new algorithm for eliminating noise in Web pages based on a group of content-related rules. First, we present an algorithm that peels off noise by iteratively comparing the tables at the same level of the page's table tree. Next, we present an algorithm that evaluates the topical similarity of anchor text to the content of the page. Because the new algorithm takes semantic factors of the pages into consideration to some extent, it achieves higher accuracy than purely rule-based algorithms while keeping time complexity fairly low. Experiments indicate that the algorithm is very effective at purifying large collections of Web pages.
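
The abstract lists Levenshtein distance as a key word but does not say where in the algorithm it is applied. As a rough illustration only, the following Python sketch computes the standard Levenshtein (edit) distance and a length-normalized similarity score of the kind that could be used to compare two text fragments, such as the text of sibling tables or an anchor text against page content. The function names `levenshtein` and `similarity` and the normalization by the longer string are assumptions made for this sketch, not the authors' method.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b (insert, delete, substitute)."""
    # Keep the shorter string as the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalized score in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest
```

Under this assumed scoring, two fragments that score close to 1.0 are near-duplicates, and a threshold on the score would have to be chosen empirically.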

Key words: Noise Reduction in Web Pages; Levenshtein Distance
Received: 27 November 2007      Published: 25 March 2008
Classification number: TP18
Corresponding Authors: Wang Jiandong     E-mail: ZS.Wagner@yahoo.com.cn
About authors: Wang Jiandong, Wang Jimin, Tian Feijia

Cite this article:

Wang Jiandong, Wang Jimin, Tian Feijia. An Algorithm for Noise Reduction in Web Pages Based on a Group of Content-related Rules. New Technology of Library and Information Service, 2008, 24(3): 51-54.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2008.03.09     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2008/V24/I3/51

