New Technology of Library and Information Service  2008, Vol. 24 Issue (3): 51-54    DOI: 10.11925/infotech.1003-3513.2008.03.09
An Algorithm for Noise Reduction in Web Pages Based on a Group of Content-related Rules
Wang Jiandong1,2  Wang Jimin1  Tian Feijia1
1(Department of Information Management, Peking University, Beijing 100871, China)
2(Lianyungang Teacher’s College Library, Lianyungang 222000, China)
This paper presents a new algorithm for eliminating noise in Web pages based on a group of content-related rules. First, we present an algorithm that peels off noise by iteratively comparing the tables on the same level of the page’s table tree. Next, we present an algorithm that evaluates the topical similarity between anchor text and the content of the page. Because the new algorithm takes the semantic features of pages into consideration to some extent, it achieves higher accuracy than purely rule-based algorithms while keeping a fairly low time complexity. Experiments indicate that the algorithm performs effectively when purifying large numbers of Web pages.

Key words: Noise Reduction in Web Pages; Levenshtein Distance
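
The key words name Levenshtein distance, and the abstract describes scoring how close anchor text is to the page content. The abstract does not say how the distance is applied, so the following is only a minimal sketch of the standard edit-distance computation plus one common way to normalize it into a similarity score. The function names `levenshtein` and `similarity` and the normalization by the longer string's length are assumptions made for illustration, not the authors' published method.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance.

    Counts the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest
```

Under these assumptions, a score such as `similarity(anchor_text, page_title)` could be compared against a threshold to decide whether a link looks topically related to the page; the choice of comparison target and threshold would have to come from the full paper.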
Received: 27 November 2007      Published: 25 March 2008


Corresponding Author: Wang Jiandong
About authors: Wang Jiandong, Wang Jimin, Tian Feijia

Wang Jiandong, Wang Jimin, Tian Feijia. An Algorithm for Noise Reduction in Web Pages Based on a Group of Content-related Rules. New Technology of Library and Information Service, 2008, 24(3): 51-54.

