%A Wang Jiandong,Wang Jimin,Tian Feijia
%T An Algorithm for Noise Reduction in Web Pages Based on a Group of Content-related Rules
%0 Journal Article
%D 2008
%J Data Analysis and Knowledge Discovery
%R 10.11925/infotech.1003-3513.2008.03.09
%P 51-54
%V 24
%N 3
%U {https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/abstract/article_481.shtml}
%8 2008-03-25
%X

This paper presents a new algorithm for eliminating noise in web pages based on a group of content-related rules. First, we present an algorithm that strips noise by iteratively comparing the tables at the same level of the page’s table tree. Next, we present an algorithm that evaluates the topical similarity of anchor text to the content of the page. Because the new algorithm takes semantic information about the pages into account to some extent, it achieves higher accuracy than purely rule-based algorithms while keeping time complexity fairly low. Experiments indicate that the algorithm is effective at cleaning large collections of web pages.
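
The abstract does not say how anchor text is scored against the page content, so the following is only a minimal sketch of one plausible approach, not the authors' method. It assumes a bag-of-words cosine similarity; the function names `tokenize`, `cosine_similarity`, and `filter_anchors`, and the `threshold` value are hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    freq_a, freq_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(freq_a[term] * freq_b[term] for term in freq_a)
    norm_a = math.sqrt(sum(count * count for count in freq_a.values()))
    norm_b = math.sqrt(sum(count * count for count in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text is treated as unrelated to everything.
    return dot / (norm_a * norm_b)


def filter_anchors(anchors, page_text, threshold=0.1):
    """Keep anchor texts scoring at least `threshold` against the page.

    The threshold is an assumed value, not one taken from the paper.
    """
    return [a for a in anchors if cosine_similarity(a, page_text) >= threshold]
```

Under these assumptions, an anchor that shares no terms with the page scores 0 and is treated as noise, while an anchor that repeats the page's vocabulary is kept. A practical version would likely need language-appropriate tokenization (for example, word segmentation for Chinese pages) and term weighting.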