This paper presents a new algorithm for eliminating noise in web pages based on a group of content-related rules. First, we present an algorithm that peels off noise by iteratively comparing the tables at the same level of the page's table tree. Next, we present an algorithm that evaluates the topical similarity between anchor text and the content of the page. Because the new algorithm takes the semantic features of the pages into account to some extent, it achieves higher accuracy than purely rule-based algorithms while keeping time complexity fairly low. Experiments indicate that the algorithm is effective when cleaning large collections of web pages.
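As an illustration of the second step, the following minimal sketch scores how closely a link's anchor text matches the topic of the page. The abstract does not state which similarity measure the authors use, so cosine similarity over term-frequency vectors is an assumed stand-in here, and the names `tokenize`, `anchor_topic_similarity`, `filter_links`, and the `threshold` value are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares no topic with anything
    return dot / (norm_a * norm_b)


def anchor_topic_similarity(anchor_text, page_text):
    """Assumed measure: cosine similarity of bag-of-words vectors."""
    return cosine_similarity(Counter(tokenize(anchor_text)),
                             Counter(tokenize(page_text)))


def filter_links(anchors, page_text, threshold=0.1):
    """Keep anchors whose similarity to the page reaches the threshold.

    The threshold is an illustrative value, not one taken from the paper.
    """
    return [a for a in anchors
            if anchor_topic_similarity(a, page_text) >= threshold]
```

Anchors that share no vocabulary with the page body score 0 and would be treated as noise candidates under this assumption. For Chinese pages the regular-expression tokenizer would have to be replaced by a word segmenter, since the text is not delimited by spaces.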
王建冬,王继民,田飞佳. 一种基于内容规则的网页去噪算法[J]. 现代图书情报技术, 2008, 24(3): 51-54.
Wang Jiandong, Wang Jimin, Tian Feijia. An Algorithm for Noise Reduction in Web Pages Based on a Group of Content-related Rules[J]. New Technology of Library and Information Service, 2008, 24(3): 51-54.