An Algorithm for Noise Reduction in Web Pages Based on a Group of Content-related Rules
Wang Jiandong 1,2, Wang Jimin 1, Tian Feijia 1
1 Department of Information Management, Peking University, Beijing 100871, China
2 Lianyungang Teacher's College Library, Lianyungang 222000, China
Abstract This paper presents a new algorithm for eliminating noise in web pages based on a group of content-related rules. First, we present an algorithm that peels off noise by iteratively comparing the tables on the same level of the page's table tree. Next, we present an algorithm that evaluates the topic similarity between anchor text and the content of the page. Because the new algorithm takes the semantic information of the pages into account to some extent, it achieves higher accuracy than purely rule-based algorithms while keeping time complexity fairly low. Experiments indicate that the algorithm is very effective at purifying large collections of web pages.
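The second step, scoring how closely an anchor text matches the topic of its page, can be illustrated with a minimal sketch. The abstract does not say which similarity measure the authors use, so this example assumes cosine similarity over simple word-count vectors. The function names (`term_vector`, `anchor_topic_similarity`, `filter_links`) and the `threshold` value are hypothetical and chosen only for illustration. The tokenizer splits on word characters, so Chinese pages would need a word segmenter first.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a naive bag-of-words vector: lowercase, then count word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares no topic with anything
    return dot / (norm_a * norm_b)


def anchor_topic_similarity(anchor_text, page_text):
    """Score an anchor text against the page content (assumed measure)."""
    return cosine_similarity(term_vector(anchor_text), term_vector(page_text))


def filter_links(anchors, page_text, threshold=0.1):
    """Keep anchors whose score reaches the (hypothetical) threshold."""
    return [a for a in anchors
            if anchor_topic_similarity(a, page_text) >= threshold]
```

Under these assumptions, an anchor that shares no terms with the page scores 0 and would be treated as noise, while an anchor built from the page's own vocabulary scores well above the threshold.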
Received: 27 November 2007
Published: 25 March 2008
Corresponding author:
Wang Jiandong
E-mail: ZS.Wagner@yahoo.com.cn
About authors: Wang Jiandong, Wang Jimin, Tian Feijia