Research on Identifying Maximal Meaningful Node from Web Page

doi:10.11925/infotech.1003-3513.2009.10.04

New Technology of Library and Information Service

2009, Issue (10): 22-27 https://doi.org/10.11925/infotech.1003-3513.2009.10.04

Digital Library



Li Yazi, Fang An, Chen Wei, Zhu Feng

(Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100020, China)


Abstract

Building on an analysis of existing research on, and implementations of, algorithms for identifying the maximal meaningful node in Web pages, this paper adopts and improves the style tree: it compresses multiple similar pages into a single style tree, computes the importance of each node to find the maximal meaningful node, and presents a worked example.

Key words: Style tree, Maximal meaningful node, Node importance, DOM tree
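The abstract names two steps: compressing several similar pages into a style tree, and scoring nodes by importance. The following is a minimal Python sketch of that idea, not the paper's implementation. It assumes that a page's DOM is given as nested `(tag, [children])` tuples, that importance is the entropy of an element's child-tag layouts across pages (a measure in the spirit of the style-tree literature; the paper's own formula is not stated on this page), and that only elements with children are scored. All names (`ElementNode`, `StyleNode`, `merge`, `node_importance`, `ranked`) are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class StyleNode:
    count: int = 0                                # pages using this child-tag sequence
    children: list = field(default_factory=list)  # one ElementNode per child tag


@dataclass
class ElementNode:
    tag: str
    pages: int = 0                                # pages in which this element occurs
    styles: dict = field(default_factory=dict)    # child-tag tuple -> StyleNode


def merge(element, dom):
    """Fold one DOM subtree, given as (tag, [children]), into the style tree."""
    _tag, kids = dom
    element.pages += 1
    if not kids:
        return
    key = tuple(kid[0] for kid in kids)
    style = element.styles.setdefault(key, StyleNode())
    if not style.children:
        style.children = [ElementNode(t) for t in key]
    style.count += 1
    for child, kid in zip(style.children, kids):
        merge(child, kid)


def node_importance(element):
    """Entropy of the element's layout distribution, log base m (assumed measure)."""
    m = element.pages
    if m <= 1:
        return 1.0  # assumption: an element seen on a single page counts as fully important
    return sum(-(s.count / m) * math.log(s.count / m, m)
               for s in element.styles.values())


def ranked(element, path=""):
    """Yield (path, importance) for every element that has children, in pre-order."""
    here = f"{path}/{element.tag}"
    if element.styles:
        yield here, node_importance(element)
    for style in element.styles.values():
        for child in style.children:
            yield from ranked(child, here)


def leaf(tag):
    return (tag, [])


# Three hypothetical pages: identical navigation, differing main content.
pages = [
    ("body", [("nav", [leaf("a"), leaf("a")]), ("div", [leaf("p")])]),
    ("body", [("nav", [leaf("a"), leaf("a")]), ("div", [leaf("p"), leaf("p"), leaf("img")])]),
    ("body", [("nav", [leaf("a"), leaf("a")]), ("div", [leaf("p"), leaf("p")])]),
]

root = ElementNode("body")
for page in pages:
    merge(root, page)

scores = dict(ranked(root))
# Stand-in selection rule (an assumption): take the highest-scoring element.
best = max(scores, key=scores.get)
```

In this toy input the navigation block has the same layout on every page and scores near zero, while the `div` whose layout differs on each page scores highest. How the paper itself selects the maximal meaningful node from these scores is not described in the abstract, so the final `max` is only a placeholder.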

Received: 2009-08-31 Published: 2009-10-25

G250

Corresponding author: Li Yazi E-mail: 8982632@163.com

About the authors: Li Yazi, Fang An, Chen Wei, Zhu Feng

Cite this article:

Li Yazi, Fang An, Chen Wei, Zhu Feng. Research on Identifying Maximal Meaningful Node from Web Page[J]. New Technology of Library and Information Service, 2009, (10): 22-27.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2009.10.04 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2009/V/I10/22


