This paper studies an algorithm for identifying the maximal meaningful node of a Web page and describes its implementation. The algorithm builds on and improves the style tree, computing the importance of each node to find the maximal meaningful node. Finally, an example is given.
李亚子,方安,陈薇,朱峰. Web页面最大有意义节点发现算法研究[J]. 现代图书情报技术, 2009, (10): 22-27.
Li Yazi, Fang An, Chen Wei, Zhu Feng. Research on Identifying Maximal Meaningful Node from Web Page[J]. New Technology of Library and Information Service, 2009, (10): 22-27.