The Application and Implementation of Tree Edit Distance in Web Information Extraction*

New Technology of Library and Information Service

2010, Vol. 26, Issue (5): 29-34

https://doi.org/10.11925/infotech.1003-3513.2010.05.06

Knowledge Organization and Knowledge Management

Nie Hui, Huang Guipeng

(School of Information Management, Sun Yat-Sen University, Guangzhou 510275, China)

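Abstract

This paper introduces the concept of edit distance, discusses how to construct a tag tree, and uses a tag-tree matching algorithm to quantify the structural similarity of Web pages. The algorithm is applied to Web information extraction: sample pages are first coarsely clustered by a URL similarity algorithm and then finely clustered by a tree similarity matching algorithm, which yields the template pages. On the basis of the template pages, the structural similarity algorithm is applied again, combined with template-based extraction rules, to extract Web pages automatically. Experiments show that introducing the algorithm effectively improves the wrapper's extraction precision and its degree of semi-automation.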
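To make the idea of tag-tree matching concrete, the following is a minimal sketch and not the authors' implementation. It assumes Simple Tree Matching (a restricted top-down mapping, used here as a stand-in for a full tree edit distance) and a Dice-style normalization; the names `Node`, `TagTreeBuilder`, `build_tag_tree`, `simple_tree_matching`, and `structural_similarity` are hypothetical.

```python
from html.parser import HTMLParser

# HTML elements that never have a closing tag (assumed subset).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One node of the tag tree; text and attributes are ignored."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []

    def size(self):
        return 1 + sum(child.size() for child in self.children)


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree under a synthetic '#document' root."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tag: add as a leaf, do not open a new level.
        self.stack[-1].children.append(Node(tag))

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def simple_tree_matching(a, b):
    """Size of the largest top-down, order-preserving mapping between a and b."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i - 1][j],
                table[i][j - 1],
                table[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return table[m][n] + 1


def structural_similarity(a, b):
    """Assumed normalization: 2 * matched nodes / total nodes, in [0, 1]."""
    return 2 * simple_tree_matching(a, b) / (a.size() + b.size())
```

Under these assumptions, two pages generated from the same template should score close to 1, while pages with unrelated layouts score much lower; the threshold used to separate clusters would have to be tuned on sample pages.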

Keywords: Web information extraction, tree edit distance, structural similarity, Web clustering

Abstract:

This paper introduces the concept of edit distance and discusses how to construct a tag tree and how to calculate the similarity of two Web pages with a tree-matching algorithm. First, the pages are coarsely clustered according to their URL similarity and then further classified by the tree-matching algorithm. Based on the template pages obtained by clustering, Web information is extracted automatically by combining the Web structure similarity algorithm with extraction rules. Experiments verify the feasibility and efficiency of the algorithm in the system.

Key words: Web information extraction, Tree edit distance, Structural similarity, Web clustering, Tree-matching algorithm
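The coarse clustering step by URL similarity can be illustrated with a short sketch. The abstract does not define the URL similarity measure, so everything here is an assumption: `url_tokens`, `url_similarity`, and `coarse_cluster` are hypothetical names, the measure is a simple positional comparison of host and path segments with digit runs masked, and the greedy grouping with a threshold of 0.8 is chosen only for illustration.

```python
import re
from urllib.parse import urlparse


def url_tokens(url):
    """Host plus path segments, with digit runs masked (assumed tokenization)."""
    parsed = urlparse(url)
    segments = [re.sub(r"\d+", "#", s) for s in parsed.path.split("/") if s]
    return [parsed.netloc.lower()] + segments


def url_similarity(u, v):
    """Fraction of positions at which the two token lists agree."""
    a, b = url_tokens(u), url_tokens(v)
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / max(len(a), len(b))


def coarse_cluster(urls, threshold=0.8):
    """Greedy grouping: join the first cluster whose first URL is similar enough."""
    clusters = []
    for url in urls:
        for cluster in clusters:
            if url_similarity(url, cluster[0]) >= threshold:
                cluster.append(url)
                break
        else:
            clusters.append([url])
    return clusters
```

In a pipeline like the one the abstract describes, each coarse cluster would then be refined by comparing the tag trees of its pages, so that the more expensive tree matching runs only within groups of plausibly related pages.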

Received: 2010-03-10 Published: 2010-05-25

TP311

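Funding:

*This paper is one of the research outcomes of the 2008 Ministry of Education Humanities and Social Sciences Research Project "Research on Knowledge Acquisition for Digital Libraries Based on Information Extraction" (Project No. 08JC870013) and the 2009 Sun Yat-Sen University Young Teachers Cultivation Project "Research on Implementation Techniques for an Intelligent Deep Search Engine" (Project No. 2000-3161101).

Corresponding author: Nie Hui E-mail: issnh@mail.sysu.edu.cn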

Cite this article:

Nie Hui, Huang Guipeng. The Application and Implementation of Tree Edit Distance in Web Information Extraction[J]. New Technology of Library and Information Service, 2010, 26(5): 29-34.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2010.05.06 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2010/V26/I5/29
