The Application and Implementation of Tree Edit Distance in Web Information Extraction
Nie Hui, Huang Guipeng
(School of Information Management, Sun Yat-Sen University, Guangzhou 510275, China)
Abstract This paper introduces the concept of edit distance and discusses how to construct a tag tree and how to calculate the similarity of two Web pages with a tree-matching algorithm. Pages are first roughly clustered according to their URL similarity and then further classified with the tree-matching algorithm. Based on the model page obtained from clustering, Web information is extracted automatically by combining the Web structure similarity algorithm with extraction rules. Tests verify the feasibility and efficiency of the algorithm in the system.
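The abstract does not spell out the tree-matching algorithm, so the following is only a minimal sketch of the general idea, assuming the well-known Simple Tree Matching recurrence as a stand-in. The names Node, TagTreeBuilder, build_tree, simple_tree_matching and similarity are hypothetical, and the normalization by the average tree size is likewise an assumption rather than the authors' formula.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the tag tree; text and attributes are ignored."""
    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree (element names only) from an HTML string."""
    def __init__(self):
        super().__init__()
        self.root = Node("#document")  # artificial root (assumption)
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: add a leaf, do not open a scope.
        self.stack[-1].children.append(Node(tag))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def build_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def size(node):
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(size(child) for child in node.children)


def simple_tree_matching(a, b):
    """Size of the maximum top-down, order-preserving matching of a and b."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # table[i][j]: best matching between the first i children of a
    # and the first j children of b.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i - 1][j],
                table[i][j - 1],
                table[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return table[m][n] + 1  # +1 for the matched roots


def similarity(a, b):
    """Matching size divided by the average tree size (assumed normalization)."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


page_a = build_tree("<html><body><div><h1>T</h1><p>x</p></div></body></html>")
page_b = build_tree("<html><body><div><h1>U</h1><p>y</p><p>z</p></div></body></html>")
print(similarity(page_a, page_b))
```

In a pipeline like the one described, a score of this kind could be compared against a threshold to decide whether a page belongs to the same cluster as the model page. The threshold value would have to be chosen empirically.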
Received: 10 March 2010
Published: 25 May 2010
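Fund: This paper is one of the research outcomes of the 2008 Ministry of Education Humanities and Social Sciences Research Project "Research on Knowledge Acquisition for Digital Libraries Based on Information Extraction" (Project No. 08JC870013) and the 2009 Sun Yat-Sen University Young Teacher Cultivation Project "Research on Implementation Technologies for an Intelligent Deep Search Engine" (Project No. 2000-3161101).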