New Technology of Library and Information Service  2010, Vol. 26 Issue (5): 29-34    DOI: 10.11925/infotech.1003-3513.2010.05.06
The Application and Implementation of Tree Edit Distance in Web Information Extraction
Nie Hui, Huang Guipeng
(School of Information Management, Sun Yat-Sen University, Guangzhou 510275, China)
Abstract  

This paper introduces the concept of edit distance and discusses how to construct a tag tree and compute the similarity of two Web pages with a tree-matching algorithm. Pages are first roughly clustered by URL similarity and then further classified with the tree-matching algorithm. Based on the model page obtained from clustering, Web information is extracted automatically by combining the Web structure similarity algorithm with extraction rules. Tests verify the feasibility and efficiency of the algorithm in the system.
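
The structural comparison described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which tree-matching variant the authors use, so the sketch swaps in a simple top-down tree-matching dynamic program as a stand-in; the names `Node`, `build_tag_tree`, `simple_tree_matching` and `structural_similarity`, and the Dice-style normalization, are assumptions made for illustration and are not taken from the paper. The URL-based pre-clustering step and the extraction rules are not shown.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they become leaves of the tag tree.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the tag tree (text content is ignored)."""
    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree under an artificial '#root' node."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/> adds a leaf only.
        self.stack[-1].children.append(Node(tag))

    def handle_endtag(self, tag):
        # Close the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tree_size(node):
    return 1 + sum(tree_size(child) for child in node.children)


def simple_tree_matching(a, b):
    """Size of the largest top-down, order-preserving node matching."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = dp[i - 1][j - 1] + simple_tree_matching(
                a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], paired)
    return dp[m][n] + 1  # +1 for the matched roots


def structural_similarity(a, b):
    """Assumed normalization: 2 * matched nodes / total nodes, in [0, 1]."""
    return 2 * simple_tree_matching(a, b) / (tree_size(a) + tree_size(b))


if __name__ == "__main__":
    page_a = build_tag_tree("<html><body><div><p>a</p><p>b</p></div></body></html>")
    page_b = build_tag_tree("<html><body><table><tr><td>x</td></tr></table></body></html>")
    print(structural_similarity(page_a, page_b))
```

Two pages generated from the same template differ mainly in text, so their tag trees match almost node for node and the score approaches 1; pages with different layouts share only the outer `html` and `body` nodes and score much lower. A threshold on this score is one plausible way to refine the clusters obtained from URL similarity.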

Key words: Web information extraction; Tree edit distance; Structural similarity; Web clustering; Tree-matching algorithm
Received: 10 March 2010      Published: 25 May 2010
CLC Number: TP311

Fund:

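*This paper is one of the research outcomes of the 2008 Ministry of Education Humanities and Social Sciences Research Project "Research on Knowledge Acquisition for Digital Libraries Based on Information Extraction" (Project No. 08JC870013) and the 2009 Sun Yat-Sen University Young Teacher Cultivation Project "Research on Implementation Technologies for an Intelligent Deep Search Engine" (Project No. 2000-3161101).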

Cite this article:

Nie Hui, Huang Guipeng. The Application and Implementation of Tree Edit Distance in Web Information Extraction. New Technology of Library and Information Service, 2010, 26(5): 29-34.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2010.05.06     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2010/V26/I5/29

