New Technology of Library and Information Service  2008, Vol. 24 Issue (3): 24-31    DOI: 10.11925/infotech.1003-3513.2008.03.05
A Survey of the Research on Information Extraction over Web Tables
Zhao Hong, Xiao Hong, Xue Dejun, Shi Qinghui
(China Academic Journal (CD) Publishing House, Beijing 100084, China)
Abstract  

This paper first introduces the characteristics and structure of Web tables and describes the process of extracting information from them. It then analyses four key technologies: Web table detection, Web table structure recognition, Web table interpretation, and the presentation of table extraction results. Finally, it reviews applications of this research, points out problems in current work, and outlines prospects for future research.

Key words: Web Tables; Information Extraction; Web Table Detection; Web Table Structure Recognition; Web Table Interpretation
Received: 11 December 2007      Published: 25 March 2008
CLC Number: TP391
Corresponding Authors: Zhao Hong     E-mail: zhaohong860112@163.com
About authors: Zhao Hong, Xiao Hong, Xue Dejun, Shi Qinghui

Cite this article:

Zhao Hong, Xiao Hong, Xue Dejun, Shi Qinghui. A Survey of the Research on Information Extraction over Web Tables. New Technology of Library and Information Service, 2008, 24(3): 24-31.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2008.03.05     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2008/V24/I3/24

