This paper first introduces the characteristics and structure of Web tables and describes the process of extracting information from them. It then analyses four key technologies: Web table detection, Web table structure recognition, Web table interpretation, and the presentation of table extraction results. Finally, it reviews applications of this research, points out problems in current work, and outlines prospects for future development.
Zhao Hong, Xiao Hong, Xue Dejun, Shi Qinghui. A Survey of the Research on Information Extraction over Web Tables[J]. New Technology of Library and Information Service, 2008, 24(3): 24-31.