New Technology of Library and Information Service, 2008, Vol. 24, Issue 3: 24-31. https://doi.org/10.11925/infotech.1003-3513.2008.03.05
Section: Knowledge Organization and Knowledge Management
A Survey of the Research on Information Extraction over Web Tables (original title: Web表格信息抽取研究综述)
Zhao Hong, Xiao Hong, Xue Dejun, Shi Qinghui
(China Academic Journal (CD) Publishing House, Beijing 100084, China)
Abstract

This paper introduces the characteristics and structure of Web tables and describes the process of extracting information from them. It then analyzes four key techniques: Web table detection, Web table structure recognition, Web table interpretation (content integration), and the representation of extraction results. It also reviews applications of Web table information extraction. Finally, it identifies shortcomings of current research in China and abroad and outlines directions for future work.

Keywords: Web tables; Information extraction; Web table detection; Web table structure recognition; Web table interpretation
Received: 2007-12-11      Published: 2008-03-25
CLC number: TP391
Corresponding author: Zhao Hong     E-mail: zhaohong860112@163.com
About the authors: Zhao Hong, Xiao Hong, Xue Dejun, Shi Qinghui
Cite this article:
Zhao Hong, Xiao Hong, Xue Dejun, Shi Qinghui. A Survey of the Research on Information Extraction over Web Tables[J]. New Technology of Library and Information Service, 2008, 24(3): 24-31.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2008.03.05      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2008/V24/I3/24

