The Research and Realization of Technology Converting HTML to XML

doi:10.11925/infotech.1003-3513.2003.05.21

New Technology of Library and Information Service (现代图书情报技术), 2003, Vol. 19, Issue 5: 66-67

Network Resources and Construction

Chen Yanmei (陈艳梅)¹, Zhang Bin (张斌)²

¹(Northeastern University Library, Shenyang 110004, China)
²(Information Engineering Institute of Northeastern University, Shenyang 110004, China)

Abstract:

Most information on the web is written in HTML. HTML cannot meet many of the web's requirements, because it is only a language for presenting information to a browser and cannot express the data itself. The web is not yet a well-organized repository of structured documents but rather a conglomerate of volatile HTML pages, from which structure has to be extracted. Information from web sources therefore needs to be stored and made accessible in a structured way. XML, a metalanguage, together with its various extensions such as data models and query languages, is one way to achieve this and can compensate for many of HTML's shortcomings. Future web pages are expected to use well-structured XML, but the present is a transitional stage, so a method for converting HTML to XML is needed in order to make better use of web resources. This paper proposes such a method and presents the design and implementation of an HTML-to-XML conversion system.

Key words: Web wrapper, Information extraction, HTML parsing, HTML to XML conversion
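As a rough illustration of the kind of conversion the abstract describes, the following minimal sketch turns loosely written HTML into well-formed XML using only Python's standard library. It is not the authors' system, whose design is not given on this page; the names `HtmlToXml`, `html_to_xml`, `VOID_TAGS`, and the wrapping `document` element are assumptions made for the example. It only repairs structure (unclosed and stray tags) and does not perform the wrapper-based information extraction named in the key words.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag (partial list, assumed for the sketch).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class HtmlToXml(HTMLParser):
    """Builds an ElementTree from possibly sloppy HTML (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # hypothetical wrapper element
        self.stack = [self.root]            # currently open elements

    def handle_starttag(self, tag, attrs):
        # Attributes without a value (e.g. <input disabled>) get an empty string.
        clean = {name: (value or "") for name, value in attrs}
        node = ET.SubElement(self.stack[-1], tag, clean)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, implicitly closing
        # anything opened inside it; stray end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if not data.strip():
            return  # drop whitespace-only text between tags
        parent = self.stack[-1]
        if len(parent):
            # Text that follows a child element is stored as that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(html: str) -> str:
    """Return a well-formed XML string for the given HTML fragment."""
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return ET.tostring(parser.root, encoding="unicode")


if __name__ == "__main__":
    # The <b> element is never closed and <br> has no end tag.
    print(html_to_xml("<p>Price: <b>10<br>USD</p>"))
```

The open-element stack is what lets the sketch emit balanced tags even when the input omits them. A conversion system of the kind the paper describes would presumably go further and map page-specific HTML regions to meaningful XML element names.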

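Received: 2003-03-19    Published: 2003-10-25

Chinese Library Classification (ZTFLH): TP39

Corresponding authors: Chen Yanmei, Zhang Bin

About the authors: Chen Yanmei, Zhang Bin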

Cite this article:

Chen Yanmei, Zhang Bin. The Research and Realization of Technology Converting HTML to XML[J]. New Technology of Library and Information Service, 2003, 19(5): 66-67.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2003.05.21 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2003/V19/I5/66
