A General Approach to Extracting Topical Information in HTML Pages
Xu Wen, Du Yuncheng, Li Yuqin, Shi Shuicai
(Chinese Information Processing Research Center, Beijing Information Science & Technology University, Beijing 100101, China)
Abstract: This paper studies how to extract topical content from Web pages built on different kinds of templates and introduces a new DOM-based extraction method. The approach transforms an HTML document into a DOM tree, extracts the topical content from that tree, and deletes topic-unrelated content. The result is an HTML document that contains only the topical information.
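The abstract does not state which rules decide whether a node is topical, so the following is only a minimal sketch of the general idea (parse HTML into a tree, prune topic-unrelated subtrees, keep the rest). The pruning heuristic here, dropping known noise tags and block elements whose text is mostly link text, is an assumption for illustration and not the authors' method. The names `NOISE_TAGS`, `BLOCK_TAGS`, `LINK_RATIO_LIMIT`, `Node`, `TreeBuilder`, `prune`, and `extract_topical_text` are all hypothetical, and the tree is a simplified stand-in for a full DOM.

```python
from html.parser import HTMLParser

# Hypothetical settings for this sketch; none of them come from the paper.
NOISE_TAGS = {"script", "style", "nav", "footer", "form", "iframe"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
BLOCK_TAGS = {"div", "td", "ul", "table"}
LINK_RATIO_LIMIT = 0.5  # drop a block if more than half of its text is link text


class Node:
    """A simplified element node; children are Node objects or plain strings."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

    def text(self):
        return "".join(c if isinstance(c, str) else c.text() for c in self.children)

    def link_text_len(self):
        if self.tag == "a":
            return len(self.text().strip())
        return sum(c.link_text_len() for c in self.children if isinstance(c, Node))


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def prune(node):
    """Remove subtrees that the assumed heuristic treats as topic-unrelated."""
    kept = []
    for child in node.children:
        if isinstance(child, str):
            kept.append(child)
            continue
        if child.tag in NOISE_TAGS:
            continue
        prune(child)
        total = len(child.text().strip())
        if (child.tag in BLOCK_TAGS and total > 0
                and child.link_text_len() / total > LINK_RATIO_LIMIT):
            continue
        kept.append(child)
    node.children = kept


def extract_topical_text(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    # Collapse whitespace in the text that survives pruning.
    return " ".join(builder.root.text().split())


sample = (
    "<html><body>"
    "<div><a href='/a'>Home</a> <a href='/b'>News</a></div>"
    "<div><p>DOM trees make topical content easy to isolate.</p></div>"
    "<script>var x = 1;</script>"
    "</body></html>"
)
print(extract_topical_text(sample))
```

In this sketch the navigation block is dropped because nearly all of its text sits inside links, and the script is dropped by tag name. A production extractor would more likely use a tolerant DOM parser and tuned, template-aware rules.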
Received: 09 October 2006
Published: 25 January 2007
Corresponding Authors:
Xu Wen
E-mail: xu.wen@trs.com.cn