This paper studies how to extract topical content from Web pages built on different kinds of templates, and introduces a new DOM-based extraction method. The approach parses each HTML document into a DOM tree, extracts the topical content from that tree, and deletes the topic-unrelated content. The result is an HTML document that contains only the topic information.
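The pipeline described above (parse to a tree, prune unrelated subtrees, keep the rest) can be illustrated with a minimal Python sketch. The abstract does not state which rules decide that a subtree is topic-unrelated, so the tag blacklist `NOISE_TAGS`, the container set `BLOCK_TAGS`, the link-density threshold `max_link_ratio`, and all function and class names below are assumptions made for illustration, not the authors' method. The sketch also returns plain text rather than a pruned HTML document.

```python
from html.parser import HTMLParser

# Assumed heuristics (not from the paper): tags treated as noise outright,
# and container tags that are dropped when most of their text is link text.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
NOISE_TAGS = {"script", "style", "nav", "footer", "form", "iframe"}
BLOCK_TAGS = {"div", "ul", "ol", "table", "section", "aside"}


class Node:
    """A simplified DOM node: a tag plus ordered children (Nodes or strings)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree from HTML using only the stdlib."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never receive children
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray closes.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def text_len(node, inside_link=False):
    """Return (total text chars, text chars inside <a>) for a subtree."""
    total = linked = 0
    for child in node.children:
        if isinstance(child, str):
            total += len(child)
            linked += len(child) if inside_link else 0
        else:
            t, l = text_len(child, inside_link or child.tag == "a")
            total += t
            linked += l
    return total, linked


def prune(node, max_link_ratio=0.5):
    """Delete noise tags and link-dominated containers, in place."""
    kept = []
    for child in node.children:
        if isinstance(child, str):
            kept.append(child)
            continue
        if child.tag in NOISE_TAGS:
            continue
        if child.tag in BLOCK_TAGS:
            total, linked = text_len(child)
            if total and linked / total > max_link_ratio:
                continue
        prune(child, max_link_ratio)
        kept.append(child)
    node.children = kept


def text_of(node):
    """Concatenate the text that remains in a subtree."""
    parts = [c if isinstance(c, str) else text_of(c) for c in node.children]
    return " ".join(p for p in parts if p)


def extract_topic_text(html, max_link_ratio=0.5):
    """Parse HTML, prune assumed-noise subtrees, and return remaining text."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root, max_link_ratio)
    return text_of(builder.root)
```

Calling `extract_topic_text(page_source)` on a page with a link-heavy menu and a text-heavy article body should keep the article text and drop the menu. The threshold is a tuning knob; real templates would likely need additional rules beyond link density.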
Xu Wen, Du Yuncheng, Li Yuqin, Shi Shuicai. A General Approach to Extracting Topical Information in HTML Pages [J]. New Technology of Library and Information Service, 2007, 2(1): 40-43.