New Technology of Library and Information Service  2007, Vol. 2 Issue (1): 40-43    DOI: 10.11925/infotech.1003-3513.2007.01.10
A General Approach to Extracting Topical Information in HTML Pages
Xu Wen, Du Yuncheng, Li Yuqin, Shi Shuicai
(Chinese Information Processing Research Center, Beijing Information Science & Technology University, Beijing 100101, China)
Abstract  

This paper studies how to extract topical content from Web pages built on different kinds of templates and introduces a new DOM-based extraction method. The approach transforms an HTML document into a DOM tree, then extracts the topical content and deletes the topic-unrelated content. The result is an HTML document that contains only the topical information.
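
The general idea of the abstract (parse HTML into a tree, drop topic-unrelated subtrees, keep the rest) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not say how topical and unrelated nodes are told apart, so a simple link-density heuristic is swapped in here purely as an assumption. The names `Node`, `TreeBuilder`, `prune`, `extract_topical_text`, the tag sets, and the `max_link_ratio` threshold are all hypothetical, and the sketch returns plain text rather than a cleaned HTML document.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
# Assumption: these containers never hold topical content.
NOISE_TAGS = {"script", "style", "nav", "footer", "form"}
# Assumption: link density is only checked on block-level containers.
BLOCK_TAGS = {"div", "section", "ul", "ol", "table", "tr", "td", "p"}


class Node:
    """A minimal DOM-like node: an element with children, or a text leaf."""

    def __init__(self, tag=None, text="", parent=None):
        self.tag = tag          # None marks a text node
        self.text = text
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simple tree from HTML; assumes roughly well-nested markup."""

    def __init__(self):
        super().__init__()
        self.root = Node(tag="#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag=tag, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(Node(text=data.strip(), parent=self.current))


def text_len(node):
    """Total number of text characters under a node."""
    if node.tag is None:
        return len(node.text)
    return sum(text_len(child) for child in node.children)


def link_text_len(node):
    """Number of text characters under a node that sit inside <a> elements."""
    if node.tag == "a":
        return text_len(node)
    return sum(link_text_len(child) for child in node.children)


def prune(node, max_link_ratio=0.5):
    """Remove noise containers and link-heavy blocks, in place."""
    kept = []
    for child in node.children:
        if child.tag in NOISE_TAGS:
            continue
        if child.tag in BLOCK_TAGS:
            total = text_len(child)
            if total > 0 and link_text_len(child) / total > max_link_ratio:
                continue  # mostly anchor text: treated as navigation
        kept.append(child)
    node.children = kept
    for child in kept:
        prune(child, max_link_ratio)


def collect_text(node):
    """Gather the remaining text leaves in document order."""
    if node.tag is None:
        return [node.text]
    parts = []
    for child in node.children:
        parts.extend(collect_text(child))
    return parts


def extract_topical_text(html, max_link_ratio=0.5):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root, max_link_ratio)
    return " ".join(collect_text(builder.root))
```

Calling `extract_topical_text(page_source)` on a page with a link-only menu block and a text-heavy article block would drop the menu and keep the article text. A real system would need a tolerant HTML parser and a better relevance measure than link density.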

Key words: DOM; Information extraction; Partition; Correlativity
Received: 09 October 2006      Published: 25 January 2007
CLC number: TP391
Corresponding Author: Xu Wen     E-mail: xu.wen@trs.com.cn
About authors: Xu Wen, Du Yuncheng, Li Yuqin, Shi Shuicai

Cite this article:

Xu Wen, Du Yuncheng, Li Yuqin, Shi Shuicai. A General Approach to Extracting Topical Information in HTML Pages. New Technology of Library and Information Service, 2007, 2(1): 40-43.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2007.01.10     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2007/V2/I1/40

