New Technology of Library and Information Service  2009, Vol. 25 Issue (12): 52-56    DOI: 10.11925/infotech.1003-3513.2009.12.10
Web Archive Content Extracted on Feature Orienting and Boarder Forecasting
Shen Jinzhi1   Kou Wenbo2   Tian Chengeng3
1 (Department of Information and Management, Huazhong Normal University, Wuhan 430079, China)
2 (International School of Software, Wuhan University, Wuhan 430072, China)
3 (School of Mathematics and Statistics, Wuhan University, Wuhan 430072, China)
Abstract  

This paper proposes a Web page extraction method, based on feature orientation and border forecasting, for extracting the effective content of Web archives at high speed. Two tools, ROST CM and ROST Text Extractor, were developed to build the training data set and test the algorithm. Theory and experiment show that the algorithm is suitable for Simplified Chinese, Traditional Chinese and English Web pages, and that it adapts well to the management of news and blog Web archives.

Key words: Web archive; Archive curator; Content extract; Information extract; Webpage analysis
Received: 17 November 2009      Published: 25 December 2009
ZTFLH: TP393

 
Corresponding Authors: Mike Washington     E-mail: 1047469889@qq.com
About author: Shen Jinzhi, Kou Wenbo, Tian Chengeng

Cite this article:

Shen Jinzhi, Kou Wenbo, Tian Chengeng. Web Archive Content Extracted on Feature Orienting and Boarder Forecasting. New Technology of Library and Information Service, 2009, 25(12): 52-56.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2009.12.10     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2009/V25/I12/52

