New Technology of Library and Information Service  2009, Vol. 25 Issue (12): 52-56    DOI: 10.11925/infotech.1003-3513.2009.12.10
Web Archive Content Extracted on Feature Orienting and Boarder Forecasting
Shen Jinzhi1, Kou Wenbo2, Tian Chengeng3
1(Department of Information and Management, Huazhong Normal University, Wuhan 430079, China)
2(International School of Software, Wuhan University, Wuhan 430072, China)
3(School of Mathematics and Statistics, Wuhan University, Wuhan 430072, China)
This paper proposes a Web page extraction method based on feature orienting and border forecasting, aimed at extracting the effective content of Web archives at high speed. Two tools, ROST CM and ROST Text Extractor, were developed to build the training data set and test the algorithm. Theory and experiment show that the algorithm is suitable for Simplified Chinese, Traditional Chinese, and English Web pages, and that it adapts well to the management of news and blog Web archives.

Key words: Web archive; Archive curator; Content extract; Information extract; Webpage analysis
Received: 17 November 2009      Published: 25 December 2009


About authors: Shen Jinzhi, Kou Wenbo, Tian Chengeng

Shen Jinzhi, Kou Wenbo, Tian Chengeng. Web Archive Content Extracted on Feature Orienting and Boarder Forecasting. New Technology of Library and Information Service, 2009, 25(12): 52-56.

