Web Archive Content Extracted on Feature Orienting and Boarder Forecasting
Shen Jinzhi1, Kou Wenbo2, Tian Chengeng3
1(Department of Information and Management, Huazhong Normal University, Wuhan 430079, China)
2(International School of Software, Wuhan University, Wuhan 430072, China)
3(School of Mathematics and Statistics, Wuhan University, Wuhan 430072, China)
|
|
Abstract This paper proposes a Web page extraction method, based on feature orientation and border forecasting, for extracting the effective content of Web archives at high speed. Two tools, ROST CM and ROST Text Extractor, were developed to build the training data set and to test the algorithm. Theory and experiment show that the algorithm is suitable for Simplified Chinese, Traditional Chinese, and English Web pages, and that it adapts well to the management of news and blog Web archives.
|
Received: 17 November 2009
Published: 25 December 2009
|
|
Corresponding Authors:
Mike Washington
E-mail: 1047469889@qq.com
|
About authors: Shen Jinzhi, Kou Wenbo, Tian Chengeng