Building on a study of feature-code-based algorithms for detecting duplicated Web pages, this paper proposes a duplicate detection algorithm for meta search engines that is based on the keywords the user submits in a query. The main steps of the algorithm are introduced, and an experiment tests the algorithm and verifies its validity.
Xie Hui, Qin Jie, Hu Shuangshuang. The Study on the Duplicated Web Pages Detection Algorithm Based on the Keyword from User’s Submission[J]. New Technology of Library and Information Service, 2008, 24(7): 43-46.