Provided that a small error rate is acceptable, the Bloom Filter and its improved variants can be used to filter out pages with duplicate URLs by hashing the URLs. Experiments show that satisfactory results can be achieved by tuning its parameters appropriately.
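The idea of hashing URLs into a Bloom Filter to skip already-seen pages can be illustrated with a minimal sketch. It is not the authors' implementation: the class and function names, the use of SHA-256 with double hashing to derive bit positions, and the example parameters are all assumptions made for illustration. The sizing uses the standard formulas m = -n ln p / (ln 2)^2 and k = (m / n) ln 2, where n is the expected number of URLs, p the tolerated false-positive rate, m the number of bits, and k the number of hash functions.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter for URL de-duplication (illustrative sketch)."""

    def __init__(self, expected_items: int, false_positive_rate: float):
        # Standard sizing: m = -n ln p / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, url: str):
        # Assumption: derive k positions from one SHA-256 digest by double hashing.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # forced odd, so never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url: str) -> bool:
        # True means "probably seen"; False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))


def filter_new_urls(urls, bloom):
    """Return URLs not yet recorded in the filter, preserving order.

    A false positive causes a genuinely new URL to be dropped; this is the
    error the filter's parameters are meant to keep acceptably small.
    """
    fresh = []
    for url in urls:
        if url not in bloom:
            bloom.add(url)
            fresh.append(url)
    return fresh
```

A crawler would call `filter_new_urls` on each batch of extracted links, for example with `BloomFilter(expected_items=1000, false_positive_rate=0.01)`. Raising the number of bits per URL lowers the false-positive rate at the cost of memory, which is the kind of parameter adjustment the abstract refers to.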
Ding Zhenguo, Wu Baogui, Xin Youqiang. Research of Large-scale URL Filter Based on Bloom Filter[J]. New Technology of Library and Information Service (现代图书情报技术), 2008, 24(3): 45-50.