New Technology of Library and Information Service  2013, Vol. 29 Issue (9): 41-47    DOI: 10.11925/infotech.1003-3513.2013.09.07
Fast Duplicate Detection for Chinese Texts Based on Semantic Fingerprint
Li Gang, Mao Jin, Chen Jinghao
Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China
Abstract  For Chinese texts, features are first extracted and passed to the Simhash algorithm to generate semantic fingerprints. The Hamming distance between semantic fingerprints is then used to measure the similarity between texts. As the final step of the duplicate detection process, the Single-Pass clustering algorithm clusters the generated fingerprints, and the resulting clusters are the final output. In an experimental comparison with the Shingle algorithm, the Simhash approach shows better precision and robustness, and its speed makes it capable of processing large volumes of text.
Key words  Semantic fingerprint      Simhash      Single-Pass      Duplicate detection
Received: 14 June 2013      Published: 27 September 2013
CLC Number:  TP391.3
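
The pipeline summarized in the abstract (Simhash fingerprints, Hamming distance, Single-Pass clustering) can be illustrated with a minimal sketch. It is not the authors' implementation. The 64-bit fingerprint width, the MD5-based feature hash, the character-bigram features, the frequency weights, the use of each cluster's first fingerprint as its representative, and the distance threshold of 3 are all assumptions chosen for illustration. The function names are hypothetical.

```python
import hashlib
from collections import Counter

FINGERPRINT_BITS = 64  # assumed width; the abstract does not state one


def char_bigrams(text):
    """Stand-in feature extractor (assumption): overlapping character bigrams."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def feature_hash(feature):
    """Map a feature to a 64-bit integer; MD5 is only a convenient stand-in."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(features):
    """Build a Simhash fingerprint, weighting each feature by its frequency."""
    totals = [0] * FINGERPRINT_BITS
    for feature, weight in Counter(features).items():
        h = feature_hash(feature)
        for bit in range(FINGERPRINT_BITS):
            # A set bit votes +weight for that position, a clear bit votes -weight.
            totals[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def single_pass(fingerprints, threshold=3):
    """Cluster fingerprint indices in one pass over the input.

    Each fingerprint joins the nearest existing cluster (measured against that
    cluster's first fingerprint) if the distance is within the threshold;
    otherwise it starts a new cluster.
    """
    seeds = []     # representative fingerprint of each cluster
    clusters = []  # member indices of each cluster
    for index, fp in enumerate(fingerprints):
        best, best_distance = None, threshold + 1
        for cluster_id, seed in enumerate(seeds):
            distance = hamming_distance(seed, fp)
            if distance < best_distance:
                best, best_distance = cluster_id, distance
        if best is None:
            seeds.append(fp)
            clusters.append([index])
        else:
            clusters[best].append(index)
    return clusters


if __name__ == "__main__":
    texts = [
        "基于语义指纹的中文文本快速去重",
        "基于语义指纹的中文文本快速去重",
        "一种面向网络话题发现的增量文本聚类算法",
    ]
    fingerprints = [simhash(char_bigrams(t)) for t in texts]
    print(single_pass(fingerprints))
```

Because each text is compared only with one representative per cluster, the clustering step needs a single pass over the fingerprints. The threshold trades recall against precision and would need tuning on real data.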

Cite this article:

Li Gang, Mao Jin, Chen Jinghao. Fast Duplicate Detection for Chinese Texts Based on Semantic Fingerprint. New Technology of Library and Information Service, 2013, 29(9): 41-47.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2013.09.07     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2013/V29/I9/41
