New Technology of Library and Information Service  2013, Vol. 29 Issue (9): 41-47     https://doi.org/10.11925/infotech.1003-3513.2013.09.07
Knowledge Organization and Knowledge Management
Fast Duplicate Detection for Chinese Texts Based on Semantic Fingerprint
Li Gang (李纲), Mao Jin (毛进), Chen Jinghao (陈璟浩)
Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China
Abstract: For Chinese texts, content features are first extracted and combined with the Simhash algorithm to generate a semantic fingerprint for each text. The Hamming distance between semantic fingerprints is then used to judge the similarity between texts. The Single-Pass clustering algorithm is integrated to cluster the fingerprints quickly, and the resulting fingerprint clusters are the final output of the duplicate detection process, which yields a fast deduplication workflow for Chinese texts. In experiments comparing the approach with the Shingle algorithm, the Simhash approach shows advantages in precision and robustness, and its running speed makes it well suited to deduplicating large volumes of text.
Key words: Semantic fingerprint; Simhash; Single-Pass; Duplicate detection
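The pipeline summarized in the abstract (features, Simhash fingerprint, Hamming distance, Single-Pass clustering) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. Several choices are assumptions made only for the example: character bigrams stand in for the paper's feature extraction, truncated MD5 stands in for the feature hash function, the fingerprint is 64 bits wide, each cluster is represented by its first member, and the distance threshold of 3 bits is arbitrary. The function names `char_bigrams`, `simhash`, `hamming_distance`, and `single_pass` are hypothetical.

```python
import hashlib
from collections import Counter


def char_bigrams(text):
    """Assumed feature extractor: character bigrams weighted by frequency."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def simhash(features, bits=64):
    """Build a Simhash fingerprint from a {feature: weight} mapping."""
    totals = [0] * bits
    for feature, weight in features.items():
        # Assumption: MD5 truncated to `bits` bits is the per-feature hash.
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Add the weight where the hash bit is 1, subtract it where it is 0.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def single_pass(fingerprints, threshold=3):
    """Single-Pass clustering: one scan, each item joins the first cluster
    whose seed fingerprint is within `threshold` bits, else starts a new one."""
    clusters = []  # list of (seed_fingerprint, [member indices])
    for idx, fp in enumerate(fingerprints):
        for seed, members in clusters:
            if hamming_distance(seed, fp) <= threshold:
                members.append(idx)
                break
        else:
            clusters.append((fp, [idx]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    # Made-up sample texts, used only to exercise the functions.
    docs = [
        "今天武汉天气晴朗,适合出门散步。",
        "今天武汉天气晴朗,适合出门散步!",
        "文本去重是信息检索中的常见任务。",
    ]
    fps = [simhash(char_bigrams(d)) for d in docs]
    print(single_pass(fps))
```

Each group returned by `single_pass` is a set of document indices treated as duplicates of one another. On very short texts a single differing character can flip many fingerprint bits, so the threshold would need tuning against real data.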
Received: 2013-06-14      Published: 2013-09-27
CLC Number: TP391.3
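Funding: This work is one of the research outcomes of the National Natural Science Foundation of China project "Research on the Dynamic Evolution of Scientific Research Teams" (Grant No. 71273196).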
Cite this article:
Li Gang, Mao Jin, Chen Jinghao. Fast Duplicate Detection for Chinese Texts Based on Semantic Fingerprint. New Technology of Library and Information Service, 2013, 29(9): 41-47.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2013.09.07      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2013/V29/I9/41