New Technology of Library and Information Service  2009, Vol. Issue (10): 50-55    DOI: 10.11925/infotech.1003-3513.2009.10.09
Algorithm of the Text Copy Detection Based on Text Structure Tree
Wang Sen  Wang Yu
(School of Management, Dalian University of Technology, Dalian 116024, China)

To address the growing problem of academic plagiarism, this paper proposes a text copy detection algorithm based on a text structure tree. A paper is represented as a three-layer structure tree: the root node is the whole text, each branch node is a sentence bag, and each leaf node is a sentence. Sentence similarity is computed from a synthetic similarity measure and a function, and the similarity of a leaf node is taken as its maximal sentence similarity. The similarity of each upper layer is then derived from that of the layer directly below it. Finally, the algorithm is tested on papers from the China Journal Full-Text Database, and the experimental results show that it is feasible and efficient.
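
The three-layer scoring described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the synthetic similarity measure or the aggregation function, so the sketch assumes Jaccard overlap of tokens as a stand-in sentence similarity and a plain mean for deriving each upper layer from the layer below. All names (`sentence_similarity`, `leaf_similarity`, `bag_similarity`, `document_similarity`) are hypothetical.

```python
from typing import List

Sentence = List[str]   # leaf node: a sentence as a list of tokens
Bag = List[Sentence]   # branch node: a sentence bag
Document = List[Bag]   # root node: the whole text


def sentence_similarity(a: Sentence, b: Sentence) -> float:
    """Stand-in measure (assumption): Jaccard overlap of the token sets."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def leaf_similarity(sentence: Sentence, other: Document) -> float:
    """Leaf score: the maximal similarity to any sentence of the other text."""
    candidates = (s for bag in other for s in bag)
    return max((sentence_similarity(sentence, s) for s in candidates), default=0.0)


def bag_similarity(bag: Bag, other: Document) -> float:
    """Branch score: mean of the leaf scores below it (aggregation is an assumption)."""
    if not bag:
        return 0.0
    return sum(leaf_similarity(s, other) for s in bag) / len(bag)


def document_similarity(doc: Document, other: Document) -> float:
    """Root score: mean of the branch scores below it (aggregation is an assumption)."""
    if not doc:
        return 0.0
    return sum(bag_similarity(b, other) for b in doc) / len(doc)


# Toy example with pre-tokenized sentences.
suspect: Document = [[["a", "b", "c"], ["d", "e"]], [["x", "y"]]]
source: Document = [[["a", "b", "c"]], [["d", "e", "f"]]]
score = document_similarity(suspect, source)
```

In this sketch the score is asymmetric: it measures how much of the first text is covered by the second. Real Chinese-language input would need word segmentation before it could be passed in as token lists.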

Key words: Copy detection; Sentence similarity; Sentence bag; Structure tree
Received: 28 August 2009      Published: 25 October 2009


Corresponding Authors: Wang Yu     E-mail:
About authors: Wang Sen, Wang Yu

Cite this article:

Wang Sen, Wang Yu. Algorithm of the Text Copy Detection Based on Text Structure Tree. New Technology of Library and Information Service, 2009, (10): 50-55.


