Algorithm of the Text Copy Detection Based on Text Structure Tree

doi:10.11925/infotech.1003-3513.2009.10.09

New Technology of Library and Information Service

2009, Issue (10): 50-55 https://doi.org/10.11925/infotech.1003-3513.2009.10.09

Knowledge Organization and Knowledge Management


Wang Sen (王森), Wang Yu (王宇)

(School of Management, Dalian University of Technology, Dalian 116024, China)


Abstract

To address the growing problem of plagiarism in academia, this paper proposes a copy detection algorithm for papers based on a text structure tree. A paper is represented as a three-layer tree: the root node represents the whole paper, branch nodes represent sentence bags, and leaf nodes represent sentences. Sentence similarity is computed from a function and the synthetic similarity of sentences; the similarity of a leaf node is the maximal sentence similarity, and the similarity of each upper-layer node is derived from the similarities of the adjacent lower-layer nodes. Papers from the China Journal Full-Text Database were used for testing, and the experimental results show that the algorithm is feasible and efficient.

Key words: copy detection; sentence similarity; sentence bag; structure tree
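
The three-layer scheme described in the abstract can be illustrated with a small Python sketch. The abstract does not specify the sentence-similarity function, how sentences are grouped into sentence bags, or how lower-layer scores are combined, so everything below is an assumption made for illustration: bags are fixed-size runs of consecutive sentences (`bag_size` is a hypothetical parameter), sentence similarity is a Jaccard overlap of character bigrams standing in for the paper's synthetic similarity, and each upper node takes the mean of its children. Only the rule that a leaf takes its maximal sentence similarity comes from the abstract.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class SentenceBag:
    """Branch node: a group of sentences (the leaf nodes)."""
    sentences: List[str]


@dataclass
class PaperTree:
    """Root node: the whole paper, holding its sentence bags."""
    bags: List[SentenceBag]


def build_tree(text: str, bag_size: int = 3) -> PaperTree:
    """Split text into sentences and group them into fixed-size bags.

    Fixed-size grouping is an assumption; the abstract does not say
    how sentence bags are formed.
    """
    parts = re.split(r"[。！？!?.]+", text)
    sentences = [p.strip() for p in parts if p.strip()]
    bags = [SentenceBag(sentences[i:i + bag_size])
            for i in range(0, len(sentences), bag_size)]
    return PaperTree(bags)


def char_bigrams(sentence: str) -> Set[str]:
    """Character bigrams, used here in place of real word segmentation."""
    compact = re.sub(r"\s+", "", sentence)
    return {compact[i:i + 2] for i in range(len(compact) - 1)}


def sentence_similarity(a: str, b: str) -> float:
    """Jaccard overlap of bigram sets (assumed stand-in measure)."""
    x, y = char_bigrams(a), char_bigrams(b)
    if not x or not y:
        return 0.0
    return len(x & y) / len(x | y)


def mean(values: Iterable[float]) -> float:
    items = list(values)
    return sum(items) / len(items) if items else 0.0


def leaf_similarity(sentence: str, source: PaperTree) -> float:
    """Leaf score: the maximal similarity to any sentence in the source."""
    return max((sentence_similarity(sentence, s)
                for bag in source.bags for s in bag.sentences),
               default=0.0)


def bag_similarity(bag: SentenceBag, source: PaperTree) -> float:
    """Branch score: mean of its leaf scores (assumed aggregation)."""
    return mean(leaf_similarity(s, source) for s in bag.sentences)


def paper_similarity(suspect: PaperTree, source: PaperTree) -> float:
    """Root score: mean of its bag scores (assumed aggregation)."""
    return mean(bag_similarity(b, source) for b in suspect.bags)
```

A call such as `paper_similarity(build_tree(suspect_text), build_tree(source_text))` returns a score between 0 and 1, where higher values indicate more copied content. Taking the mean at each upper layer is only one plausible choice; a length-weighted average would fit the same tree.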

Received: 2009-08-28  Published: 2009-10-25

CLC number: TP391.1

Corresponding author: Wang Yu  E-mail: ywang@dlut.edu.cn

About the authors: Wang Sen, Wang Yu

Cite this article:

Wang Sen, Wang Yu. Algorithm of the Text Copy Detection Based on Text Structure Tree[J]. New Technology of Library and Information Service, 2009, (10): 50-55.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2009.10.09 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2009/V/I10/50

