Research and Implementation of Data Preprocessing Oriented to Paper Similarity Detection

doi:10.11925/infotech.1003-3513.2015.05.07

New Technology of Library and Information Service (现代图书情报技术)

2015, Vol. 31, Issue (5): 50-56 https://doi.org/10.11925/infotech.1003-3513.2015.05.07

Research Paper

Liu Huoyu^1,3, Wang Dongbo^2

1 School of Information Management, Nanjing University, Nanjing 210023, China;
2 College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China;
3 Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China

Abstract

[Objective] To examine the data problems that arise in data preprocessing for paper similarity detection and the methods for addressing them. [Methods] The raw data are analyzed in detail and preprocessed with three kinds of methods: rule-based, statistics-based, and semantics-based. [Results] The analysis reveals the data quality problems present in the raw data used for paper similarity detection, and a data preprocessing model is proposed on that basis. [Limitations] The corpus is limited in size, and preprocessing of the figures and tables in the corpus is not yet considered. [Conclusions] Data preprocessing helps improve the accuracy of paper similarity detection results, and effectively combining the rule-based, statistics-based, and semantics-based methods helps improve the preprocessing itself.

Keywords: Similarity detection, Plagiarism detection, Data preprocessing, Data quality, Data cleaning

Received: 2014-11-15  Published: 2015-06-11

CLC number: TP311.13

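Funding:

This work is one of the research outcomes of the National Natural Science Foundation of China (Department of Management Sciences) Young Scientists project "Construction of a Syntactic-level Chinese-English Parallel Corpus Based on CSSCI and Knowledge Mining Research" (No. 71303120) and the Jiangsu Provincial Social Science Foundation project "Research on Chinese-English Phrase-level Parallel Corpus Annotation and Knowledge Mining in the Big Data Environment" (No. 13XWC017).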

Corresponding author: Wang Dongbo, ORCID: 0000-0002-9894-9550, E-mail: wangdongbo0102@gmail.com

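Author contributions: Liu Huoyu proposed the research idea, designed and implemented the research plan, and drafted the paper; Wang Dongbo reviewed the paper and revised the final version.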

Cite this article:

Liu Huoyu, Wang Dongbo. Research and Implementation of Data Preprocessing Oriented to Paper Similarity Detection [J]. New Technology of Library and Information Service, 2015, 31(5): 50-56.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2015.05.07 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2015/V31/I5/50
