Research and Implementation of Data Preprocessing Oriented to Paper Similarity Detection
Liu Huoyu1,3, Wang Dongbo2
1 School of Information Management, Nanjing University, Nanjing 210023, China;
2 College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China;
3 Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
[Objective] Explore the data problems and the data preprocessing methods involved in paper similarity detection. [Methods] The paper analyzes the original data in depth and briefly introduces three data preprocessing methods: rule-based, statistics-based, and semantic-based. [Results] The original data contain many data problems, and on that basis the paper describes a data preprocessing model. [Limitations] The corpus is limited in size, and the preprocessing of figures and tables is not covered. [Conclusions] Data preprocessing helps improve the accuracy of paper similarity detection, and combining the three methods improves the preprocessing results.
Liu Huoyu, Wang Dongbo. Research and Implementation of Data Preprocessing Oriented to Paper Similarity Detection [J]. New Technology of Library and Information Service (现代图书情报技术), 2015, 31(5): 50-56.