New Technology of Library and Information Service  2015, Vol. 31 Issue (5): 50-56    DOI: 10.11925/infotech.1003-3513.2015.05.07
Research and Implementation of Data Preprocessing Oriented to Paper Similarity Detection
Liu Huoyu1,3, Wang Dongbo2
1 School of Information Management, Nanjing University, Nanjing 210023, China;
2 College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China;
3 Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
[Objective] To explore the data problems and the data preprocessing methods involved in paper similarity detection. [Methods] This paper analyzes the original data in depth and briefly introduces three data preprocessing methods: rule-based, statistics-based, and semantic-based. [Results] The original data contain many data problems, and a data preprocessing model is described on that basis. [Limitations] The corpus is limited in size, and the preprocessing of figures and tables is not covered. [Conclusions] Data preprocessing helps improve the accuracy of paper similarity detection, and combining the three methods improves the effect of data preprocessing.

Key words: Similarity detection; Plagiarism detection; Data preprocessing; Data quality; Data cleaning
Received: 15 November 2014      Published: 11 June 2015
Liu Huoyu, Wang Dongbo. Research and Implementation of Data Preprocessing Oriented to Paper Similarity Detection. New Technology of Library and Information Service, 2015, 31(5): 50-56.

