New Technology of Library and Information Service  2015, Vol. 31 Issue (5): 50-56    DOI: 10.11925/infotech.1003-3513.2015.05.07
Research and Implementation of Data Preprocessing Oriented to Paper Similarity Detection
Liu Huoyu1,3, Wang Dongbo2
1 School of Information Management, Nanjing University, Nanjing 210023, China;
2 College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China;
3 Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
Abstract  

[Objective] To explore the data problems and the data preprocessing methods involved in paper similarity detection. [Methods] The paper analyzes the original data in depth and briefly introduces three data preprocessing methods: rule-based, statistics-based, and semantic-based. [Results] The original data contain many data problems, and the paper describes a data preprocessing model built on that analysis. [Limitations] The corpus is limited in size, and the preprocessing of figures and tables is not covered. [Conclusions] Data preprocessing helps improve the accuracy of paper similarity detection, and combining the three methods improves the preprocessing results.
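
The abstract names the three method families without giving details, so the following is only an illustrative sketch of what the first two might look like in practice, not the authors' implementation. The function names, the regular expressions, and the `max_repeats` threshold are all assumptions made for this example. The semantic-based method is left out because it depends on language resources the abstract does not specify.

```python
import re
import unicodedata
from collections import Counter


def rule_based_clean(text: str) -> str:
    """Rule-based step (hypothetical rules): apply fixed patterns to raw text."""
    # Fold full-width letters, digits, and brackets into their ASCII forms.
    text = unicodedata.normalize("NFKC", text)
    # Drop bracketed citation markers such as [3], [1-4], or [2, 5].
    text = re.sub(r"\[\d+(?:\s*[-,]\s*\d+)*\]", "", text)
    # Collapse runs of whitespace left behind by extraction or by the removals.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def statistics_based_filter(lines, max_repeats=2):
    """Statistics-based step (hypothetical rule): drop lines that recur too often.

    Lines repeated more than max_repeats times are treated as running
    headers or footers rather than body text.
    """
    counts = Counter(lines)
    return [line for line in lines if counts[line] <= max_repeats]
```

In this sketch, `rule_based_clean` would be applied to each text line and `statistics_based_filter` to the list of lines from one document, so that repeated page furniture and citation markers do not inflate similarity scores between unrelated papers.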

Key words: Similarity detection; Plagiarism detection; Data preprocessing; Data quality; Data cleaning
Received: 15 November 2014      Published: 11 June 2015
CLC number: TP311.13

Cite this article:

Liu Huoyu, Wang Dongbo. Research and Implementation of Data Preprocessing Oriented to Paper Similarity Detection. New Technology of Library and Information Service, 2015, 31(5): 50-56.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2015.05.07     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2015/V31/I5/50

