Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (7): 87-95    DOI: 10.11925/infotech.2096-3467.2020.0137
Classification and Indexing Method with CNN for Imbalanced Datasets
Weng Mengjuan, Yao Changqing, Han Hongqi, Wang Lijun, Ran Yaxin
Key Laboratory of Rich-media Knowledge Organization and Service of Digital Publishing Content, Beijing 100038, China
Abstract

[Objective] This paper proposes a new classification method based on a Convolutional Neural Network (CNN), aiming to improve indexing accuracy on skewed datasets. [Methods] Compared with stacking fusion methods, our method stacks the class-label probability distributions produced by each base model and uses them as the CNN input, so it does not require manually setting a weight for each base model. We evaluated the proposed model on the third-level categories of the Chinese Library Classification (CLC). [Results] The accuracy of our method reached up to 60%, which was 19% higher than that of the baseline models. [Limitations] Our method requires designing convolution kernels, which can only be determined through experiments. In addition, the complexity of classifier training at the fusion stage depends on the number of categories and the number of base models. [Conclusions] The proposed method can effectively improve indexing accuracy on imbalanced datasets. Combined with a hierarchical classification strategy, it can automatically complete CLC classification and indexing tasks.
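
The fusion idea in [Methods] can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the three hypothetical base models, the four labels, and the kernel values are all assumptions made for illustration. Each base model contributes one row of class probabilities; a kernel spanning all models is slid across the class axis to give one fused score per class. In an actual CNN the kernel values would be learned during training rather than written by hand, which is why no per-model weight has to be set manually.

```python
from typing import List

Matrix = List[List[float]]


def stack_probabilities(model_outputs: List[List[float]]) -> Matrix:
    """Stack per-model class-probability vectors into a (models x classes) matrix."""
    n_classes = len(model_outputs[0])
    for probs in model_outputs:
        if len(probs) != n_classes:
            raise ValueError("all base models must score the same label set")
    return [list(probs) for probs in model_outputs]


def conv_over_models(stacked: Matrix, kernel: List[float]) -> List[float]:
    """Apply a (models x 1) kernel at every class position: one fused score per class."""
    if len(kernel) != len(stacked):
        raise ValueError("kernel height must equal the number of base models")
    n_classes = len(stacked[0])
    return [
        sum(kernel[m] * stacked[m][c] for m in range(len(stacked)))
        for c in range(n_classes)
    ]


def predict(fused: List[float]) -> int:
    """Return the index of the class with the highest fused score."""
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical outputs of three base models over four labels (illustrative values).
outputs = [
    [0.1, 0.6, 0.2, 0.1],
    [0.2, 0.3, 0.4, 0.1],
    [0.1, 0.5, 0.3, 0.1],
]
# Placeholder kernel; a trained CNN would learn these values from data.
kernel = [0.5, 0.2, 0.3]

stacked = stack_probabilities(outputs)
fused = conv_over_models(stacked, kernel)
label = predict(fused)
```

A single kernel of this shape reduces to a weighted average of the base models; a CNN with several kernels of different sizes, as the abstract's mention of kernel design suggests, could also capture interactions between neighbouring labels.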

Received: 26 February 2020      Published: 25 July 2020
Corresponding Authors: Han Hongqi     E-mail: bithhq@163.com