Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (7): 87-95    DOI: 10.11925/infotech.2096-3467.2020.0137
Classification and Indexing Method with CNN for Imbalanced Datasets
Weng Mengjuan, Yao Changqing, Han Hongqi, Wang Lijun, Ran Yaxin
Key Laboratory of Rich-media Knowledge Organization and Service of Digital Publishing Content, Beijing 100038, China
Abstract  

[Objective] This paper proposes a new classification method based on a convolutional neural network (CNN) to improve indexing accuracy on skewed datasets. [Methods] Unlike conventional stacking fusion methods, our method stacks the label probability distributions output by each base model and feeds them to a CNN as input, so the weight of each base model does not need to be set manually. We evaluated the proposed model on the third-level categories of the Chinese Library Classification (CLC). [Results] The accuracy of our method reached 60%, which was 19 percentage points higher than that of the baseline models. [Limitations] The method requires the convolution kernels to be designed, and suitable kernels can only be determined through experiments. In addition, the complexity of classifier training at the fusion stage depends on the number of categories and base models. [Conclusions] The proposed method effectively improves indexing accuracy on imbalanced datasets. Combined with a hierarchical classification strategy, it can automatically complete CLC classification and indexing tasks.

Key words: Classification Indexing; Imbalanced Data; CNN; Stacking
Received: 26 February 2020      Published: 25 July 2020
CLC Number: TP391; G35
Corresponding Authors: Han Hongqi     E-mail: bithhq@163.com

Cite this article:

Weng Mengjuan, Yao Changqing, Han Hongqi, Wang Lijun, Ran Yaxin. Classification and Indexing Method with CNN for Imbalanced Datasets. Data Analysis and Knowledge Discovery, 2020, 4(7): 87-95.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.0137     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I7/87

Illustration of the Label Probability Distribution Output by a Single Base Model
CNN as a Fusion Model in the Stacking Heterogeneous Integration
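The input construction described in the abstract (stacking each base model's label probability distribution as CNN input) might be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `stack_probabilities`, the three-class setup, and all probability values are hypothetical, and the row-per-model layout is an assumption.

```python
from typing import Dict, List


def stack_probabilities(
    model_outputs: Dict[str, List[float]], model_order: List[str]
) -> List[List[float]]:
    """Stack one probability vector per base model into a matrix.

    Rows follow model_order; columns are class labels (assumed layout).
    """
    n_classes = len(model_outputs[model_order[0]])
    matrix = []
    for name in model_order:
        probs = model_outputs[name]
        if len(probs) != n_classes:
            raise ValueError(f"{name} returned {len(probs)} classes, expected {n_classes}")
        matrix.append(list(probs))
    return matrix


# Hypothetical outputs of four base models for one document over three classes.
outputs = {
    "NB": [0.7, 0.2, 0.1],
    "LR": [0.6, 0.3, 0.1],
    "KNN": [0.5, 0.25, 0.25],
    "SVM": [0.8, 0.1, 0.1],
}
matrix = stack_probabilities(outputs, ["NB", "LR", "KNN", "SVM"])
print(len(matrix), len(matrix[0]))  # number of base models, number of classes
```

Under this layout, each document becomes a small two-dimensional array that a convolutional layer can read, which is why no per-model weights have to be chosen by hand.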
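No.  Class No.  Class Name                                          Samples  IR
1    G48        Management of School Buildings and Equipment        257      16
2    G77        Social Education                                    282      15
3    G76        Special Education                                   298      14
4    G65        Teacher Education                                   575      7
5    G46        Educational Administration                          576      7
6    G43        Audio-Visual Education                              624      7
7    G72        Adult Education and Spare-Time Education            853      5
8    G40        Pedagogy                                            1,001    4
9    G42        Teaching Theory                                     1,001    4
10   G47        School Management                                   1,001    4
11   G51        World Education                                     1,001    4
12   G75        Ethnic Minority Education                           1,001    4
13   G41        Ideological and Political Education, Moral Education 1,002   4
14   G52        Education in China                                  1,020    4
15   G45        Teachers and Students                               1,052    4
16   G44        Educational Psychology                              1,088    4
17   G61        Preschool and Early Childhood Education             1,089    4
18   G71        Vocational and Technical Education                  1,480    3
19   G62        Primary Education                                   1,564    3
20   G63        Secondary Education                                 2,649    2
21   G64        Higher Education                                    4,118    1
Total                                                               23,532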
Imbalance Ratio with G64 as Majority Class
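The IR column can be reproduced from the sample counts. The sketch below assumes IR is the majority-class count (G64) divided by the class count, rounded to the nearest integer. This page does not state the formula, so the rounding rule is an inference that happens to match every row of the table.

```python
# Sample counts per CLC class, copied from the table above.
sample_counts = {
    "G48": 257, "G77": 282, "G76": 298, "G65": 575, "G46": 576,
    "G43": 624, "G72": 853, "G40": 1001, "G42": 1001, "G47": 1001,
    "G51": 1001, "G75": 1001, "G41": 1002, "G52": 1020, "G45": 1052,
    "G44": 1088, "G61": 1089, "G71": 1480, "G62": 1564, "G63": 2649,
    "G64": 4118,
}


def imbalance_ratios(counts):
    """Return majority-class size divided by each class size, rounded (assumed rule)."""
    majority = max(counts.values())
    return {label: round(majority / n) for label, n in counts.items()}


ratios = imbalance_ratios(sample_counts)
total = sum(sample_counts.values())
print(ratios["G48"], ratios["G64"], total)
```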
Class No.  NB    LR    KNN   SVM   Baseline Model  Fusion Model
G64        0.69  0.83  0.61  0.74  0.90            0.68
G42        0.29  0.28  0.24  0.42  0.09            0.45
G63        0.54  0.74  0.55  0.71  0.82            0.70
G44        0.83  0.74  0.77  0.73  0.69            0.78
G61        0.70  0.81  0.72  0.77  0.82            0.83
G71        0.66  0.68  0.51  0.74  0.68            0.80
G41        0.57  0.53  0.64  0.58  0.52            0.62
G51        0.46  0.37  0.33  0.36  0.31            0.47
G72        0.66  0.82  0.67  0.79  0.71            0.79
G75        0.61  0.62  0.54  0.66  0.58            0.75
G45        0.49  0.56  0.48  0.54  0.47            0.59
G48        0.38  0.42  0.58  0.58  0.00            0.65
G46        0.24  0.14  0.48  0.24  0.00            0.27
G43        0.58  0.45  0.42  0.39  0.10            0.49
G62        0.52  0.75  0.52  0.70  0.69            0.74
G76        0.47  0.27  0.57  0.40  0.00            0.72
G40        0.46  0.45  0.43  0.41  0.39            0.48
G52        0.65  0.53  0.55  0.53  0.56            0.58
G47        0.25  0.41  0.34  0.48  0.17            0.46
G65        0.33  0.22  0.21  0.26  0.07            0.33
G77        0.25  0.43  0.39  0.32  0.00            0.39
Average    0.51  0.53  0.50  0.54  0.41            0.60
Classification Accuracy of Each Model
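The averages reported for the baseline and fusion models appear to be unweighted means of the per-class accuracies. That reading is an inference from the numbers, not something stated on this page. A short check, with the values copied from the last two columns of the table in row order (G64 first, G77 last):

```python
# Per-class accuracies copied from the table above, in row order.
baseline = [0.90, 0.09, 0.82, 0.69, 0.82, 0.68, 0.52, 0.31, 0.71, 0.58, 0.47,
            0.00, 0.00, 0.10, 0.69, 0.00, 0.39, 0.56, 0.17, 0.07, 0.00]
fusion = [0.68, 0.45, 0.70, 0.78, 0.83, 0.80, 0.62, 0.47, 0.79, 0.75, 0.59,
          0.65, 0.27, 0.49, 0.74, 0.72, 0.48, 0.58, 0.46, 0.33, 0.39]


def macro_average(values):
    """Unweighted mean over classes, so every class counts equally."""
    return sum(values) / len(values)


baseline_avg = macro_average(baseline)
fusion_avg = macro_average(fusion)
gain = fusion_avg - baseline_avg
print(round(baseline_avg, 2), round(fusion_avg, 2), round(gain, 2))
```

Because each class carries equal weight, the four minority classes on which the baseline scores 0.00 pull its average down sharply, even though it is the strongest model on the majority class G64.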
Class No.  KNN   LR    MNB   SVM   Fusion Model
G47        0.54  0.71  0.61  0.64  0.64
G75        0.69  0.78  0.69  0.74  0.81
G45        0.58  0.64  0.62  0.60  0.68
G41        0.70  0.68  0.74  0.67  0.70
G40        0.49  0.49  0.53  0.52  0.53
G44        0.89  0.89  0.86  0.84  0.89
G52        0.65  0.67  0.63  0.73  0.70
G61        0.85  0.88  0.84  0.84  0.91
G42        0.59  0.73  0.63  0.71  0.75
G51        0.54  0.64  0.65  0.61  0.66
Average    0.65  0.71  0.68  0.69  0.73
Algorithm Performance on Balanced Dataset
The Performance with Increasing Length of the Convolution Kernel
The Performance with Increasing Width of the Convolution Kernel
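How the kernel size affects the feature map can be illustrated with a plain "valid" two-dimensional convolution over a model-by-class probability matrix. This is a generic sketch, not the network used in the paper: the 4 x 21 input shape (four base models, 21 classes), the 4 x 3 kernel, the all-ones values, and the mapping of kernel length and width to rows and columns are all assumptions. As in most CNN frameworks, the operation is cross-correlation (the kernel is not flipped).

```python
def conv2d_valid(matrix, kernel):
    """Slide kernel over matrix without padding and return the feature map."""
    rows, cols = len(matrix), len(matrix[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    if k_rows > rows or k_cols > cols:
        raise ValueError("kernel must not be larger than the input")
    feature_map = []
    for i in range(rows - k_rows + 1):
        out_row = []
        for j in range(cols - k_cols + 1):
            total = 0.0
            for di in range(k_rows):
                for dj in range(k_cols):
                    total += matrix[i + di][j + dj] * kernel[di][dj]
            out_row.append(total)
        feature_map.append(out_row)
    return feature_map


# Hypothetical input: 4 base models x 21 classes, filled with ones for clarity.
probability_matrix = [[1.0] * 21 for _ in range(4)]
# Hypothetical kernel spanning all 4 models and 3 adjacent classes.
kernel = [[1.0] * 3 for _ in range(4)]

feature_map = conv2d_valid(probability_matrix, kernel)
print(len(feature_map), len(feature_map[0]))  # output rows, output columns
```

With an input of H x W and a kernel of h x w, the feature map has (H - h + 1) x (W - w + 1) entries, so enlarging the kernel in either direction shrinks the output while widening the context each output value sees.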