Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (5): 59-70    DOI: 10.11925/infotech.2096-3467.2020.0902
Vocal Music Classification Based on Multi-category Feature Fusion
Meng Zhen, Wang Hao, Yu Wei, Deng Sanhong, Zhang Baolong
Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
[Objective] This paper proposes a model that combines statistical audio features with image features to address the classification problems facing music retrieval. [Methods] First, we extracted statistical features from the audio and Mel spectrogram image features with the help of machine learning methods. Then, we transformed the audio classification task into an image classification task. Finally, we constructed a deep learning model that fuses the audio statistics with the Mel spectrogram image features. [Results] In vocal music classification, the F1 value of the new method based on image features was about 6 percentage points higher than that of the classic machine learning methods. The F1 value of the deep learning model based on feature fusion exceeded 69%, which is 3.4 percentage points higher than that of the model using image features alone. [Limitations] The experimental dataset is small, and the advantages of deep learning methods were not fully exploited. [Conclusions] The sampling parameters of the Mel spectrogram influence the experimental results. The new feature fusion method can effectively improve the performance of vocal music classification.

Key words: Vocal Music Classification; CNN; Feature Fusion; Music Information Retrieval; Mel-Frequency Cepstrum
Received: 15 September 2020      Published: 08 March 2021
ZTFLH:  TP391  
Fund: This work is supported by the National Social Science Fund of China (17ZDA291).
Meng Zhen, Wang Hao, Yu Wei, Deng Sanhong, Zhang Baolong. Vocal Music Classification Based on Multi-category Feature Fusion. Data Analysis and Knowledge Discovery, 2021, 5(5): 59-70.

The Research Framework
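| Feature category | Feature name | Description |
|---|---|---|
| Time-domain features | Central moments | Statistical features of the waveform signal such as the mean, standard deviation, skewness, and kurtosis. Mainly used to distinguish voiced from unvoiced segments and to locate the boundaries between initials and finals and between silent and speech segments. |
| Time-domain features | Zero-crossing rate | For a continuous speech signal, a zero crossing means the time-domain waveform crosses the time axis; for a discrete signal, a zero crossing occurs when adjacent samples change sign. Voiced sounds have a lower zero-crossing rate, and unvoiced sounds have a higher one. |
| Time-domain features | Tempo | Tempo characterizes how fast the music is and is defined as the number of beats per minute. |
| Frequency-domain features | Mel-frequency cepstral coefficients | The Mel-frequency cepstral coefficients of a signal are a small set of features that concisely describe the overall shape of the spectral envelope and model the characteristics of the human voice. |
| Frequency-domain features | Chroma features | Chroma is an important representation of music audio, in which the entire spectrum is projected onto 12 bins representing the 12 distinct semitones (or chroma) of the musical octave. |
| Frequency-domain features | Spectral centroid | The spectral centroid indicates where the "center of mass" of the sound is located and is computed as the weighted mean of the frequencies in the sound. |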
Description of Statistical Characteristics of Speech Signals
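As a rough illustration of three of the features described in this table, the sketch below computes the central-moment statistics, the zero-crossing rate, and the spectral centroid in plain Python. It is not the authors' extraction code: the function names, the 8 000 Hz sample rate, and the synthetic 1 000 Hz test tone are assumptions chosen for the example, and tempo, Mel-frequency cepstral coefficients, and chroma are left out because they need a dedicated audio library.

```python
import cmath
import math


def central_moment_stats(x):
    """Return mean, standard deviation, skewness, and kurtosis of a signal."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    std = math.sqrt(m2)
    return mean, std, m3 / std ** 3, m4 / m2 ** 2


def zero_crossing_rate(x):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(x, x[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(x) - 1)


def spectral_centroid(x, sample_rate):
    """Magnitude-weighted mean frequency (Hz) from a naive DFT of one frame."""
    n = len(x)
    half = n // 2 + 1  # non-negative frequency bins only
    mags = []
    for k in range(half):
        s = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(s))
    freqs = [k * sample_rate / n for k in range(half)]
    return sum(f * m for f, m in zip(freqs, mags)) / sum(mags)


# Hypothetical test signal: a pure 1 000 Hz tone, 256 samples at 8 000 Hz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(256)]

mean, std, skewness, kurtosis = central_moment_stats(tone)
zcr = zero_crossing_rate(tone)
centroid = spectral_centroid(tone, SAMPLE_RATE)
```

For a pure tone the centroid sits at the tone's own frequency, and the zero-crossing rate is close to two crossings per period, which is why these two features respond to how much high-frequency content a recording contains.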
Example of Sonogram
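| No. | Parameter | Explanation | Values |
|---|---|---|---|
| 1 | sampling_rate | Sampling rate: the number of audio samples taken per second | 44 100 Hz (default) |
| 2 | duration | Duration of the audio clip | 30 s (default) |
| 3 | n_mels | Number of Mel bands generated, i.e. the height of the spectrogram | 64, 128, 256 |
| 4 | hop_length | Number of samples between successive frames | 128, 256, 512, 1 024, 2 048 |
| 5 | spec_width | Width to which the spectrogram is cropped | 64, 128, 256 |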
Description of librosa Mel Spectrum Graph Sampling Parameters
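To make the effect of these parameters concrete, the sketch below works out the size of the Mel spectrogram they produce. It assumes librosa's default centered framing, under which a signal of n samples yields 1 + n // hop_length frames. The helper names are hypothetical, and counting non-overlapping crops of width spec_width is only one plausible reading of "cropped width", not a procedure confirmed by the source.

```python
def mel_spectrogram_shape(sampling_rate, duration, n_mels, hop_length):
    """Return (height, width) of a Mel spectrogram under centered framing."""
    n_samples = int(sampling_rate * duration)
    n_frames = 1 + n_samples // hop_length  # assumes librosa-style center=True
    return n_mels, n_frames


def crop_summary(sampling_rate, n_frames, hop_length, spec_width):
    """Return how many non-overlapping crops fit and the seconds each spans."""
    n_crops = n_frames // spec_width
    seconds_per_crop = spec_width * hop_length / sampling_rate
    return n_crops, seconds_per_crop


# Defaults from the table: 44 100 Hz and 30 s; the other values are examples.
height, width = mel_spectrogram_shape(44100, 30, n_mels=64, hop_length=512)
n_crops, seconds_per_crop = crop_summary(44100, width, 512, spec_width=256)

# Spectrogram width for every hop_length value listed in the table.
widths = {hop: mel_spectrogram_shape(44100, 30, 64, hop)[1]
          for hop in (128, 256, 512, 1024, 2048)}
```

Doubling hop_length roughly halves the spectrogram width, so a crop of fixed spec_width covers twice as much audio but with coarser time resolution.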
Example of Mel-Frequency Cepstrum Diagram
Diagram of Network Data Flow
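The abstract describes a deep learning model that combines audio statistics with Mel spectrogram image features, but the exact fusion mechanism is not recoverable from this page. The sketch below assumes the simplest form, concatenating the two feature vectors before a single linear layer with softmax. The vector sizes (16 and 6), the random weights, and all function names are hypothetical placeholders, not the authors' network.

```python
import math
import random

# The eight genre labels that appear in the per-class result tables.
GENRES = ["Electronic", "Experimental", "Folk", "Hip-Hop",
          "Instrumental", "International", "Pop", "Rock"]


def softmax(logits):
    """Convert raw scores to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(image_embedding, stat_features, weights, bias):
    """Concatenate both feature vectors, then apply one linear layer + softmax."""
    fused = list(image_embedding) + list(stat_features)  # assumed fusion step
    logits = [sum(w * v for w, v in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


# Hypothetical inputs: a 16-d image embedding and 6 statistical features.
rng = random.Random(0)
image_embedding = [rng.uniform(-1, 1) for _ in range(16)]
stat_features = [rng.uniform(-1, 1) for _ in range(6)]
weights = [[rng.uniform(-0.1, 0.1) for _ in range(22)] for _ in GENRES]
bias = [0.0] * len(GENRES)

probs = fuse_and_classify(image_embedding, stat_features, weights, bias)
predicted = GENRES[probs.index(max(probs))]
```

In a trained model the image embedding would come from the convolutional branch and the weights would be learned; here they are random, so the predicted label carries no meaning.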
| No. | Model | Precision | Recall | F1 |
|---|---|---|---|---|
| 1 | LR | 0.5103 | 0.5162 | 0.5110 |
| 2 | NB | 0.3953 | 0.3550 | 0.3253 |
| 3 | SVM | 0.5923 | 0.5925 | 0.5922 |
| 4 | DT | 0.3346 | 0.3312 | 0.3326 |
| 5 | XGBoost | 0.5720 | 0.5687 | 0.5683 |
Music Classification Results of Machine Learning Models
| No. | Category | Precision | Recall | F1 |
|---|---|---|---|---|
| 1 | Electronic | 0.5104 | 0.4900 | 0.5000 |
| 2 | Experimental | 0.4920 | 0.5000 | 0.4950 |
| 3 | Folk | 0.6415 | 0.6800 | 0.6601 |
| 4 | Hip-Hop | 0.6868 | 0.6800 | 0.6834 |
| 5 | Instrumental | 0.5876 | 0.5700 | 0.5786 |
| 6 | International | 0.6250 | 0.6500 | 0.6372 |
| 7 | Pop | 0.5100 | 0.5100 | 0.5100 |
| 8 | Rock | 0.6875 | 0.6600 | 0.6734 |
| | Macro average | 0.5923 | 0.5925 | 0.5922 |
Various Vocal Recognition Indexes of SVM Model Based on Statistical Features
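The macro-average row in this table is the unweighted mean of the per-class scores. The sketch below recomputes the macro recall and macro F1 from the per-class SVM values, and shows how a single class's F1 follows from its precision and recall. The variable and function names are illustrative only.

```python
# Per-class (precision, recall, F1) for the SVM model, copied from the table.
svm_per_class = {
    "Electronic":    (0.5104, 0.4900, 0.5000),
    "Experimental":  (0.4920, 0.5000, 0.4950),
    "Folk":          (0.6415, 0.6800, 0.6601),
    "Hip-Hop":       (0.6868, 0.6800, 0.6834),
    "Instrumental":  (0.5876, 0.5700, 0.5786),
    "International": (0.6250, 0.6500, 0.6372),
    "Pop":           (0.5100, 0.5100, 0.5100),
    "Rock":          (0.6875, 0.6600, 0.6734),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(values):
    """Unweighted mean over classes: every class counts equally."""
    values = list(values)
    return sum(values) / len(values)


macro_recall = macro_average(r for _, r, _ in svm_per_class.values())
macro_f1 = macro_average(f for _, _, f in svm_per_class.values())
electronic_f1 = f1_score(*svm_per_class["Electronic"][:2])
```

Because every genre contributes equally, a weak class such as Pop pulls the macro average down as much as a strong class such as Hip-Hop pulls it up.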
SVM Vocal Classification Result Confusion Matrix
Statistical Feature Visualization
Change in Learning Rate
hop_length Value Change and Experimental Results


| | 64 | 128 | 256 |
|---|---|---|---|
| 64 | 0.6470 | 0.6554 | 0.6560 |
| 128 | 0.6276 | 0.6519 | 0.6513 |
| 256 | 0.6162 | 0.6297 | 0.6489 |
n_mels and spec_width Value Changes and Experimental Results
| No. | Category | Precision | Recall | F1 |
|---|---|---|---|---|
| 1 | Electronic | 0.6097 | 0.7500 | 0.6726 |
| 2 | Experimental | 0.5588 | 0.5700 | 0.5643 |
| 3 | Folk | 0.6578 | 0.7500 | 0.7009 |
| 4 | Hip-Hop | 0.7676 | 0.7600 | 0.7638 |
| 5 | Instrumental | 0.6747 | 0.5600 | 0.6120 |
| 6 | International | 0.7835 | 0.7600 | 0.7715 |
| 7 | Pop | 0.5108 | 0.4700 | 0.4895 |
| 8 | Rock | 0.7111 | 0.6400 | 0.6736 |
| | Macro average | 0.6592 | 0.6575 | 0.6560 |
Various Vocal Recognition Indicators of Deep Learning Model Based on Image Features
| No. | Model | Precision | Recall | F1 |
|---|---|---|---|---|
| 1 | ResNet18 | 0.6360 | 0.6437 | 0.6437 |
| 2 | ResNet50 | 0.6358 | 0.6375 | 0.6317 |
| 3 | Inception V4 | 0.6199 | 0.6325 | 0.6193 |
| 4 | MobileNet | 0.6398 | 0.6462 | 0.6410 |
| 5 | ShuffleNet | 0.6443 | 0.6437 | 0.6437 |
| 6 | EfficientNet | 0.6398 | 0.6462 | 0.6410 |
| 7 | Multi-layer CNN | 0.6592 | 0.6575 | 0.6560 |
Vocal Music Classification Index Based on Image Pre-training Model
| No. | Features | Precision | Recall | F1 |
|---|---|---|---|---|
| 1 | MEL+STATISTICS | 0.6892 | 0.6937 | 0.6904 |
| 2 | MEL | 0.6592 | 0.6575 | 0.6560 |
Feature Fusion and Single Image Feature Deep Learning Model Recognition Index
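The improvements quoted in the abstract can be checked directly against the macro F1 values in the tables. The short sketch below computes the gains in percentage points. The dictionary keys and the helper name are illustrative labels, not terms from the paper.

```python
# Macro F1 values reported in the result tables.
macro_f1 = {
    "SVM, statistical features": 0.5922,
    "CNN, Mel image features": 0.6560,
    "CNN, Mel + statistics": 0.6904,
}


def percentage_point_gain(new, old):
    """Difference between two scores, expressed in percentage points."""
    return round((new - old) * 100, 1)


image_vs_svm = percentage_point_gain(
    macro_f1["CNN, Mel image features"], macro_f1["SVM, statistical features"])
fusion_vs_image = percentage_point_gain(
    macro_f1["CNN, Mel + statistics"], macro_f1["CNN, Mel image features"])
```

The two gains come out at 6.4 and 3.4 percentage points, consistent with the "about 6" and "3.4" figures stated in the abstract.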
F1 Value of Classification Results of Various Vocal Music on Each Classifier
