Experimenting by addressing the class imbalance problem

In this dataset, the number of patients in the normal, suspect, and pathological categories is not the same. In the original dataset, the numbers of normal, suspect, and pathological patients are 1,655, 295, and 176, respectively.

We will use the following code to create a bar plot of the class proportions:

# Bar plot
barplot(prop.table(table(data$NSP)),
        col = rainbow(3),
        ylim = c(0, 0.8),
        ylab = 'Proportion',
        xlab = 'NSP',
        cex.names = 1.5)

After running the preceding code, we obtain the following bar plot:

Proportion of samples in each of the three classes

In the preceding bar plot, the percentages of normal, suspect, and pathological patients are approximately 78%, 14%, and 8%, respectively. Comparing these classes, we see that there are about 5.6 times (1,655/295) as many normal patients as suspect patients, and about 9.4 times (1,655/176) as many as pathological patients. A dataset in which the classes contain significantly different numbers of cases is said to have a class imbalance problem. The class with a significantly higher number of cases may benefit from this when the model is trained, but at the cost of the other classes.

As a result, a classification model may be biased toward the class with the significantly higher number of cases and achieve higher classification accuracy for this class than for the others. When data exhibits such a class imbalance, it is important to address the issue to avoid bias in the final classification model. In such situations, we can make use of class weights.

Very often, datasets used for developing classification models have an unequal number of samples for each class. Such class imbalance issues can easily be handled using the class_weight argument of the fit() function.

The following code uses class_weight to incorporate the class imbalance information:

# Fit model with class weights
model_five <- model %>%
  fit(training,
      trainLabels,
      epochs = 200,
      batch_size = 32,
      validation_split = 0.2,
      class_weight = list("0" = 1, "1" = 5.6, "2" = 9.4))
plot(model_five)

As you can see in the preceding code, we have specified a weight of 1 for the normal class, a weight of 5.6 for the suspect class, and a weight of 9.4 for the pathological class. These weights match the class-count ratios calculated earlier (1,655/295 ≈ 5.6 and 1,655/176 ≈ 9.4), so misclassifying a minority-class sample is penalized proportionally more during training, which creates a level playing field for all three categories. We have kept all other settings the same as in the previous model. After training the network, the loss and accuracy values for each epoch are stored in model_five.
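The weights need not be hard-coded. As a minimal sketch, they can be derived from the class counts (this assumes data$NSP still holds the original labels 1, 2, and 3, and that these map to the Keras class indices 0, 1, and 2, as in the class_weight list above):

# Derive class weights from the class counts (illustrative sketch)
counts <- table(data$NSP)                             # 1655, 295, 176
ratios <- round(as.numeric(max(counts) / counts), 1)  # 1.0, 5.6, 9.4
class_weights <- setNames(as.list(ratios), c("0", "1", "2"))
# class_weights can then be passed to fit() via the class_weight argument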

The loss and accuracy plot for this experiment is shown in the following screenshot:

From the accuracy and loss plot based on the training and validation data, we do not see any obvious pattern suggesting overfitting. After about 100 epochs, there is no major improvement in loss or accuracy.

The code for the predictions from the model and the resulting confusion matrix is as follows:

# Prediction and confusion matrix
pred <- model %>%
  predict_classes(test)
table(Predicted = pred, Actual = testtarget)

OUTPUT
         Actual
Predicted   0   1   2
        0 358  12   3
        1  79  74   5
        2  23   8  41

From the preceding confusion matrix, we can make the following observations:

  • The correct classifications for the 0, 1, and 2 categories are 358, 74, and 41, respectively.
  • The overall accuracy is now reduced to 78.4%, mainly due to the drop in accuracy for the normal class, since we increased the weights for the other two classes.
  • The model correctly classifies normal, suspect, and pathological cases about 77.8%, 78.7%, and 83.7% of the time, respectively.
  • Clearly, the biggest gains are for the suspect class, which is now correctly classified 78.7% of the time, versus only 56.4% earlier.
  • For the pathological class, we do not see any major gain or loss in accuracy.
  • These results clearly show the benefit of using class weights to address class imbalance: the classification performance is now more consistent across the three classes, as the short calculation after this list confirms.
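As a minimal sketch, the overall and per-class accuracy figures quoted above can be recomputed directly from the confusion matrix (assuming pred and testtarget from the preceding code):

# Recompute accuracy figures from the confusion matrix
cm <- table(Predicted = pred, Actual = testtarget)
overall_accuracy <- sum(diag(cm)) / sum(cm)   # 473/603 ~ 0.784
per_class_accuracy <- diag(cm) / colSums(cm)  # ~0.778, 0.787, 0.837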