How it works...

In Step 1, we looked at the dimensions of our dataset, along with the summary statistics of our numerical variables. In Step 2, we looked at the data type of each variable. In Step 3, we dropped the sku attribute, because it is an identifier that is of no use to our model. In Step 4, we checked for missing values and noticed that the lead_time attribute had 3,403 missing values, which is roughly 5% of the total number of observations. In Step 5, we dropped the observations for which lead_time was missing. Note that there are various strategies for imputing missing values, but we have not considered them in this exercise.
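The following is a minimal sketch of Steps 1 through 5 using pandas. The file name backorders.csv and the DataFrame variable name df are assumptions for illustration; only the sku and lead_time column names come from the recipe:

```python
import pandas as pd

# Hypothetical file name; load the dataset into a DataFrame
df = pd.read_csv('backorders.csv')

print(df.shape)       # Step 1: dimensions of the dataset
print(df.describe())  # Step 1: summary statistics of the numerical variables
print(df.dtypes)      # Step 2: data type of each variable

df = df.drop('sku', axis=1)   # Step 3: drop the identifier column

print(df.isnull().sum())      # Step 4: count missing values per column

df = df.dropna(subset=['lead_time'])  # Step 5: drop rows where lead_time is missing
```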

In Step 6, we used get_dummies() from the pandas library with drop_first=True to perform k-1 dummy coding on the categorical variables. In Step 7, we looked at the distribution of our target variable. The class labels, 0 and 1, are in a ratio of approximately 19%-81%, which is not well balanced; however, we had enough observations of both classes to proceed to the next steps. In Step 8, we separated our predictor and response variables and split the dataset into a training set and a testing set. In Step 9, we used DecisionTreeClassifier() to build our model. We noted the default hyperparameter values and observed that, by default, DecisionTreeClassifier() uses the Gini impurity measure as the splitting criterion.
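A minimal sketch of Steps 6 through 9 follows, continuing from the df DataFrame above. The target column name went_on_backorder, the test size, and the random seed are assumptions for illustration:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.get_dummies(df, drop_first=True)  # Step 6: k-1 dummy coding of categorical variables

# Step 7: distribution of the target variable (assumed column name)
print(df['went_on_backorder'].value_counts(normalize=True))

# Step 8: separate predictors and response, then split into train and test sets
X = df.drop('went_on_backorder', axis=1)
y = df['went_on_backorder']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Step 9: build the model with default hyperparameters (criterion='gini' by default)
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)
print(dt_model.get_params())  # inspect the default hyperparameter values
```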

In Step 10, we used the model to predict on our test sample. We noted the overall accuracy and the number of TP, TN, FP, and FN values we achieved. In Step 11, we used plot_confusion_matrix() to plot these values in the form of a confusion matrix. Note that plot_confusion_matrix() is readily available at https://bit.ly/2MdyDU9 and is also provided with the book in the code folder for this chapter.
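The same values can be computed directly with scikit-learn, as in the following sketch. It continues from the fitted dt_model above and uses confusion_matrix() and accuracy_score() rather than the plot_confusion_matrix() helper supplied with the book:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = dt_model.predict(X_test)      # Step 10: predict the test sample
print(accuracy_score(y_test, y_pred))  # overall accuracy

# Step 11: TN/FP/FN/TP counts; rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_test, y_pred))
```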

We then looked at changing the hyperparameter values to fine-tune our model, performing a grid search to find the optimal hyperparameter values. In Step 12, we defined the combinations of hyperparameter values that we wanted to apply to our grid search. In Steps 13 and 14, we used GridSearchCV() to look for the optimal hyperparameters. In Step 15, we used the model returned by the grid search to predict on our test observations. Finally, in Step 16, we used classification_report() from sklearn.metrics to generate various scores, including precision, recall, f1-score, and support.
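A minimal sketch of Steps 12 through 16 is shown below, continuing from the train/test split above. The specific grid values and the cross-validation settings are illustrative assumptions, not necessarily the ones used in the recipe:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Step 12: candidate hyperparameter values (illustrative)
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 10, 50],
}

# Steps 13 and 14: search for the best hyperparameter combination
grid = GridSearchCV(DecisionTreeClassifier(random_state=1),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_)

# Step 15: predict the test observations with the best model found
y_pred_best = grid.best_estimator_.predict(X_test)

# Step 16: precision, recall, f1-score, and support per class
print(classification_report(y_test, y_pred_best))
```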
