Cross-validation

We could proceed with manual optimization and find the right model by exhaustively trying many different configurations, but doing so would waste an immense amount of time (and produce barely reusable code) and would overfit the test dataset. Cross-validation is instead the correct way to run hyperparameter optimization. Let's now see how Spark performs this crucial task.

First of all, as the training dataset will be used many times during cross-validation, we can cache it. Therefore, let's cache it after all the transformations have been applied:

In: pipeline_to_clf = Pipeline(
        stages=preproc_stages + [assembler]).fit(sampled_train_df)
    train = pipeline_to_clf.transform(sampled_train_df).cache()
    test = pipeline_to_clf.transform(test_df)
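
Keep in mind that cache() is lazy: the DataFrame is only materialized in memory the first time an action touches it. If you want the preprocessing pipeline to run just once before the repeated cross-validation passes, you can force materialization with an action such as count(). This is an optional step under that assumption, not part of the original recipe:

    # Force the cached training DataFrame to be materialized now, so the
    # preprocessing transformations are not re-run for every CV fold.
    train.count()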

The useful classes for hyperparameter optimization with cross-validation are contained in the pyspark.ml.tuning package. Two elements are essential: a grid map of parameters (which can be built with ParamGridBuilder) and the actual cross-validation procedure (run by the CrossValidator class).
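
Besides the parameter grid, CrossValidator also needs an evaluator to score each candidate model; here we reuse the F1 evaluator created in the previous section. In case it is no longer in scope, a minimal sketch that rebuilds it (assuming the label column is named target_cat, as in the rest of the chapter) could look like this:

    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    # Hypothetical re-creation of the evaluator used below: multiclass F1
    # computed on the target_cat label against the prediction column.
    evaluator = MulticlassClassificationEvaluator(
        labelCol="target_cat", predictionCol="prediction",
        metricName="f1")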

In this example, we want to set some parameters of our classifier that won't change throughout the cross-validation. Exactly as with scikit-learn, they're set when the classification object is created (in this case, column names, seed, and the maximum number of bins).

Then, thanks to the grid builder, we decide which arguments should be changed for each iteration of the cross-validation algorithm. In this example, we want to check how the classification performance changes as the maximum depth of each tree in the forest goes from 3 to 12 (in increments of 3) and as the number of trees in the forest is set to 20 or 50. Finally, we launch the cross-validation (with the fit method) after having set the grid map, the classifier that we want to test, and the number of folds. The evaluator parameter is essential: it tells the cross-validator which is the best model to keep. Note that this operation may take 15-20 minutes to run (under the hood, 4*2*3=24 models are trained and tested):

In: from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

    rf = RandomForestClassifier(
        cacheNodeIds=True, seed=101, labelCol="target_cat",
        featuresCol="features", maxBins=100)
    grid = (ParamGridBuilder()
            .addGrid(rf.maxDepth, [3, 6, 9, 12])
            .addGrid(rf.numTrees, [20, 50])
            .build())
    cv = CrossValidator(
        estimator=rf, estimatorParamMaps=grid,
        evaluator=evaluator, numFolds=3)
    cvModel = cv.fit(train)
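
If you want to see how each point of the grid scored, the fitted CrossValidatorModel exposes the average evaluator metric for every parameter combination (in the same order as the grid), and bestModel holds the winning forest refit on the whole training set. This is just an optional inspection sketch, not part of the original recipe:

    # Average F1 (across the 3 folds) for each of the 8 parameter combinations
    for params, metric in zip(grid, cvModel.avgMetrics):
        print({p.name: v for p, v in params.items()}, round(metric, 4))

    # cvModel.bestModel is the RandomForestClassificationModel that
    # cvModel.transform() will use for predictions
    best_rf = cvModel.bestModel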

In the end, we can predict the labels with the cross-validated model, just as we would use a pipeline or a classifier on its own. In this case, the performance of the classifier chosen with cross-validation is slightly better than in the previous case, and allows us to break the 0.97 barrier:

In: predictions = cvModel.transform(test)
    f1_preds = evaluator.evaluate(predictions)
    print("F1-score test set: %0.3f" % f1_preds)

Out: F1-score test set: 0.970

Furthermore, by plotting the normalized confusion matrix, you will immediately see that this solution is able to detect a wider variety of attacks, even the less common ones:

In: metrics = MulticlassMetrics(
        predictions.select("prediction", "target_cat").rdd)
    conf_matrix = metrics.confusionMatrix().toArray()
    plot_confusion_matrix(conf_matrix)
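
The plot_confusion_matrix helper was defined earlier in the chapter; in case it is not available, a minimal matplotlib sketch along these lines (an assumption, not the book's original helper) that row-normalizes the matrix and draws it as a heatmap would be enough:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_confusion_matrix(cm):
        # Normalize each row so that every true class sums to 1
        cm_norm = cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)
        plt.imshow(cm_norm, interpolation='nearest', cmap=plt.cm.Blues)
        plt.colorbar()
        plt.xlabel('Predicted class')
        plt.ylabel('True class')
        plt.show()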

This time, the output is the normalized confusion matrix, showing where most of the mispredictions happen.
