We can also ask which instances of the test data should be considered outliers or anomalies. The previously trained autoencoder reconstructs the input data, and for each instance the MSE between the actual values and the reconstruction is computed. I also calculate the mean MSE for both class labels:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

test_dim_score.add("Class", test.vec("Class")) // attach the original labels to the scored H2O frame
val testDF = asDataFrame(test_dim_score).rdd.zipWithIndex
  .map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) } // append a row index for plotting
val schema = StructType(Array(
  StructField("Reconstruction-MSE", DoubleType, nullable = false),
  StructField("Class", ByteType, nullable = false),
  StructField("idRow", LongType, nullable = false)))
val dffd = spark.createDataFrame(testDF, schema)
dffd.show()
[Output: the first rows of the dffd DataFrame, showing the Reconstruction-MSE, Class, and idRow columns.]
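The mean MSE per class mentioned above boils down to a group-by-label average. Here is a minimal sketch using plain Scala collections with made-up MSE scores (not the book's actual data); on the real data, the same statistic could be computed on the dffd DataFrame with groupBy and avg:

```scala
// Illustrative (class label, reconstruction MSE) pairs -- assumed values only
val scored: Seq[(Int, Double)] = Seq(
  (0, 3e-6), (0, 4e-6), (0, 5e-6),   // regular transactions
  (1, 12e-6), (1, 20e-6)             // fraudulent transactions
)

// Mean reconstruction MSE per class label (0 = regular, 1 = fraud)
val meanMsePerClass: Map[Int, Double] =
  scored.groupBy(_._1).map { case (label, rows) =>
    label -> rows.map(_._2).sum / rows.size
  }

meanMsePerClass.toSeq.sortBy(_._1).foreach { case (label, m) =>
  println(f"class $label: mean MSE = $m%.2e")
}
```

With these sample values, the fraudulent class ends up with a clearly higher mean MSE, which is the pattern the plot below makes visible.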
From this DataFrame alone, it is difficult to identify outliers. Plotting the values provides more insight:
Vegas("Reduced Test", width = 800, height = 600)
  .withDataFrame(dffd)
  .mark(Point)
  .encodeX("idRow", Quantitative)
  .encodeY("Reconstruction-MSE", Quantitative)
  .encodeColor(field = "Class", dataType = Nominal)
  .show
[Figure: scatter plot of Reconstruction-MSE against idRow, colored by class label.]
As the plot shows, there is no clean separation between fraudulent and non-fraudulent cases, but the mean MSE is clearly higher for fraudulent transactions than for regular ones. Still, some interpretation is required.
From the preceding figure, we can at least see that most rows have an MSE of around 5µ. If we set the threshold at 10µ, the data points exceeding it can be treated as outliers or anomalies, that is, as fraudulent transactions.
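Applying such a threshold amounts to a simple filter. The sketch below uses an assumed cut-off of 10µ (1e-5) read off the plot and illustrative (idRow, MSE) pairs, not the book's data; on the DataFrame, the equivalent would be filtering dffd on the Reconstruction-MSE column:

```scala
// Assumed threshold taken from the plot; any row above it is flagged
val threshold = 1e-5
// Hypothetical (idRow, reconstruction MSE) pairs for illustration
val mseByRow: Seq[(Long, Double)] =
  Seq(0L -> 4e-6, 1L -> 6e-6, 2L -> 1.3e-5, 3L -> 2.1e-5)

// Keep only the row indices whose MSE exceeds the threshold
val flagged: Seq[Long] = mseByRow.collect {
  case (idRow, mse) if mse > threshold => idRow // suspected fraud
}

println(s"suspected fraudulent rows: $flagged")
```

Note that the choice of threshold is a trade-off: lowering it catches more fraud at the cost of more false alarms on regular transactions.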