Step 8 - Anomaly detection

We can also ask which instances within our test data were considered outliers or anomalies. Based on the autoencoder model trained earlier, the input data is reconstructed, and for each instance the MSE between the actual values and the reconstruction is calculated. I am also calculating the mean MSE for both class labels.
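The per-row scores in test_dim_score are assumed to come from H2O's scoreAutoEncoder() in the preceding step; a minimal sketch, where model (the trained autoencoder) and test (the H2O test frame) are assumed names:

import water.Key
// Assumed: scoreAutoEncoder() returns a one-column H2O frame of
// per-row reconstruction MSEs for the given input frame
val test_dim_score = model.scoreAutoEncoder(test, Key.make(), false)

Starting from these scores, we build a Spark DataFrame carrying the class labels and a row index: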

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// Attach the true class labels to the frame of reconstruction errors
test_dim_score.add("Class", test.vec("Class"))
// Convert the H2O frame to a Spark RDD of Rows, appending a row index
val testDF = asDataFrame(test_dim_score).rdd.zipWithIndex
  .map(r => Row.fromSeq(r._1.toSeq :+ r._2))
val schema = StructType(Array(
  StructField("Reconstruction-MSE", DoubleType, nullable = false),
  StructField("Class", ByteType, nullable = false),
  StructField("idRow", LongType, nullable = false)))
val dffd = spark.createDataFrame(testDF, schema)
dffd.show()
>>>
Figure 15: DataFrame showing MSE, class, and row ID
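The mean MSE per class mentioned earlier can now be read off this DataFrame; a minimal sketch using Spark's avg aggregate:

import org.apache.spark.sql.functions.avg
// Average reconstruction error per class label: regular transactions (Class = 0)
// are expected to show a much lower mean than fraudulent ones (Class = 1)
dffd.groupBy("Class")
  .agg(avg("Reconstruction-MSE").alias("Mean-Reconstruction-MSE"))
  .show()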

Looking at this DataFrame alone, it is really difficult to spot outliers. Plotting the scores provides more insight:

Vegas("Reduced Test", width = 800, height = 600).withDataFrame(dffd).mark(Point).encodeX("idRow", Quantitative).encodeY("Reconstruction-MSE", Quantitative).encodeColor(field = "Class", dataType = Nominal).show
>>>
Figure 16: Distribution of the reconstructed MSE, across different row IDs

As we can see in the plot, there is no perfect separation between fraudulent and non-fraudulent cases, but the mean MSE is clearly higher for fraudulent transactions than for regular ones. Still, a minimum of interpretation is required.

From the preceding figure, we can at least see that most of the idRows have a very low MSE. If we set the MSE threshold at 10µ, then the data points exceeding this threshold can be considered outliers or anomalies, that is, fraudulent transactions.
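To make that rule concrete, here is a minimal sketch that flags the transactions above the threshold; reading 10µ as 1.0e-5 is an assumption, and the cut-off should really be tuned on validation data:

import org.apache.spark.sql.functions.col
// Assumed cut-off: 10µ read as 1.0e-5; tune against validation data in practice
val threshold = 1.0e-5
val anomalies = dffd.filter(col("Reconstruction-MSE") > threshold)
println(s"Flagged as anomalous: ${anomalies.count()} of ${dffd.count()} rows")
anomalies.show()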
