Step 4 - Exploratory analysis of the input data

As described earlier, the dataset contains numerical input variables V1 to V28, which are the result of a PCA transformation of the original features. The response variable Class tells us whether a transaction was fraudulent (value = 1) or not (value = 0).

There are two additional features, Time and Amount. The Time column gives the time in seconds elapsed between the current transaction and the first transaction in the dataset, whereas the Amount column gives how much money was transferred in the transaction. Let's take a glimpse of the input data (only V1, V2, V26, and V27 are shown, though) in Figure 6:

Figure 6: A snapshot of the credit card fraud detection dataset

We have been able to load the transactions, but the preceding DataFrame does not tell us anything about the class distribution. So, let's compute the class distribution and plot it:

val distribution = transactions.groupBy("Class").count.collect
Vegas("Class Distribution").withData(distribution.map(r => Map("class" -> r(0), "count" -> r(1)))).encodeX("class", Nom).encodeY("count", Quant).mark(Bar).show
>>>
Figure 7: Class distribution in the credit card fraud detection dataset
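
A bar chart of raw counts can hide how small the minority class is, so it may help to express the counts as shares. The following is a minimal sketch in plain Scala; the object name, the shares helper, and the toy counts are hypothetical and are not taken from the dataset:

```scala
object ClassShareSketch {
  // Converts per-class counts into fractions of the total.
  def shares(counts: Map[Int, Long]): Map[Int, Double] = {
    val total = counts.values.sum.toDouble
    counts.map { case (cls, n) => cls -> n / total }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical counts, for illustration only (not the real distribution).
    val counts = Map(0 -> 990L, 1 -> 10L)
    shares(counts).toSeq.sortBy(_._1).foreach { case (cls, share) =>
      println(f"class $cls: ${share * 100}%.2f%%")
    }
  }
}
```

With the real data, the counts would presumably come from the collected distribution rows instead of a hand-written Map.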

Now, let's see whether time contributes to suspicious transactions. The Time column tells us the order in which the transactions were made, but nothing about the actual time of day at which they occurred. It is therefore useful to normalize the values by day and then bin them into four groups according to the time of day. As a first step, I have written a UDF that builds a day column from Time:

val daysUDf = udf((s: Double) =>
  if (s > 3600 * 24) "day2"
  else "day1")

val t1 = transactions.withColumn("day", daysUDf(col("Time")))
val dayDist = t1.groupBy("day").count.collect

Now let's plot it:

Vegas("Day Distribution").withData(dayDist.map(r => Map("day" -> r(0), "count" -> r(1)))).encodeX("day", Nom).encodeY("count", Quant).mark(Bar).show
>>>
Figure 8: Day distribution in the credit card fraud detection dataset

The preceding graph shows that roughly the same number of transactions was made on each of the two days, with slightly more on day1. Now let's build the dayTime column, which holds the number of seconds elapsed since the start of the respective day. Again, I have written a UDF for it:

val dayTimeUDf = udf((day: String, t: Double) => if (day == "day2") t - 86400 else t)
val t2 = t1.withColumn("dayTime", dayTimeUDf(col("day"), col("Time")))

t2.describe("dayTime").show()
>>>
+-------+------------------+
|summary|           dayTime|
+-------+------------------+
|  count|            284807|
|   mean|   52336.926072744|
| stddev|21049.288810608432|
|    min|               0.0|
|    max|           86400.0|
+-------+------------------+
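
The maximum of 86400.0 in the summary follows from the strict comparison in the day UDF: a Time of exactly 86,400 seconds still counts as day1 and is therefore not shifted. The sketch below mirrors the logic of the two UDFs as plain Scala functions so that the boundary can be checked without Spark; the object and function names and the sample values are hypothetical:

```scala
object DayTimeSketch {
  val SecondsPerDay: Double = 3600 * 24

  // Same rule as the day UDF: strictly more than one day means "day2".
  def dayOf(s: Double): String =
    if (s > SecondsPerDay) "day2" else "day1"

  // Same rule as the dayTime UDF: shift day2 values back by one day.
  def dayTimeOf(s: Double): Double =
    if (dayOf(s) == "day2") s - SecondsPerDay else s

  def main(args: Array[String]): Unit = {
    // Hypothetical sample times in seconds, chosen around the boundary.
    val samples = Seq(0.0, 43200.0, 86400.0, 86401.0, 172000.0)
    samples.foreach { s =>
      println(s"Time=$s -> day=${dayOf(s)}, dayTime=${dayTimeOf(s)}")
    }
  }
}
```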

Now we need to get the quartiles (q1, median, q3) for each day and build the time bins (gr1, gr2, gr3, and gr4):


val d1 = t2.filter($"day" === "day1")
val d2 = t2.filter($"day" === "day2")
val quantiles1 = d1.stat.approxQuantile("dayTime", Array(0.25, 0.5, 0.75), 0)

val quantiles2 = d2.stat.approxQuantile("dayTime", Array(0.25, 0.5, 0.75), 0)

val bagsUDf = udf((t: Double) =>
  if (t <= (quantiles1(0) + quantiles2(0)) / 2) "gr1"
  else if (t <= (quantiles1(1) + quantiles2(1)) / 2) "gr2"
  else if (t <= (quantiles1(2) + quantiles2(2)) / 2) "gr3"
  else "gr4")

val t3 = t2.drop(col("Time")).withColumn("Time", bagsUDf(col("dayTime")))
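
The binning rule averages the corresponding quartiles of the two days and uses the three averages as cutoffs. Here is a minimal sketch of that rule in plain Scala, assuming made-up quartile values; in the listing above the real values come from approxQuantile, and the cutoffs and bin names here are only illustrative helpers:

```scala
object TimeBinSketch {
  // Hypothetical quartiles in seconds since the start of the day.
  val quantiles1: Array[Double] = Array(30000.0, 50000.0, 70000.0)
  val quantiles2: Array[Double] = Array(40000.0, 60000.0, 80000.0)

  // Average the matching quartiles of both days to get three cutoffs.
  val cutoffs: Array[Double] =
    quantiles1.zip(quantiles2).map { case (a, b) => (a + b) / 2 }

  // Assign a time of day to one of the four bins.
  def bin(t: Double): String =
    if (t <= cutoffs(0)) "gr1"
    else if (t <= cutoffs(1)) "gr2"
    else if (t <= cutoffs(2)) "gr3"
    else "gr4"

  def main(args: Array[String]): Unit = {
    Seq(1000.0, 36000.0, 60000.0, 86000.0).foreach { t =>
      println(s"dayTime=$t -> ${bin(t)}")
    }
  }
}
```

Computing the cutoffs once, outside the function, avoids repeating the averaging for every row.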

Then let's get the distribution for class 0 and 1:

val grDist = t3.groupBy("Time", "Class").count.collect
val grDistByClass = grDist.groupBy(_(1))
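
Grouping the collected rows by their second field yields a map from class label to the rows of that class. The sketch below reproduces the idea on plain tuples, with getOrElse as a defensive lookup for a class that might be absent; the tuple layout, the countsFor helper, and the toy counts are assumptions made for illustration:

```scala
object GroupByClassSketch {
  // Hypothetical (timeBin, class, count) triples standing in for collected rows.
  val grDist: Seq[(String, Int, Long)] = Seq(
    ("gr1", 0, 100L), ("gr2", 0, 120L),
    ("gr1", 1, 3L), ("gr2", 1, 1L)
  )

  // Group by the class label, the second element of each triple.
  val byClass: Map[Int, Seq[(String, Int, Long)]] = grDist.groupBy(_._2)

  // Return (bin, count) pairs for a class, or an empty Seq if it is absent.
  def countsFor(cls: Int): Seq[(String, Long)] =
    byClass.getOrElse(cls, Seq.empty).map { case (bin, _, n) => (bin, n) }

  def main(args: Array[String]): Unit = {
    println(countsFor(0))
    println(countsFor(1))
  }
}
```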

Now let's plot the group distribution for class 0:

Vegas("gr Distribution").withData(grDistByClass.get(0).get.map(r => Map("Time" -> r(0), "count" -> r(2)))).encodeX("Time", Nom).encodeY("count", Quant).mark(Bar).show
>>>
Figure 9: Group distribution for class 0 in the credit card fraud detection dataset

From the preceding graph, it is clear that most of the transactions are normal. Now let's see the group distribution for class 1:

Vegas("gr Distribution").withData(grDistByClass.get(1).get.map(r => Map("Time" -> r(0), "count" -> r(2)))).encodeX("Time", Nom).encodeY("count", Quant).mark(Bar).show
>>>
Figure 10: Group distribution for class 1 in the credit card fraud detection dataset

So, the distribution of transactions over the four Time bins shows that the majority of fraud cases happened in group 1 (gr1). We can also look at the distribution of the amounts of money that were transferred:

val c0Amount = t3.filter($"Class" === "0").select("Amount")
val c1Amount = t3.filter($"Class" === "1").select("Amount")

println(c0Amount.stat.approxQuantile("Amount", Array(0.25, 0.5, 0.75), 0).mkString(","))

Vegas("Amounts for class 0").withDataFrame(c0Amount).mark(Bar).encodeX("Amount", Quantitative, bin = Bin(50.0)).encodeY(field = "*", Quantitative, aggregate = AggOps.Count).show
>>>
Figure 11: Distribution of the amounts of money that were transferred for class 0

Now let's plot the same for class 1:

Vegas("Amounts for class 1").withDataFrame(c1Amount).mark(Bar).encodeX("Amount", Quantitative, bin = Bin(50.0)).encodeY(field = "*", Quantitative, aggregate = AggOps.Count).show
>>>
Figure 12: Distribution of the amounts of money that were transferred for class 1
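
Histograms can be complemented with per-class summary statistics such as the mean and the maximum amount. The following sketch computes both on plain Scala collections; the Tx and Stats case classes and the toy amounts are hypothetical. On the DataFrame itself, one would presumably group by Class and aggregate with avg and max from org.apache.spark.sql.functions instead:

```scala
object AmountStatsSketch {
  final case class Tx(cls: Int, amount: Double)
  final case class Stats(mean: Double, max: Double)

  // Mean and maximum amount for each class label.
  def statsByClass(txs: Seq[Tx]): Map[Int, Stats] =
    txs.groupBy(_.cls).map { case (cls, group) =>
      val amounts = group.map(_.amount)
      cls -> Stats(amounts.sum / amounts.size, amounts.max)
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical amounts, for illustration only.
    val txs = Seq(Tx(0, 10.0), Tx(0, 30.0), Tx(1, 50.0))
    statsByClass(txs).toSeq.sortBy(_._1).foreach { case (cls, s) =>
      println(s"class $cls: mean=${s.mean}, max=${s.max}")
    }
  }
}
```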

From the preceding two graphs, it can be observed that fraudulent credit card transactions had a higher mean amount transferred, but a much lower maximum amount, compared to regular transactions. The dayTime column that we constructed manually turned out not to be that significant, so we can simply drop it, together with the helper day column:

val t4 = t3.drop("day").drop("dayTime")