Autoencoder code implementation for credit card fraud detection

As with our other projects, let's first load the data into an R dataframe and then perform EDA to understand the dataset better. Note the inclusion of the h2o and doParallel libraries in the code: h2o provides the autoencoder (AE) implementation we will use, and doParallel lets us make use of the multiple CPU cores present in the laptop/desktop, as follows:

# including the required libraries
library(tidyverse)
library(h2o)
library(rio)
library(doParallel)
library(viridis)
library(RColorBrewer)
library(ggthemes)
library(knitr)
library(caret)
library(caretEnsemble)
library(plotly)
library(lime)
library(plotROC)
library(pROC)

Next, we initialize the H2O cluster on localhost at port 54321. The nthreads parameter defines the size of the thread pool, which is roughly the number of CPUs to use; setting it to -1 tells H2O to use all available CPUs. We also cap the maximum memory available to the H2O cluster at 8 GB:

localH2O = h2o.init(ip = 'localhost', port = 54321, nthreads = -1,max_mem_size = "8G")
# Detecting the available number of cores
no_cores <- detectCores() - 1
# utilizing all available cores
cl<-makeCluster(no_cores)
registerDoParallel(cl)

You will get output similar to that shown in the following code block:

H2O is not running yet, starting it now...
Note: In case of errors look at the following log files:
/tmp/RtmpKZvQ3m/h2o_sunil_started_from_r.out
/tmp/RtmpKZvQ3m/h2o_sunil_started_from_r.err
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
Starting H2O JVM and connecting: ..... Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 4 seconds 583 milliseconds
H2O cluster timezone: Asia/Kolkata
H2O data parsing timezone: UTC
H2O cluster version: 3.20.0.8
H2O cluster version age: 2 months and 27 days
H2O cluster name: H2O_started_from_R_sunil_jgw200
H2O cluster total nodes: 1
H2O cluster total memory: 7.11 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.5.1 (2018-07-02)

Now, set the working directory to the data file's location, load the Rdata file into a dataframe, and view the dataframe using the following code:

# setting the working directory where the data file is located
setwd("/home/sunil/Desktop/book/chapter 20")
# loading the Rdata file and reading it into the dataframe called cc_fraud
cc_fraud<-get(load("creditcard.Rdata"))
# performing basic EDA on the dataset
# Viewing the dataframe to confirm successful load of the dataset
View(cc_fraud)

This will give the following output:

Let's now print the dataframe structure using the following code:

print(str(cc_fraud))

This will give the following output:

'data.frame':     284807 obs. of  31 variables:
$ Time : num 0 0 1 1 2 2 4 7 7 9 ...
$ V1 : num -1.36 1.192 -1.358 -0.966 -1.158 ...
$ V2 : num -0.0728 0.2662 -1.3402 -0.1852 0.8777 ...
$ V3 : num 2.536 0.166 1.773 1.793 1.549 ...
$ V4 : num 1.378 0.448 0.38 -0.863 0.403 ...
$ V5 : num -0.3383 0.06 -0.5032 -0.0103 -0.4072 ...
$ V6 : num 0.4624 -0.0824 1.8005 1.2472 0.0959 ...
$ V7 : num 0.2396 -0.0788 0.7915 0.2376 0.5929 ...
$ V8 : num 0.0987 0.0851 0.2477 0.3774 -0.2705 ...
$ V9 : num 0.364 -0.255 -1.515 -1.387 0.818 ...
$ V10 : num 0.0908 -0.167 0.2076 -0.055 0.7531 ...
$ V11 : num -0.552 1.613 0.625 -0.226 -0.823 ...
$ V12 : num -0.6178 1.0652 0.0661 0.1782 0.5382 ...
$ V13 : num -0.991 0.489 0.717 0.508 1.346 ...
$ V14 : num -0.311 -0.144 -0.166 -0.288 -1.12 ...
$ V15 : num 1.468 0.636 2.346 -0.631 0.175 ...
$ V16 : num -0.47 0.464 -2.89 -1.06 -0.451 ...
$ V17 : num 0.208 -0.115 1.11 -0.684 -0.237 ...
$ V18 : num 0.0258 -0.1834 -0.1214 1.9658 -0.0382 ...
$ V19 : num 0.404 -0.146 -2.262 -1.233 0.803 ...
$ V20 : num 0.2514 -0.0691 0.525 -0.208 0.4085 ...
$ V21 : num -0.01831 -0.22578 0.248 -0.1083 -0.00943 ...
$ V22 : num 0.27784 -0.63867 0.77168 0.00527 0.79828 ...
$ V23 : num -0.11 0.101 0.909 -0.19 -0.137 ...
$ V24 : num 0.0669 -0.3398 -0.6893 -1.1756 0.1413 ...
$ V25 : num 0.129 0.167 -0.328 0.647 -0.206 ...
$ V26 : num -0.189 0.126 -0.139 -0.222 0.502 ...
$ V27 : num 0.13356 -0.00898 -0.05535 0.06272 0.21942 ...
$ V28 : num -0.0211 0.0147 -0.0598 0.0615 0.2152 ...
$ Amount: num 149.62 2.69 378.66 123.5 69.99 ...
$ Class : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

Now, to view the class distribution, use the following code:

print(table(cc_fraud$Class))

You will get the following output:

     0      1 
284315    492 

To view the relationship between the V1 and Class variables, use the following code:

# Printing the Histograms for Multivariate analysis
theme_set(theme_economist_white())
# visualization showing the relationship between variable V1 and the class
ggplot(cc_fraud,aes(x="",y=V1,fill=Class))+geom_boxplot()+labs(x="V1",y="")

This will give the following output:

To visualize the distribution of transaction amounts with respect to class, use the following code:

# visualization showing the distribution of transaction amount with
# respect to the class; observe that the amounts are discretized
# into 50 bins for plotting purposes
ggplot(cc_fraud,aes(x = Amount)) + geom_histogram(color = "#D53E4F", fill = "#D53E4F", bins = 50) + facet_wrap( ~ Class, scales = "free", ncol = 2)

This will give the following output:

To visualize the distribution of transaction times with respect to class, use the following code:

ggplot(cc_fraud, aes(x =Time,fill = Class))+ geom_histogram(bins = 30)+
facet_wrap( ~ Class, scales = "free", ncol = 2)

This will give the following output:

Use the following code to visualize the V2 variable with respect to Class:

ggplot(cc_fraud, aes(x =V2, fill=Class))+ geom_histogram(bins = 30)+
facet_wrap( ~ Class, scales = "free", ncol = 2)

You will get the following as the output:

Use the following code to visualize V3 with respect to Class:

ggplot(cc_fraud, aes(x =V3, fill=Class))+ geom_histogram(bins = 30)+
facet_wrap( ~ Class, scales = "free", ncol = 2)

The following graph is the resultant output:

To visualize the V4 variable with respect to Class, use the following code:

ggplot(cc_fraud, aes(x =V4,fill=Class))+ geom_histogram(bins = 30)+
facet_wrap( ~ Class, scales = "free", ncol = 2)

The following graph is the resultant output:

Use the following code to visualize the V6 variable with respect to Class:

ggplot(cc_fraud, aes(x=V6, fill=Class)) + geom_density(alpha=1/3) + scale_fill_hue()

The following graph is the resultant output:

Use the following code to visualize the V7 variable with respect to Class:

ggplot(cc_fraud, aes(x=V7, fill=Class)) + geom_density(alpha=1/3) + scale_fill_hue()

The following graph is the resultant output:

Use the following code to visualize the V8 variable with respect to Class:

ggplot(cc_fraud, aes(x=V8, fill=Class)) + geom_density(alpha=1/3) + scale_fill_hue()

The following graph is the resultant output:

To visualize the V9 variable with respect to Class, use the following code:

# visualization showing the V9 variable with respect to the class
ggplot(cc_fraud, aes(x=V9, fill=Class)) + geom_density(alpha=1/3) + scale_fill_hue()

The following graph is the resultant output:

To visualize the V10 variable with respect to Class, use the following code:

# observe we are plotting the data quantiles
ggplot(cc_fraud, aes(x ="",y=V10, fill=Class))+ geom_violin(adjust = .5,draw_quantiles = c(0.25, 0.5, 0.75))+labs(x="V10",y="")

The following graph is the resultant output:

From all the visualizations related to variables with respect to class, we can infer that most of the principal components are centered on 0. Now, to plot the distribution of classes in the data, use the following code:

cc_fraud %>%
ggplot(aes(x = Class)) +
geom_bar(color = "chocolate", fill = "chocolate", width = 0.2) +
theme_bw()

The following bar graph is the resultant output:

We observe that the distribution of classes is very imbalanced. The majority class (non-fraudulent transactions, represented by 0) heavily outweighs the minority class (fraudulent transactions, represented by 1). In the traditional supervised ML way of dealing with this kind of problem, we would treat the class imbalance with techniques such as the Synthetic Minority Over-sampling Technique (SMOTE). With AEs, however, we do not treat the class imbalance during data preprocessing; rather, we feed the data as is to the AE for learning. The AE learns the thresholds and characteristics of the data from the majority class, which is why this is called a one-class classification problem.
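
To quantify just how skewed the distribution is, we can print the class proportions with base R's table and prop.table functions; based on the counts shown earlier, the fraudulent class accounts for roughly 0.17% of all transactions:

# printing the proportion of transactions per class
# (492 fraudulent out of 284,807 transactions, roughly 0.17%)
print(prop.table(table(cc_fraud$Class)))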

We will need to do some feature engineering prior to training our AE. Let's first focus on the Time variable in the data. Currently, it is in the seconds format, but we may better represent it as days. Run the following code to see the current form of time in the dataset:

print(summary(cc_fraud$Time))

You will get the following output:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0   54202   84692   94814  139320  172792 

We know that there are 86,400 seconds in a day (60 seconds per minute * 60 minutes per hour * 24 hours per day). We will convert the Time variable into a Day variable: a transaction is tagged day1 if it falls within the first 86,400 seconds, and day2 otherwise. Only two days are possible, as the summary shows that the maximum value of the Time variable is 172,792 seconds:

# creating a new variable called day based on the seconds 
# represented in Time variable
cc_fraud=cc_fraud %>% mutate(Day = case_when(.$Time > 3600 * 24 ~ "day2",.$Time < 3600 * 24 ~ "day1"))
#visualizing the dataset post creating the new variable
View(cc_fraud%>%head())

The following is the resultant output of the first six rows after the conversion:

Now, use the following code to view the last six rows:

View(cc_fraud%>%tail())

The following is the resultant output of the last six rows after the conversion:

Now, let's print the distribution of transactions by the day in which the transaction falls, using the following code:

print(table(cc_fraud[,"Day"]))

You will get the following as the output:

  day1   day2
144786 140020

Let's create a new variable, Time_day, based on the seconds represented in the Time variable, and summarize the Time_day variable with respect to Day using the following code:

cc_fraud$Time_day <- if_else(cc_fraud$Day == "day2", cc_fraud$Time - 86400, cc_fraud$Time)
print(tapply(cc_fraud$Time_day,cc_fraud$Day,summary,simplify = FALSE))

We get the following as the resultant output:

$day1
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0   38432   54689   52948   70976   86398 

$day2
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1   37843   53425   51705   68182   86392 

Use the following code to convert all character variables in the dataset to factors:

cc_fraud<-cc_fraud%>%mutate_if(is.character,as.factor)

We can further refine the Time_day variable by bucketing it into a factor that represents the time of day at which the transaction happened, for example, morning, afternoon, evening, or night. We create a new variable called Time_Group based on these buckets of the day, using the following code:

cc_fraud=cc_fraud %>% 
mutate(Time_Group = case_when(.$Time_day <= 38138~ "morning" ,
.$Time_day <= 52327~ "afternoon",
.$Time_day <= 69580~"evening",
.$Time_day > 69580~"night"))
#Visualizing the data post creating the new variable
View(head(cc_fraud))

The following is the resultant output of the first six rows:

Use the following code to view and confirm the last six rows:

View(tail(cc_fraud))

This will give the following output, and we can see that the transactions have been successfully bucketed into the various times of the day:

Take a look at the following code:

#visualizing the transaction count by day
cc_fraud %>%drop_na()%>%
ggplot(aes(x = Day)) +
geom_bar(fill = "chocolate",width = 0.3,color="chocolate") +
theme_economist_white()

The preceding code will generate the following output:

We can infer from the visualization that there is no significant difference in the count of transactions between day 1 and day 2; both are in the region of 140,000-145,000 transactions.

Now we will convert the Class variable as a factor and then visualize the data by Time_Group variable using the following code:

cc_fraud$Class <- factor(cc_fraud$Class)
cc_fraud %>%drop_na()%>%
ggplot(aes(x = Time_Group)) +
geom_bar(color = "#238B45", fill = "#238B45") +
theme_bw() +
facet_wrap( ~ Class, scales = "free", ncol = 2)

This will generate the following output:

The inference obtained from this visualization is that the number of non-fraudulent transactions remains almost the same across all time periods of the day, whereas we see a huge rise in the number of fraudulent transactions during the morning Time group.
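
If we want to back this visual inference with numbers, one option is to tabulate the fraudulent transactions by time group. The following is a small sketch using dplyr verbs on the columns created earlier in this section (the resulting counts are not reproduced here):

# counting only the fraudulent transactions (Class == "1") per Time_Group
cc_fraud %>%
  drop_na() %>%
  filter(Class == "1") %>%
  count(Time_Group)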

Let's do a last bit of exploration of the transaction amount with respect to class:

# getting the summary of amount with respect to the class
print(tapply(cc_fraud$Amount ,cc_fraud$Class,summary))

The preceding code will generate the following output:

$`0`
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    0.00     5.65    22.00    88.29    77.05 25691.16 

$`1`
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    0.00     1.00     9.25   122.21   105.89  2125.87 

One interesting insight from the summary is that the mean amount of fraudulent transactions is higher than that of genuine transactions. However, the maximum amount seen among fraudulent transactions is much lower than that of genuine transactions. It can also be seen that genuine transactions have a higher median amount.

Now, let's convert our R dataframe to an H2O dataframe to apply the AE to it. This is a requirement in order to use the functions from the h2o library:

# converting R dataframe to H2O dataframe
cc_fraud_h2o <- as.h2o(cc_fraud)
#splitting the data into 60%, 20%, 20% chunks to use them as training,
#validation and test datasets
splits <- h2o.splitFrame(cc_fraud_h2o,ratios = c(0.6, 0.2), seed = 148)
# creating new train, validation and test h2o dataframes
train <- splits[[1]]
validation <- splits[[2]]
test <- splits[[3]]
# getting the target and features name in vectors
target <- "Class"
features <- setdiff(colnames(train), target)

The tanh activation function is a rescaled and shifted logistic function. Other activation functions, such as ReLU and Maxout, are also provided by the h2o library and can be used instead. For the first AE model, let's use the tanh activation function; this choice is arbitrary, and other activation functions may be tried as desired.
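
The relationship between tanh and the logistic (sigmoid) function can be checked numerically in a couple of lines of base R, since tanh(x) = 2 * sigmoid(2x) - 1:

# quick numerical check that tanh is a rescaled and shifted logistic function
sigmoid <- function(z) 1 / (1 + exp(-z))    # same as base R's plogis()
x <- seq(-3, 3, by = 0.5)
all.equal(tanh(x), 2 * sigmoid(2 * x) - 1)  # returns TRUE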

The h2o.deeplearning function has an autoencoder parameter, which should be set to TRUE to train an AE model. Let's build our AE model now:

model_one = h2o.deeplearning(x = features, training_frame = train,
autoencoder = TRUE,
reproducible = TRUE,
seed = 148,
hidden = c(10,10,10), epochs = 100,
activation = "Tanh",
validation_frame = test)

The preceding code generates the following output:

 |===========================================================================================================================| 100%

We will save the model so we do not have to retrain it again and again. We then load the model persisted on disk and print it to verify the AE learning, using the following code:

h2o.saveModel(model_one, path="model_one", force = TRUE)
model_one<-h2o.loadModel("/home/sunil/model_one/DeepLearning_model_R_1544970545051_1")
print(model_one)

This will generate the following output:

Model Details:
==============
H2OAutoEncoderModel: deeplearning
Model ID: DeepLearning_model_R_1544970545051_1
Status of Neuron Layers: auto-encoder, gaussian distribution, Quadratic loss, 944 weights/biases, 20.1 KB, 2,739,472 training samples, mini-batch size 1
layer units type dropout l1 l2 mean_rate rate_rms momentum mean_weight weight_rms mean_bias bias_rms
1 1 34 Input 0.00 % NA NA NA NA NA NA NA NA NA
2 2 10 Tanh 0.00 % 0.000000 0.000000 0.610547 0.305915 0.000000 -0.000347 0.309377 -0.028166 0.148318
3 3 10 Tanh 0.00 % 0.000000 0.000000 0.181705 0.103598 0.000000 0.022774 0.262611 -0.056455 0.099918
4 4 10 Tanh 0.00 % 0.000000 0.000000 0.133090 0.079663 0.000000 0.000808 0.337259 0.032588 0.101952
5 5 34 Tanh NA 0.000000 0.000000 0.116252 0.129859 0.000000 0.006941 0.357547 0.167973 0.688510
H2OAutoEncoderMetrics: deeplearning
Reported on training data.
Training Set Metrics:
=====================
MSE: (Extract with `h2o.mse`) 0.0003654009
RMSE: (Extract with `h2o.rmse`) 0.01911546
H2OAutoEncoderMetrics: deeplearning
Reported on validation data.
Validation Set Metrics:
=====================
MSE: (Extract with `h2o.mse`) 0.0003508435
RMSE: (Extract with `h2o.rmse`) 0.01873082

We will now make predictions on the test dataset using the AE model we just built, with the following code:

test_autoencoder <- h2o.predict(model_one, test)

This will generate the following output:

|===========================================================================================================================| 100%

Through the h2o.deepfeatures function, it is possible to visualize the condensed representation of the data learned in the inner layers of the encoder. Let's visualize the reduced data in the second hidden layer:

train_features <- h2o.deepfeatures(model_one, train, layer = 2) %>%
as.data.frame() %>%
mutate(Class = as.vector(train[, 31]))
# printing the reduced data represented in layer2
print(train_features%>%head(3))

The preceding code will generate the following output:

     DF.L2.C1  DF.L2.C2     DF.L2.C3    DF.L2.C4   DF.L2.C5
1 -0.12899115 0.1312075  0.115971952 -0.12997648 0.23081912
2 -0.10437942 0.1832959  0.006427409 -0.08018725 0.05575977
3 -0.07135827 0.1705700 -0.023808057 -0.11383244 0.10800857
   DF.L2.C6   DF.L2.C7    DF.L2.C8  DF.L2.C9  DF.L2.C10 Class
1 0.1791547 0.10325721  0.05589069 0.5607497 -0.9038150     0
2 0.1588236 0.11009450 -0.04071038 0.5895413 -0.8949729     0
3 0.1676358 0.10703990 -0.03263755 0.5762191 -0.8989759     0

Let us now plot the data of DF.L2.C1 with respect to DF.L2.C2 to verify if the encoder has detected the fraudulent transactions, using the following code:

ggplot(train_features, aes(x = DF.L2.C1, y = DF.L2.C2, color = Class)) +
geom_point(alpha = 0.1,size=1.5)+theme_bw()+
scale_fill_brewer(palette = "Accent")

This will generate the following output:

We again plot the data, this time DF.L2.C3 against DF.L2.C4, to verify whether the encoder has detected any fraudulent transactions, using the following code:

ggplot(train_features, aes(x = DF.L2.C3, y = DF.L2.C4, color = Class)) +
geom_point(alpha = 0.1,size=1.5)+theme_bw()+
scale_fill_brewer(palette = "Accent")

The preceding code will generate the following output:

We can see from the two visualizations that the fraudulent transactions are indeed picked out by the dimensionality reduction achieved with our AE model; the few scattered dots (represented by 1) depict the detected fraudulent transactions. We can also train a new model using one of the hidden layers of our first model as its input. This results in 10 columns, since the third layer has 10 nodes. We are simply slicing out one layer where some level of reduction has been done and using that representation to build a new model:

# let's consider the third hidden layer. This is again an arbitrary choice;
# in fact, we could have taken any of the three hidden layers
train_features <- h2o.deepfeatures(model_one, validation, layer = 3) %>%
as.data.frame() %>%
mutate(Class = as.factor(as.vector(validation[, 31]))) %>%
as.h2o()

The preceding code will generate the following output:

|===========================================================================================================================| 100%
|===========================================================================================================================| 100%

As we can see, the reduced training data has been created successfully. We will now go ahead and train the new model, save it, and then print it. First, we get the feature names from the sliced encoder layer:

features_two <- setdiff(colnames(train_features), target)

Then we train the new model:

model_two <- h2o.deeplearning(y = target,
x = features_two,
training_frame = train_features,
reproducible = TRUE,
balance_classes = TRUE,
ignore_const_cols = FALSE,
seed = 148,
hidden = c(10, 5, 10),
epochs = 100,
activation = "Tanh")

We will then save the model to avoid retraining again, then retrieve the model and print it using the following code:

h2o.saveModel(model_two, path="model_two", force = TRUE)
model_two <- h2o.loadModel("/home/sunil/model_two/DeepLearning_model_R_1544970545051_2")
print(model_two)

This will generate the following output:

Model Details:
==============
H2OBinomialModel: deeplearning
Model ID: DeepLearning_model_R_1544970545051_2
Status of Neuron Layers: predicting Class, 2-class classification, bernoulli distribution, CrossEntropy loss, 247 weights/biases, 8.0 KB, 2,383,962 training samples, mini-batch size 1
layer units type dropout l1 l2 mean_rate rate_rms momentum mean_weight weight_rms mean_bias bias_rms
1 1 10 Input 0.00 % NA NA NA NA NA NA NA NA NA
2 2 10 Tanh 0.00 % 0.000000 0.000000 0.001515 0.001883 0.000000 -0.149216 0.768610 -0.038682 0.891455
3 3 5 Tanh 0.00 % 0.000000 0.000000 0.003293 0.004916 0.000000 -0.251950 0.885017 -0.307971 0.531144
4 4 10 Tanh 0.00 % 0.000000 0.000000 0.002252 0.001780 0.000000 0.073398 1.217405 -0.354956 0.887678
5 5 2 Softmax NA 0.000000 0.000000 0.007459 0.007915 0.000000 -0.095975 3.579932 0.223286 1.172508
H2OBinomialMetrics: deeplearning
Reported on training data.
Metrics reported on temporary training frame with 9892 samples
MSE: 0.1129424
RMSE: 0.336069
LogLoss: 0.336795
Mean Per-Class Error: 0.006234916
AUC: 0.9983688
Gini: 0.9967377
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 4910 62 0.012470 =62/4972
1 0 4920 0.000000 =0/4920
Totals 4910 4982 0.006268 =62/9892
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.009908 0.993739 153
2 max f2 0.009908 0.997486 153
3 max f0point5 0.019214 0.990107 142
4 max accuracy 0.009908 0.993732 153
5 max precision 1.000000 1.000000 0
6 max recall 0.009908 1.000000 153
7 max specificity 1.000000 1.000000 0
8 max absolute_mcc 0.009908 0.987543 153
9 max min_per_class_accuracy 0.019214 0.989541 142
10 max mean_per_class_accuracy 0.009908 0.993765 153
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)

For measuring model performance on test data, we need to convert the test data to the same reduced dimensions as the training data:

test_3 <- h2o.deepfeatures(model_one, test, layer = 3)
print(test_3%>%head())

The preceding code will generate the following output:

|===========================================================================================================================| 100%

As we can see, the data has been converted successfully. Now, to make predictions on the test dataset with model_two, we use the following code:

test_pred=h2o.predict(model_two, test_3,type="response")%>%
as.data.frame() %>%
mutate(actual = as.vector(test[, 31]))

This will generate the following output:

|===========================================================================================================================| 100%

As we can see from the output, the predictions have been completed successfully. Let's now view the predictions using the following code:

test_pred%>%head()
predict p0 p1 actual
1 0 1.0000000 1.468655e-23 0
2 0 1.0000000 2.354664e-23 0
3 0 1.0000000 5.987218e-09 0
4 0 1.0000000 2.888583e-23 0
5 0 0.9999988 1.226122e-06 0
6 0 1.0000000 2.927614e-23 0
# summarizing the predictions
print(h2o.predict(model_two, test_3) %>%
as.data.frame() %>%
dplyr::mutate(actual = as.vector(test[, 31])) %>%
group_by(actual, predict) %>%
dplyr::summarise(n = n()) %>%
mutate(freq = n / sum(n)))

This will generate the following output:

|===========================================================================================================================| 100%
# A tibble: 4 x 4
# Groups: actual [2]
  actual predict     n   freq
  <chr>  <fct>   <int>  <dbl>
1 0      0       55811 0.986 
2 0      1         817 0.0144
3 1      0          41 0.414 
4 1      1          58 0.586 

We see that our AE is able to correctly predict non-fraudulent transactions with 98% accuracy, which is good. However, it is yielding only 58% accuracy when predicting fraudulent transactions. This is definitely something to focus on. Our model needs some improvement, and this can be accomplished through the following options: 

  • Using other layers' latent space representations as input to build model_two (recollect that we currently use the layer 3 representation)
  • Using ReLu or Maxout activation functions instead of Tanh
  • Checking the misclassified instances through the h2o.anomaly function and increasing or decreasing the cutoff threshold MSE value that separates the fraudulent transactions from non-fraudulent ones (a minimal sketch of this approach is shown after this list)
  • Trying out a more complex architecture in the encoder and decoder

We are not going to be attempting these options in this chapter as they are experimental in nature. However, interested readers may try and improve the accuracy of the model by trying these options.
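
As a starting point for the third option, the following is a minimal sketch of how h2o.anomaly could be used to score the per-row reconstruction error of model_one and apply a cutoff. The cutoff value of 0.02 is purely illustrative and would need to be tuned by inspecting the distribution of Reconstruction.MSE for each class:

# sketch only: per-row reconstruction MSE from the autoencoder (model_one)
anomaly <- h2o.anomaly(model_one, test) %>% as.data.frame()
anomaly$actual <- as.vector(test[, 31])
# illustrative cutoff; tune it by examining Reconstruction.MSE per class
cutoff <- 0.02
anomaly$predicted <- ifelse(anomaly$Reconstruction.MSE > cutoff, "1", "0")
print(table(actual = anomaly$actual, predicted = anomaly$predicted))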

Finally, one best practice is to explicitly shut down the h2o cluster. This can be accomplished with the following command:

h2o.shutdown()
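
Since we also registered a doParallel cluster at the beginning of the project, it is good practice to release those worker processes as well; assuming the cl object created earlier is still in scope, this can be done as follows:

# stopping the doParallel workers registered at the start of the project
stopCluster(cl)
registerDoSEQ()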
