Step 7 - Dimensionality reduction with hidden layers

Since we used a shallow autoencoder with two nodes in the middle hidden layer, it is worth using that dimensionality reduction to explore our feature space. We can extract these hidden features with the scoreDeepFeatures() method and plot them to show the reduced representation of the input data.

The scoreDeepFeatures() method scores an auto-encoded reconstruction on the fly and materializes the deep features of a given layer. It takes two parameters: frame, the original data (which may contain the response column; it will be ignored), and layer, the index of the hidden layer for which to extract the features. It returns a frame containing the deep features, where the number of columns corresponds to the size of that hidden layer.

Now, for the supervised training, we need to extract the deep features. Let's do it from layer 2 (the layer index is zero-based, so we pass 1):

var train_features = model_nn.scoreDeepFeatures(train_unsupervised, 1) 
train_features.add("Class", train_unsupervised.vec("Class"))

The plot for visually identifying potential clusters is produced as follows:

train_features.setNames(train_features.names.map(_.replaceAll("[.]", "-")))
train_features._key = Key.make()
water.DKV.put(train_features)
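As an aside, the renaming above is needed because scoreDeepFeatures() emits column names containing dots (for example, DF.L2.C1 for layer 2, component 1), and dots can clash with the Vegas/Vega-Lite field-name syntax. A minimal, self-contained sketch of that sanitization (the names here are illustrative):

```scala
// Column names as H2O emits them for deep features of layer 2 (illustrative).
val rawNames = Array("DF.L2.C1", "DF.L2.C2")

// Replace dots with hyphens, exactly as done on train_features above.
val safeNames = rawNames.map(_.replaceAll("[.]", "-"))
// safeNames: Array("DF-L2-C1", "DF-L2-C2")
```

These hyphenated names are what the encodeX/encodeY calls below refer to.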

val tfDataFrame = asDataFrame(train_features)

Vegas("Compressed").withDataFrame(tfDataFrame).mark(Point).
  encodeX("DF-L2-C1", Quantitative).
  encodeY("DF-L2-C2", Quantitative).
  encodeColor(field = "Class", dataType = Nominal).
  show
>>>
Figure 14: Eventual cluster for classes 0 and 1

From the preceding figure, we cannot see any cluster of fraudulent transactions that is distinct from the non-fraudulent instances, so dimensionality reduction with our autoencoder model alone is not sufficient to identify fraud in this dataset. However, we could use the reduced-dimensionality representation of one of the hidden layers as features for model training; for example, the 10 features from the first or third hidden layer. Now, let's extract the deep features from layer 3 (index 2):

train_features = model_nn.scoreDeepFeatures(train_unsupervised, 2)
train_features._key = Key.make()
train_features.add("Class", train_unsupervised.vec("Class"))
water.DKV.put(train_features)

val features_dim = train_features.names.filterNot(_ == response)
val train_features_H2O = asH2OFrame(train_features)

Now let's run unsupervised deep learning again, this time on the reduced-dimensionality dataset:

dlParams = new DeepLearningParameters()
dlParams._ignored_columns = Array(response)
dlParams._train = train_features_H2O
dlParams._autoencoder = true
dlParams._reproducible = true
dlParams._ignore_const_cols = false
dlParams._seed = 42
dlParams._hidden = Array[Int](10, 2, 10)
dlParams._epochs = 100
dlParams._activation = Activation.Tanh
dlParams._force_load_balance = false

dl = new DeepLearning(dlParams)
val model_nn_dim = dl.trainModel.get

We then save the model:

ModelSerializationSupport.exportH2OModel(model_nn_dim, new File(new File(inputCSV).getParentFile, "model_nn_dim.bin").toURI)

For measuring model performance on test data, we need to convert the test data to the same reduced dimensions as the training data:

val test_dim = model_nn.scoreDeepFeatures(test, 2)
val test_dim_score = model_nn_dim.scoreAutoEncoder(test_dim, Key.make(), false)
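The scoreAutoEncoder() call returns a per-row reconstruction MSE, and the confusionMat helper used below thresholds that error, here at the mean MSE (test_dim_score.anyVec.mean). A minimal pure-Scala sketch of that thresholding idea, with toy MSE values (the numbers are hypothetical, not taken from this dataset):

```scala
// Toy per-row reconstruction errors (hypothetical values).
val mse = Array(0.01, 0.02, 0.90, 0.03, 1.20)

// Threshold at the mean MSE, as in the confusionMat call above.
val threshold = mse.sum / mse.length

// Rows whose reconstruction error exceeds the threshold are flagged as anomalies.
val flagged = mse.map(_ > threshold)
// flagged: Array(false, false, true, false, true)
```

Rows that the autoencoder reconstructs poorly (high MSE) are the anomaly candidates, that is, the suspected fraud cases.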

val result = confusionMat(test_dim_score, test, test_dim_score.anyVec.mean)
println(result.deep.mkString("\n"))
>>>
Array(38767, 29)
Array(18103, 64)

In terms of identifying fraud cases, this looks reasonable: 64 of the 93 actual fraud cases (about 69%) were flagged, although at the cost of a large number of false positives (18,103 legitimate transactions flagged), so the threshold would need tuning in practice.
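Reading those rates off the printed matrix (assuming rows correspond to the predicted class and columns to the actual class, per the confusionMat helper) is simple arithmetic:

```scala
// Counts from the output above: (actual non-fraud, actual fraud) per predicted class.
val predictedNonFraud = Array(38767, 29)
val predictedFraud    = Array(18103, 64)

// Total actual fraud cases in the test set.
val fraudTotal = predictedNonFraud(1) + predictedFraud(1)   // 93

// Fraction of fraud cases correctly flagged (recall).
val recall = predictedFraud(1).toDouble / fraudTotal
```

With these counts, recall is 64/93, roughly 0.69.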
