Random forest and PAM

To perform this method in R, you can use the randomForest() function. After setting the random seed, simply create the model object. In the following code, I specify 2,000 trees and set proximity to TRUE so that the proximity matrix is returned. Note that you don't have to run this on scaled data, as random forest is insensitive to feature scale:

> set.seed(1918)

> rf <- randomForest::randomForest(x = wine[, -1], ntree = 2000, proximity = T)

> rf

Call:
randomForest(x = wine[, -1], ntree = 2000, proximity = T)
Type of random forest: unsupervised
Number of trees: 2000
No. of variables tried at each split: 3

As you can see, calling rf did not provide much meaningful output beyond the number of variables sampled at each split (mtry). The proximity matrix is N x N, with one row and one column per observation. Let's examine its dimensions and its first five rows and columns:

> dim(rf$proximity)
[1] 178 178

> rf$proximity[1:5, 1:5]
          1          2         3          4          5
1 1.0000000 0.27868852 0.4049296 0.36200717 0.12969283
2 0.2786885 1.00000000 0.2142857 0.12648221 0.04453441
3 0.4049296 0.21428571 1.0000000 0.26865672 0.14942529
4 0.3620072 0.12648221 0.2686567 1.00000000 0.07692308
5 0.1296928 0.04453441 0.1494253 0.07692308 1.00000000
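The proximity matrix is symmetric with ones on the diagonal, since every observation always shares a terminal node with itself. A quick base-R check on the 5 x 5 block printed above (the values are copied from the output, so this sketch does not require refitting the forest):

```r
# 5 x 5 block of rf$proximity, copied from the output above
p <- matrix(c(
  1.0000000, 0.27868852, 0.4049296, 0.36200717, 0.12969283,
  0.2786885, 1.00000000, 0.2142857, 0.12648221, 0.04453441,
  0.4049296, 0.21428571, 1.0000000, 0.26865672, 0.14942529,
  0.3620072, 0.12648221, 0.2686567, 1.00000000, 0.07692308,
  0.1296928, 0.04453441, 0.1494253, 0.07692308, 1.00000000
), nrow = 5, byrow = TRUE)

# Symmetric (up to printed rounding), unit diagonal, values in [0, 1]
stopifnot(
  all(abs(p - t(p)) < 1e-7),
  all(diag(p) == 1),
  all(p >= 0 & p <= 1)
)
```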

One way to think of these values is as the proportion of trees in which the two observations end up in the same terminal node. Looking at variable importance, we see that the transformed Alcohol input could possibly be dropped. We will keep it for simplicity:

> randomForest::importance(rf)
            MeanDecreaseGini
Alcohol             3.692748
MalicAcid          12.650096
Ash                10.842885
Alk_ash            11.636227
magnesium          10.672465
T_phenols          17.733783
Flavanoids         21.410838
Non_flav           11.527873
Proantho           14.494229
C_Intensity        14.795900
Hue                14.296274
OD280_315          17.815508
Proline            15.922621
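To see the ranking at a glance, you can sort the MeanDecreaseGini values in base R. The named vector below simply reproduces the printed output:

```r
# MeanDecreaseGini values copied from randomForest::importance(rf)
imp <- c(
  Alcohol = 3.692748, MalicAcid = 12.650096, Ash = 10.842885,
  Alk_ash = 11.636227, magnesium = 10.672465, T_phenols = 17.733783,
  Flavanoids = 21.410838, Non_flav = 11.527873, Proantho = 14.494229,
  C_Intensity = 14.795900, Hue = 14.296274, OD280_315 = 17.815508,
  Proline = 15.922621
)

sort(imp, decreasing = TRUE)  # Flavanoids ranks highest
names(which.min(imp))         # "Alcohol" -- the weakest input
```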

It is now just a matter of creating the dissimilarity matrix, which transforms the proximity values as sqrt(1 - proximity):

> rf_dist <- sqrt(1 - rf$proximity)

> rf_dist[1:2, 1:2]
          1         2
1 0.0000000 0.8493006
2 0.8493006 0.0000000
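You can confirm the transformation against the printed output: pairs with high proximity get small dissimilarities, and vice versa. A quick check using the value for observations 1 and 2:

```r
prox_12 <- 0.2786885                     # rf$proximity[1, 2] from the earlier output
d_12 <- sqrt(1 - prox_12)                # the sqrt(1 - proximity) transformation
stopifnot(abs(d_12 - 0.8493006) < 1e-7)  # matches rf_dist[1, 2] printed above
```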

We now have our dissimilarity matrix, so let's run a PAM clustering as we did earlier:

> set.seed(1776)

> pam_rf <- cluster::pam(rf_dist, k = 3)

> table(pam_rf$clustering)

 1  2  3 
52 82 44 

> table(pam_rf$clustering, wine$Class)

     1  2  3
  1 52  0  0
  2  7 70  5
  3  0  1 43

These results are comparable to those of the other techniques we applied. The lesson learned here? If you have messy data for a clustering problem, consider using random forest to create a distance matrix, and even to eliminate features before running your clustering algorithm.
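One caution when reproducing this: cluster::pam() treats a plain matrix as raw observations unless you tell it otherwise, so it is safer to wrap the dissimilarity matrix in as.dist() or pass diss = TRUE to make the intent explicit. A minimal base-R sketch, using a toy 3 x 3 matrix standing in for rf_dist:

```r
# Toy stand-in for rf_dist: symmetric dissimilarities with a zero diagonal
toy_dist <- matrix(c(0.0, 0.8, 0.9,
                     0.8, 0.0, 0.2,
                     0.9, 0.2, 0.0), nrow = 3)

d <- as.dist(toy_dist)  # lower-triangle "dist" object; pam() treats it as distances
class(d)                # "dist"

# With the real matrix you would then call, for example:
# pam_rf <- cluster::pam(as.dist(rf_dist), k = 3)
# or, equivalently: cluster::pam(rf_dist, k = 3, diss = TRUE)
```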
