Random forest and PAM

To perform this method in R, you can use the randomForest() function. After setting the random seed, simply create the model object. In the following code, I specify 2,000 trees and set proximity to TRUE so that the proximity matrix is returned. Note that you don't have to run this on scaled data, as random forest is insensitive to feature scale:

> set.seed(1918)

> rf <- randomForest::randomForest(x = wine[, -1], ntree = 2000, proximity = T)

> rf

Call:
randomForest(x = wine[, -1], ntree = 2000, proximity = T)
Type of random forest: unsupervised
Number of trees: 2000
No. of variables tried at each split: 3

As you can see, calling rf did not provide much meaningful output beyond the number of variables sampled at each split (mtry). The proximity matrix is N x N, with one row and one column per observation. Let's examine its dimensions and its first five rows and columns:

> dim(rf$proximity)
[1] 178 178

> rf$proximity[1:5, 1:5]
          1          2         3          4          5
1 1.0000000 0.27868852 0.4049296 0.36200717 0.12969283
2 0.2786885 1.00000000 0.2142857 0.12648221 0.04453441
3 0.4049296 0.21428571 1.0000000 0.26865672 0.14942529
4 0.3620072 0.12648221 0.2686567 1.00000000 0.07692308
5 0.1296928 0.04453441 0.1494253 0.07692308 1.00000000
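The proximity matrix is symmetric with ones on the diagonal, since every observation always shares a terminal node with itself. A quick base-R check on the 5 x 5 block printed above (the values are copied from the output, so this sketch does not require refitting the forest):

```r
# 5 x 5 block of rf$proximity, copied from the output above
p <- matrix(c(
  1.0000000, 0.27868852, 0.4049296, 0.36200717, 0.12969283,
  0.2786885, 1.00000000, 0.2142857, 0.12648221, 0.04453441,
  0.4049296, 0.21428571, 1.0000000, 0.26865672, 0.14942529,
  0.3620072, 0.12648221, 0.2686567, 1.00000000, 0.07692308,
  0.1296928, 0.04453441, 0.1494253, 0.07692308, 1.00000000
), nrow = 5, byrow = TRUE)

# Symmetric (up to printed rounding), unit diagonal, values in [0, 1]
stopifnot(
  all(abs(p - t(p)) < 1e-7),
  all(diag(p) == 1),
  all(p >= 0 & p <= 1)
)
```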

One way to think of these values is as the proportion of trees in which the two observations end up in the same terminal node. Looking at variable importance, we see that the transformed Alcohol input could possibly be dropped. We will keep it for simplicity:

> randomForest::importance(rf)
            MeanDecreaseGini
Alcohol             3.692748
MalicAcid          12.650096
Ash                10.842885
Alk_ash            11.636227
magnesium          10.672465
T_phenols          17.733783
Flavanoids         21.410838
Non_flav           11.527873
Proantho           14.494229
C_Intensity        14.795900
Hue                14.296274
OD280_315          17.815508
Proline            15.922621
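To see the ranking at a glance, you can sort the MeanDecreaseGini values in base R. The named vector below simply reproduces the printed output:

```r
# MeanDecreaseGini values copied from randomForest::importance(rf)
imp <- c(
  Alcohol = 3.692748, MalicAcid = 12.650096, Ash = 10.842885,
  Alk_ash = 11.636227, magnesium = 10.672465, T_phenols = 17.733783,
  Flavanoids = 21.410838, Non_flav = 11.527873, Proantho = 14.494229,
  C_Intensity = 14.795900, Hue = 14.296274, OD280_315 = 17.815508,
  Proline = 15.922621
)

sort(imp, decreasing = TRUE)  # Flavanoids ranks highest
names(which.min(imp))         # "Alcohol" -- the weakest input
```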

It is now just a matter of creating the dissimilarity matrix, which transforms the proximity values as sqrt(1 - proximity):

> rf_dist <- sqrt(1 - rf$proximity)

> rf_dist[1:2, 1:2]
          1         2
1 0.0000000 0.8493006
2 0.8493006 0.0000000
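You can confirm the transformation against the printed output: pairs with high proximity get small dissimilarities, and vice versa. A quick check using the value for observations 1 and 2:

```r
prox_12 <- 0.2786885                     # rf$proximity[1, 2] from the earlier output
d_12 <- sqrt(1 - prox_12)                # the sqrt(1 - proximity) transformation
stopifnot(abs(d_12 - 0.8493006) < 1e-7)  # matches rf_dist[1, 2] printed above
```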

We now have our dissimilarity matrix, so let's run a PAM clustering as we did earlier:

> set.seed(1776)

> pam_rf <- cluster::pam(rf_dist, k = 3)

> table(pam_rf$clustering)

 1  2  3 
52 82 44 

> table(pam_rf$clustering, wine$Class)

     1  2  3
  1 52  0  0
  2  7 70  5
  3  0  1 43

These results are comparable to those of the other techniques we applied. The lesson learned here? If you have messy data for a clustering problem, consider using random forest to create a distance matrix, and even to eliminate features before running your clustering algorithm.
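One caution when reproducing this: cluster::pam() treats a plain matrix as raw observations unless you tell it otherwise, so it is safer to wrap the dissimilarity matrix in as.dist() or pass diss = TRUE to make the intent explicit. A minimal base-R sketch, using a toy 3 x 3 matrix standing in for rf_dist:

```r
# Toy stand-in for rf_dist: symmetric dissimilarities with a zero diagonal
toy_dist <- matrix(c(0.0, 0.8, 0.9,
                     0.8, 0.0, 0.2,
                     0.9, 0.2, 0.0), nrow = 3)

d <- as.dist(toy_dist)  # lower-triangle "dist" object; pam() treats it as distances
class(d)                # "dist"

# With the real matrix you would then call, for example:
# pam_rf <- cluster::pam(as.dist(rf_dist), k = 3)
# or, equivalently: cluster::pam(rf_dist, k = 3, diss = TRUE)
```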
