Random forest

As with our motivation for using the Gower metric to handle mixed, and indeed messy, data, we can apply random forest in an unsupervised fashion. Selecting this method has a number of advantages:

  • Robust against outliers and highly skewed variables
  • No need to transform or scale the data
  • Handles mixed data (numeric and factors)
  • Can accommodate missing data
  • Can be used on data with a large number of variables; in fact, it can be used to eliminate useless features by examining variable importance
  • The dissimilarity matrix produced serves as an input to the other techniques discussed earlier (hierarchical, k-means, and PAM)

A couple of words of caution. It may take some trial and error to properly tune the random forest with respect to the number of variables sampled at each tree split (the mtry argument in the function) and the number of trees grown. Studies show that growing more trees improves the results, up to a point, and a good starting point is to grow 2,000 trees (Shi, T. & Horvath, S., 2006).
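As a rough illustration, and not the author's exact code, a minimal sketch of the call using R's randomForest package might look as follows; the data frame df is a hypothetical stand-in for your own unlabeled data, and omitting the response is what runs the function in unsupervised mode (the mechanics of that mode are described next):

    library(randomForest)

    set.seed(123)  # proximities vary from run to run without a seed
    rf <- randomForest(x = df,            # no y supplied: unsupervised mode
                       ntree = 2000,      # the suggested starting point
                       mtry = 4,          # variables sampled at each split; tune this
                       proximity = TRUE)  # needed to recover the proximity matrix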

This is how the algorithm works, given a dataset with no labels (a worked sketch in R follows the list):

  • The current observed data is labeled as class 1
  • A second (synthetic) set of observations is created of the same size as the observed data; it is built by randomly sampling, with replacement, from each of the features of the observed data, so if you have 20 observed features, you will have 20 synthetic features
  • The synthetic portion of the data is labeled as class 2, which facilitates using random forest as an artificial classification problem
  • Create a random forest model to distinguish between the two classes
  • Turn the model's proximity measures of just the observed data (the synthetic data is now discarded) into a dissimilarity matrix
  • Utilize the dissimilarity matrix as the clustering input features
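To make these steps concrete, here is a hedged sketch of the manual version in R, again assuming a hypothetical observed data frame df; the built-in unsupervised mode shown earlier performs the same construction internally:

    library(randomForest)

    set.seed(123)
    n <- nrow(df)

    # Step 2: synthetic data of the same size, with each feature sampled
    # independently (with replacement) from the corresponding observed feature
    synth <- as.data.frame(lapply(df, function(col) sample(col, n, replace = TRUE)))

    # Steps 1 and 3: the observed data is class 1, the synthetic data is class 2
    x <- rbind(df, synth)
    y <- factor(rep(c(1, 2), each = n))

    # Step 4: a random forest model to distinguish between the two classes
    rf <- randomForest(x = x, y = y, ntree = 2000, proximity = TRUE)

    # Step 5: keep only the proximities among the observed rows;
    # the synthetic rows are discarded
    prox <- rf$proximity[1:n, 1:n]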

So what exactly are these proximity measures?

A proximity measure is a pairwise measure between all the observations. For each tree, if two observations end up in the same terminal node, their proximity score for that tree is one; otherwise, it is zero. These scores are summed across all the trees in the forest.

At the termination of the random forest run, the proximity scores for the observed data are normalized by dividing by the total number of trees. The resulting N x N matrix contains scores between zero and one, naturally with the diagonal values all being one. That's all there is to it. It is an effective technique that I believe is underutilized, and one that I wish I had learned years ago.
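Continuing the sketch above: because the proximity matrix returned by randomForest is already normalized by the number of trees, all that remains is to convert it into a dissimilarity matrix and hand it to the techniques discussed earlier. Shi and Horvath (2006) suggest the square root of one minus the proximity as the dissimilarity; the choice of three clusters below is purely illustrative:

    library(cluster)  # for pam()

    # Convert proximities (similarities) into dissimilarities
    diss <- sqrt(1 - prox)  # the Shi & Horvath (2006) convention

    # The dissimilarity matrix serves as input to the earlier techniques
    fit_pam  <- pam(as.dist(diss), k = 3, diss = TRUE)
    fit_hier <- hclust(as.dist(diss), method = "ward.D2")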
