Introduction to random forests

A random forest is a supervised machine learning algorithm based on ensemble learning. It is used for both regression and classification problems. The general idea behind random forests is to build multiple decision trees and aggregate their outputs to get an accurate result. A decision tree is a deterministic algorithm: given the same data, it produces the same tree each time. Decision trees also have a tendency to overfit, because they build the best possible tree for the given data but may fail to generalize to unseen data. All the decision trees that make up a random forest are different, because each tree is built on a different random subset of the data. A random forest therefore tends to be more accurate than a single decision tree because it reduces overfitting.
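As a quick, hands-on way to see this claim in practice, the sketch below compares a single decision tree with a random forest using cross-validation. It relies on scikit-learn and a synthetic dataset; the dataset, the number of trees, and the random seeds are illustrative assumptions rather than anything prescribed here.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, labelled data; any classification dataset would do.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# One fully grown decision tree versus a forest of 100 such trees.
tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("Single tree:", np.mean(cross_val_score(tree, X, y, cv=5)))
print("Forest     :", np.mean(cross_val_score(forest, X, y, cv=5)))

On many datasets the forest's cross-validated score is at least as good as the single tree's, which is the overfitting-reduction effect described above.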

The following diagram demonstrates bootstrap sampling being done from the source sample. Models are built on each of the samples and then the predictions are combined to arrive at a final result:
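To make the bootstrap-sampling step concrete, here is a minimal sketch using NumPy; the toy source sample and the seed are assumptions chosen only for illustration.

import numpy as np

# A toy source sample of ten observations (illustrative only).
rng = np.random.default_rng(seed=0)
source_sample = np.arange(10)

# Each bootstrap sample is the same size as the source and is drawn
# with replacement, so some observations repeat and some are left out.
for i in range(3):
    indices = rng.choice(len(source_sample), size=len(source_sample), replace=True)
    print(f"Bootstrap sample {i + 1}: {source_sample[indices]}")

In a random forest, each such sample is used to train one tree, and the trees' predictions are then combined.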

Each tree in a random forest is built using the following steps, where A is the total number of trees in the forest and a indexes a single tree, for a = 1 to A:

  1. Create a bootstrap sample by drawing from the training data X, Y with replacement; label this sample X_a, Y_a
  2. Train a decision tree f_a on X_a, Y_a
  3. After all A trees are trained, average their predictions (regression) or take the majority vote (classification) to arrive at the final prediction (see the sketch after this list)
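The following sketch walks through these three steps for a regression problem, using scikit-learn decision trees as the base learners. The dataset, the number of trees A, and the seeds are assumptions for illustration. It also covers only the bagging part of a random forest; a full random forest additionally samples a random subset of features at each split (in scikit-learn this corresponds to the max_features setting).

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (illustrative assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(seed=0)

A = 25  # total number of trees in the forest
trees = []
for a in range(A):
    # Step 1: bootstrap sample X_a, y_a drawn with replacement from X, y.
    idx = rng.choice(len(X), size=len(X), replace=True)
    X_a, y_a = X[idx], y[idx]
    # Step 2: train the tree f_a on X_a, y_a.
    tree = DecisionTreeRegressor(random_state=a).fit(X_a, y_a)
    trees.append(tree)

# Step 3: average the A per-tree predictions for some test instances.
X_test = X[:5]
all_predictions = np.stack([tree.predict(X_test) for tree in trees])
forest_prediction = all_predictions.mean(axis=0)
print(forest_prediction)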

In a regression problem, predictions for the test instances are made by taking the mean of the predictions made by all trees. This can be represented as follows: 

\hat{f}(x') = \frac{1}{A}\sum_{a=1}^{A} f_a(x')

Here, A is the total number of trees in the random forest, a = 1 indexes the first tree in the forest, and f_a(x') is the prediction a single tree f_a makes for a test instance x'.

If we have a classification problem, majority voting is used instead: the final prediction is the class predicted by the largest number of trees.
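Majority voting can be sketched with plain NumPy. The per-tree predictions below are made-up values for three trees and four test instances, assumed to use class labels 0 and 1.

import numpy as np

# Hypothetical per-tree class predictions: one row per tree,
# one column per test instance (labels 0 and 1 assumed).
tree_predictions = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
])

# Majority vote: the most common class across trees for each instance.
majority_vote = np.array([np.bincount(col).argmax() for col in tree_predictions.T])
print(majority_vote)  # [0 1 1 0]

In practice, scikit-learn's RandomForestClassifier performs this aggregation internally, averaging the trees' predicted class probabilities rather than taking hard votes.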
