Random forests

We have talked about decision trees, and now it’s time to discuss random forests. At its simplest, a random forest is a collection of decision trees. To build each tree, a random fraction of the total rows and features is selected, and a decision tree is trained on that subset. The predictions from this collection of trees are then aggregated into a single result.
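
To make this concrete, here is a minimal sketch in Python of the idea, not a production implementation: each tree is trained on a random sample of the rows and a random subset of the features, and the trees’ votes are aggregated by majority. The dataset is synthetic, and the helper names `fit_forest` and `predict_forest` are illustrative assumptions, not part of any library.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Synthetic dataset: 500 rows, 30 features.
X, y = make_classification(n_samples=500, n_features=30, random_state=42)

def fit_forest(X, y, n_trees=25, n_features=10):
    """Train each tree on a random sample of rows and a random subset of features."""
    forest = []
    n_rows, n_cols = X.shape
    for _ in range(n_trees):
        rows = rng.choice(n_rows, size=n_rows, replace=True)       # random rows (sampled with replacement)
        cols = rng.choice(n_cols, size=n_features, replace=False)  # random subset of features
        tree = DecisionTreeClassifier(random_state=0).fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))
    return forest

def predict_forest(forest, X):
    """Aggregate the trees' individual predictions into a single result by majority vote."""
    votes = np.array([tree.predict(X[:, cols]) for tree, cols in forest])
    return (votes.mean(axis=0) >= 0.5).astype(int)  # majority vote for binary labels 0/1

forest = fit_forest(X, y)
print(predict_forest(forest, X[:5]))
```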

Random forests can also reduce bias and variance. How do they do this? By training each tree on a different sample of the data and on a random subset of the features. Let’s take an example. Say we have 30 features. A single tree in the forest might only use 10 of them. That leaves 20 features unused, and some of those 20 might be important. Remember, though, that a random forest is a collection of decision trees. Because each tree draws its own random subset of 10 features, most if not all of the features end up being included somewhere in the forest, simply by the law of averages. It is this inclusion that helps limit our error due to bias and variance.
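
If you are using scikit-learn, this idea is exposed through the max_features parameter of RandomForestClassifier, which caps how many features are considered when searching for each split (in scikit-learn the subset is redrawn at every split rather than once per tree). The parameter values below are illustrative, and the dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic dataset with 30 features, matching the example above.
X, y = make_classification(n_samples=500, n_features=30, random_state=42)

# Each split considers only 10 of the 30 features, drawn at random.
model = RandomForestClassifier(n_estimators=100, max_features=10, random_state=42)
model.fit(X, y)

# Even though no single split sees all 30 features, the forest as a whole
# assigns importance to features across the full set.
print(model.feature_importances_.round(3))
```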

For large datasets, the number of trees can grow quite large, sometimes into the tens of thousands or more, depending on the number of features you are using, so you need to be mindful of performance.
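
As a hedged sketch of keeping training time under control, scikit-learn lets you set the number of trees explicitly with n_estimators and train them in parallel with n_jobs; the values below are illustrative rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=30, random_state=42)

# n_estimators controls how many trees are built; more trees give more
# stable predictions but cost proportionally more time and memory.
# n_jobs=-1 trains the trees in parallel across all available CPU cores.
model = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
model.fit(X, y)
print(model.score(X, y))
```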

Here is a diagram of what a random forest might look like:

[Diagram: several decision trees, each trained on a random subset of rows and features, with their outputs aggregated into a single prediction]