Doing classification using random forest

Sometimes one decision tree is not enough, so a set of decision trees is combined to produce a more powerful model. Algorithms that do this are called ensemble learning algorithms, and they are not limited to using decision trees as base models.

The most popular ensemble learning algorithm is random forest. Rather than growing one single tree, it grows K trees, and every tree is trained on a random subset S of the training data. To add a twist, every tree also uses only a random subset of the features. When it comes to making predictions, the trees vote and the majority vote becomes the forest's prediction.
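
To make the mechanics concrete, here is a minimal sketch in Python of that procedure, assuming scikit-learn's DecisionTreeClassifier as the base learner; the training arrays X and y, the tree count K, and the feature-subset size are illustrative placeholders rather than anything prescribed by the recipe:

  import numpy as np
  from sklearn.tree import DecisionTreeClassifier

  rng = np.random.default_rng(42)

  def train_forest(X, y, K=10, n_sub_features=3):
      # Grow K trees; each one gets a bootstrap sample of the rows (a random
      # subset S of the data) and a random subset of the feature columns.
      forest = []
      n_samples, n_features = X.shape
      for _ in range(K):
          rows = rng.integers(0, n_samples, size=n_samples)
          cols = rng.choice(n_features, n_sub_features, replace=False)
          tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
          forest.append((tree, cols))
      return forest

  def predict(forest, x):
      # Each tree votes using only the features it was trained on;
      # the majority vote becomes the forest's prediction.
      votes = [tree.predict(x[cols].reshape(1, -1))[0] for tree, cols in forest]
      return max(set(votes), key=votes.count)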

Let me explain this with an example. The goal is to predict whether a given person has good or bad credit.

To do this, we will provide labeled training data: in this case, people described by features, each labeled as having good or bad credit. We do not want to create feature bias, so each tree gets a randomly selected set of features. There is another reason to provide a randomly selected subset of features: most real-world data has hundreds, if not thousands, of features. Text classification problems, for example, typically involve 50,000 to 100,000 features.

In this case, to add flavor to the story, we are not going to list the features directly; instead, we will ask different people why they think a person has good or bad credit. Different people are naturally exposed to different (sometimes overlapping) aspects of a person, which gives us the same effect as randomly selected subsets of features.

Our first example features Jack, who carries a bad credit label. We start with Joey, who works at Jack's favorite bar, the Elephant Bar. The only way a person can deduce why Jack got this label is by asking yes/no questions, much like the splits of a decision tree. Let's see what Joey says:

  • Question 1: Does Jack tip well? (feature: generosity)

Answer: No

  • Question 2: Does Jack spend at least $60 per visit? (feature: spendthrift)

Answer: Yes

  • Question 3: Does he tend to get into bar fights even at the smallest provocation? (feature: volatile)

Answer: Yes

That explains why Jack has bad credit.

We now ask Jack's girlfriend, Stacey:

  • Question 1: When you hang out, does Jack always cover the bill? (feature: generosity)

Answer: No

  • Question 2: Has Jack paid you back the $500 he owes? (feature: responsibility)

Answer: No

  • Question 3: Does he overspend sometimes just to show off? (feature: spendthrift)

Answer: Yes

That explains why Jack has bad credit.

We now ask Jack's best friend, George:

  • Question 1: When you and Jack hang out at your apartment, does he clean up after himself? (feature: organized)

Answer: No

  • Question 2: Did Jack arrive empty-handed at the Super Bowl potluck? (feature: care)

Answer: Yes

  • Question 3: Has he used the "I forgot my wallet at home" excuse to get you to cover his tab at restaurants? (feature: responsibility)

Answer: Yes

That explains why Jack has bad credit.

Now let's talk about Jessica, who has good credit. We start with Stacey, who happens to be Jessica's sister:

  • Question 1: Whenever you run short of money, does Jessica offer help? (feature: generosity)

Answer: Yes

  • Question 2: Does Jessica pay her bills on time? (feature: responsibility)

Answer: Yes

  • Question 3: Does Jessica offer to babysit your child? (feature: care)

Answer: Yes

That explains why Jessica has good credit.

Now we ask George, who happens to be her husband:

  • Question 1: Does Jessica keep the house tidy? (feature: organized)

Answer: Yes

  • Question 2: Does she expect expensive gifts? (feature: spendthrift)

Answer: No

  • Question 3: Does she get upset when you forget to mow the lawn? (feature: volatile)

Answer: No

That explains why Jessica has good credit.

Now let's ask Joey, the bartender at the Elephant Bar:

  • Question 1: Whenever she comes to the bar with friends, is she mostly the designated driver? (feature: responsibility)

Answer: Yes

  • Question 2: Does she always take leftovers home? (feature: spendthrift)

Answer: Yes

  • Question 3: Does she tip well? (feature: generosity)

Answer: Yes

That explains why Jessica has good credit.

Random forest works by making a random selection at two levels:

  • A subset of the training data for each tree
  • A subset of the features used to split that data

Both of these subsets can overlap from tree to tree.

In our example, we have six features and we are going to assign three features to each tree. This way, there is a good chance we will have an overlap.
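
As a rough illustration in Python (the feature names are taken from the table that follows, and the number of trees shown is arbitrary), each tree could draw its own three features like this, and different trees will frequently share one or two of them:

  import random

  features = ["generosity", "responsibility", "care",
              "organization", "spendthrift", "volatile"]

  random.seed(7)
  for tree_id in range(4):                 # four illustrative trees
      subset = random.sample(features, 3)  # 3 of the 6 features for this tree
      print(f"tree {tree_id}: {subset}")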

Let's add eight more people to our training dataset (in the Label column, 1 means good credit and 0 means bad credit):

Names     Label   Generosity   Responsibility   Care   Organization   Spendthrift   Volatile
Jack      0       0            0                0      0              1             1
Jessica   1       1            1                1      1              0             0
Jenny     0       0            0                1      0              1             1
Rick      1       1            1                0      1              0             0
Pat       0       0            0                0      0              1             1
Jeb       1       1            1                1      0              0             0
Jay       1       0            1                1      1              0             0
Nat       0       1            0                0      0              1             1
Ron       1       0            1                1      1              0             0
Mat       0       1            0                0      0              1             1
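
To tie the table back to an actual model, here is a small sketch using scikit-learn's RandomForestClassifier; the choice of 10 trees, the random_state, and the final test person are illustrative assumptions, while max_features=3 mirrors the three-of-six feature assignment described above:

  import numpy as np
  from sklearn.ensemble import RandomForestClassifier

  # Feature columns: Generosity, Responsibility, Care, Organization, Spendthrift, Volatile
  X = np.array([
      [0, 0, 0, 0, 1, 1],  # Jack
      [1, 1, 1, 1, 0, 0],  # Jessica
      [0, 0, 1, 0, 1, 1],  # Jenny
      [1, 1, 0, 1, 0, 0],  # Rick
      [0, 0, 0, 0, 1, 1],  # Pat
      [1, 1, 1, 0, 0, 0],  # Jeb
      [0, 1, 1, 1, 0, 0],  # Jay
      [1, 0, 0, 0, 1, 1],  # Nat
      [0, 1, 1, 1, 0, 0],  # Ron
      [1, 0, 0, 0, 1, 1],  # Mat
  ])
  y = np.array([0, 1, 0, 1, 0, 1, 1, 0, 1, 0])  # 1 = good credit, 0 = bad credit

  model = RandomForestClassifier(n_estimators=10, max_features=3, random_state=0)
  model.fit(X, y)

  # Predict for a hypothetical new person who is responsible and caring but volatile
  print(model.predict([[0, 1, 1, 0, 0, 1]]))

Because each split considers only three randomly chosen features, different trees in the forest end up asking different questions about the same people, much like Joey, Stacey, and George did.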