Problems

  1. Let us take another example of playing chess from Chapter 2, Naive Bayes. How would you classify a data sample (Warm, Strong, Spring, ?) according to the random forest algorithm?

| Temperature | Wind   | Season | Play |
|-------------|--------|--------|------|
| Cold        | Strong | Winter | No   |
| Warm        | Strong | Autumn | No   |
| Warm        | None   | Summer | Yes  |
| Hot         | None   | Spring | No   |
| Hot         | Breeze | Autumn | Yes  |
| Warm        | Breeze | Spring | Yes  |
| Cold        | Breeze | Winter | No   |
| Cold        | None   | Spring | Yes  |
| Hot         | Strong | Summer | Yes  |
| Warm        | None   | Autumn | Yes  |
| Warm        | Strong | Spring | ?    |

  2. Would it be a good idea to use only one tree in a random forest? Justify your answer.
  3. Can cross-validation improve the results of the classification by the random forest? Justify your answer.

Analysis:

  1. We run the program to construct the random forest and classify the feature (Warm, Strong, Spring).

Input:

source_code/4/chess_with_seasons.csv  
Temperature,Wind,Season,Play  
Cold,Strong,Winter,No  
Warm,Strong,Autumn,No  
Warm,None,Summer,Yes  
Hot,None,Spring,No  
Hot,Breeze,Autumn,Yes  
Warm,Breeze,Spring,Yes  
Cold,Breeze,Winter,No  
Cold,None,Spring,Yes  
Hot,Strong,Summer,Yes  
Warm,None,Autumn,Yes  
Warm,Strong,Spring,? 

Output:

We construct four trees in a random forest:

$ python random_forest.py chess_with_seasons.csv 4 2 > chess_with_seasons.out

The whole construction and analysis is stored in the file source_code/4/chess_with_seasons.out. Your construction may differ because of the randomness involved. From the output, we extract the random forest graph, which consists of the random decision trees built from the random numbers generated during our run.

Executing the command above again will most likely produce a different output and a different random forest graph. Yet the results of the classification should, with high probability, be similar, because of the multiplicity of the random decision trees and their combined voting power. The classification by a single random decision tree may be subject to great variance; the majority vote combines the classifications from all the trees and thus reduces that variance. To verify your understanding, you can compare your classification result with the classification by the random forest graph below.
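If you would like to double-check the stability of this result outside the book's program, here is a minimal sketch using scikit-learn's RandomForestClassifier (an assumption: pandas and scikit-learn are installed; this is not the implementation used above). It builds a four-tree forest many times on the same data and tallies how the sample (Warm, Strong, Spring) is classified across runs; the one-hot encoding is needed only because scikit-learn expects numeric features.

```python
from collections import Counter

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# The ten classified rows from chess_with_seasons.csv.
rows = [
    ("Cold", "Strong", "Winter", "No"),  ("Warm", "Strong", "Autumn", "No"),
    ("Warm", "None",   "Summer", "Yes"), ("Hot",  "None",   "Spring", "No"),
    ("Hot",  "Breeze", "Autumn", "Yes"), ("Warm", "Breeze", "Spring", "Yes"),
    ("Cold", "Breeze", "Winter", "No"),  ("Cold", "None",   "Spring", "Yes"),
    ("Hot",  "Strong", "Summer", "Yes"), ("Warm", "None",   "Autumn", "Yes"),
]
df = pd.DataFrame(rows, columns=["Temperature", "Wind", "Season", "Play"])

# One-hot encode the categorical features; encode the query with the same columns.
X = pd.get_dummies(df[["Temperature", "Wind", "Season"]]).astype(int)
query = pd.get_dummies(
    pd.DataFrame([("Warm", "Strong", "Spring")],
                 columns=["Temperature", "Wind", "Season"])
).reindex(columns=X.columns, fill_value=0).astype(int)

# Build 20 independent four-tree forests and tally their predictions.
predictions = Counter()
for _ in range(20):
    clf = RandomForestClassifier(n_estimators=4)
    clf.fit(X, df["Play"])
    predictions[clf.predict(query)[0]] += 1
print(predictions)  # how often each class was predicted across the 20 runs
```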

Random forest graph and classification:

Let's have a look at the output of the random forest graph and the classification of the feature:

Tree 0:
Root
├── [Wind=None]
│   ├── [Temperature=Cold]
│   │   └── [Play=Yes]
│   └── [Temperature=Warm]
│       ├── [Season=Autumn]
│       │   └── [Play=Yes]
│       └── [Season=Summer]
│           └── [Play=Yes]
└── [Wind=Strong]
    ├── [Temperature=Cold]
    │   └── [Play=No]
    └── [Temperature=Warm]
        └── [Play=No]

Tree 1:
Root
├── [Season=Autumn]
│   ├── [Wind=Strong]
│   │   └── [Play=No]
│   ├── [Wind=None]
│   │   └── [Play=Yes]
│   └── [Wind=Breeze]
│       └── [Play=Yes]
├── [Season=Summer]
│   └── [Play=Yes]
├── [Season=Winter]
│   └── [Play=No]
└── [Season=Spring]
    ├── [Temperature=Cold]
    │   └── [Play=Yes]
    └── [Temperature=Warm]
        └── [Play=Yes]

Tree 2:
Root
├── [Season=Autumn]
│   ├── [Temperature=Hot]
│   │   └── [Play=Yes]
│   └── [Temperature=Warm]
│       └── [Play=No]
├── [Season=Spring]
│   ├── [Temperature=Cold]
│   │   └── [Play=Yes]
│   └── [Temperature=Warm]
│       └── [Play=Yes]
├── [Season=Winter]
│   └── [Play=No]
└── [Season=Summer]
    ├── [Temperature=Hot]
    │   └── [Play=Yes]
    └── [Temperature=Warm]
        └── [Play=Yes]

Tree 3:
Root
├── [Season=Autumn]
│   ├── [Wind=Breeze]
│   │   └── [Play=Yes]
│   ├── [Wind=None]
│   │   └── [Play=Yes]
│   └── [Wind=Strong]
│       └── [Play=No]
├── [Season=Spring]
│   ├── [Temperature=Cold]
│   │   └── [Play=Yes]
│   └── [Temperature=Warm]
│       └── [Play=Yes]
├── [Season=Winter]
│   └── [Play=No]
└── [Season=Summer]
    └── [Play=Yes]
The total number of trees in the random forest = 4.
The maximum number of the variables considered at the node is m = 4.
Classification feature: ['Warm', 'Strong', 'Spring', '?']

Tree 0 votes for the class: No
Tree 1 votes for the class: Yes
Tree 2 votes for the class: Yes
Tree 3 votes for the class: Yes

The class with the maximum number of votes is 'Yes'. Thus the constructed random forest classifies the feature ['Warm', 'Strong', 'Spring', '?'] into the class 'Yes'.
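The tally itself is simple. Here is a tiny illustration, assuming nothing beyond the Python standard library: the four votes combined give 'Yes', even though Tree 0 on its own would have answered 'No'.

```python
from collections import Counter

votes = {0: "No", 1: "Yes", 2: "Yes", 3: "Yes"}   # tree index -> vote from the run above
majority_class, count = Counter(votes.values()).most_common(1)[0]
print(f"Majority vote: {majority_class} ({count} of {len(votes)} trees)")
print(f"Tree 0 alone would have answered: {votes[0]}")
```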
  2. When we construct a tree in a random forest, we use only a random subset of the data, sampled with replacement (a short sketch of this bootstrap sampling follows after this list). This is done to reduce the bias of the classifier towards certain features. However, if we use only one tree, that tree may happen to be built from a biased sample and may miss a feature that is important for an accurate classification. So, a random forest classifier with only one decision tree would likely lead to a very poor classification. Therefore, we should construct more decision trees in a random forest to benefit from the reduction of bias and variance in the classification.
  3. During cross-validation, we divide the data into training and testing data. The training data is used to train the classifier, and the test data is used to evaluate which parameters or methods would best improve the classification. Another advantage of cross-validation is a reduction in bias: because only part of the data is used for training at any one time, the chance of overfitting to the specific dataset decreases.

However, a random forest addresses the problems that cross-validation addresses in an alternative way. Each random decision tree is constructed only on a subset of the data, which reduces the chance of overfitting. In the end, the classification is the combination of the results from all of these trees; the final decision is made not by tuning parameters on a test dataset, but by taking the majority vote of all the trees, whose bias is already reduced.

Hence, cross-validation would not be of much use for the random forest algorithm, as something similar is already intrinsic to the algorithm.
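The bootstrap sampling mentioned in answer 2 can be sketched in a few lines of plain Python (the variable names here are illustrative, not taken from the book's implementation): each tree draws as many rows as the dataset has, with replacement, so some rows repeat, others are left out, and every tree therefore sees slightly different data.

```python
import random

data = [
    ("Cold", "Strong", "Winter", "No"),  ("Warm", "Strong", "Autumn", "No"),
    ("Warm", "None",   "Summer", "Yes"), ("Hot",  "None",   "Spring", "No"),
    ("Hot",  "Breeze", "Autumn", "Yes"), ("Warm", "Breeze", "Spring", "Yes"),
    ("Cold", "Breeze", "Winter", "No"),  ("Cold", "None",   "Spring", "Yes"),
    ("Hot",  "Strong", "Summer", "Yes"), ("Warm", "None",   "Autumn", "Yes"),
]

num_trees = 4
for i in range(num_trees):
    # Sample len(data) rows with replacement: the bootstrap sample for tree i.
    bootstrap_sample = random.choices(data, k=len(data))
    left_out = [row for row in data if row not in bootstrap_sample]
    print(f"Tree {i}: {len(set(bootstrap_sample))} distinct rows used, "
          f"{len(left_out)} rows left out")
```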
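If an explicit error estimate is still wanted, a hedged sketch with scikit-learn (again assuming pandas and scikit-learn are installed, and reusing the chess_with_seasons.csv file listed above) shows both options side by side: k-fold cross-validation, and the out-of-bag estimate that a random forest already gets for free from its bootstrap sampling.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Load the data, strip stray whitespace, and drop the unclassified row.
df = pd.read_csv("source_code/4/chess_with_seasons.csv").apply(lambda col: col.str.strip())
df = df[df["Play"].isin(["Yes", "No"])]
X = pd.get_dummies(df[["Temperature", "Wind", "Season"]]).astype(int)
y = df["Play"]

# Explicit cross-validation (only two folds, since there are just ten rows).
scores = cross_val_score(RandomForestClassifier(n_estimators=50), X, y, cv=2)
print("cross-validation accuracy:", scores.mean())

# The intrinsic alternative: each tree is evaluated on the rows that its
# bootstrap sample left out (the out-of-bag samples).
clf = RandomForestClassifier(n_estimators=50, oob_score=True).fit(X, y)
print("out-of-bag accuracy:", clf.oob_score_)
```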
