We will use the example from the previous chapter about swim preference. We have the same data table:
Swimming suit | Water temperature | Swim preference
None | Cold | No
None | Warm | No
Small | Cold | No
Small | Warm | No
Good | Cold | No
Good | Warm | Yes
We would like to construct a random forest from this data and use it to classify an item (Good,Cold,?).
Analysis:
We are given M=3 variables according to which a data item can be classified. In a random forest algorithm, we usually do not use all three variables to form the tree branches at each node; we use only m variables out of M, where m is chosen so that m is less than or equal to M. The greater m is, the stronger the classifier is in each constructed tree. However, as mentioned earlier, more data leads to more bias. Because we use multiple trees (each with a smaller m), even if every constructed tree is a weak classifier, their combined classification accuracy is strong. Since we want to reduce bias in a random forest, we may want to choose the parameter m to be slightly less than M.
Thus we choose the maximum number of variables considered at a node to be m=min(M,math.ceil(2*math.sqrt(M)))=min(3,math.ceil(2*math.sqrt(3)))=min(3,4)=3.
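This calculation can be verified directly in Python, using the same `math` functions the formula refers to:

```python
import math

M = 3  # total number of variables in the data table

# Maximum number of variables considered at each node of a random tree.
m = min(M, math.ceil(2 * math.sqrt(M)))

print(m)  # -> 3, since ceil(2*sqrt(3)) = ceil(3.46...) = 4 and min(3, 4) = 3
```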
We are given the following features:
[['None', 'Cold', 'No'], ['None', 'Warm', 'No'], ['Small', 'Cold', 'No'], ['Small', 'Warm', 'No'], ['Good', 'Cold', 'No'], ['Good', 'Warm', 'Yes']]
When constructing a random decision tree as part of the random forest, we will choose only a random subset of these features, sampling with replacement.