Swim preference - analysis with random forest

We will use the swim preference example from the previous chapter, with the same data table:

Swimming suit | Water temperature | Swim preference
None          | Cold              | No
None          | Warm              | No
Small         | Cold              | No
Small         | Warm              | No
Good          | Cold              | No
Good          | Warm              | Yes

We would like to construct a random forest from this data and use it to classify an item (Good, Cold, ?).

Analysis:

We are given M=3 variables according to which a feature can be classified. In a random forest algorithm, we usually do not use all three variables to form tree branches at each node; we use only m variables out of M, where m is chosen to be less than or equal to M. The greater m is, the stronger the classifier in each constructed tree becomes. However, as mentioned earlier, using more variables also introduces more bias. But because we use multiple trees, even if each tree constructed with a smaller m is a weak classifier, their combined classification accuracy is strong. Since we want to reduce the bias in a random forest, we may want to consider choosing the parameter m to be slightly less than M.
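
To make the per-node selection concrete, here is a minimal sketch in Python, assuming the standard random module; the variable names and the helper variables_for_node are hypothetical, used only for illustration:

```python
import random

# The M = 3 variables, as counted in the analysis above.
VARIABLES = ['swimming suit', 'water temperature', 'swim preference']

def variables_for_node(variables, m):
    # At each node, consider only m randomly chosen variables
    # (without replacement) when searching for the best split.
    return random.sample(variables, m)

# With m = 2, a node might consider, for example,
# ['water temperature', 'swimming suit'].
print(variables_for_node(VARIABLES, 2))
```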

Thus, we choose the maximum number of variables considered at a node to be m = min(M, math.ceil(2*math.sqrt(M))) = min(3, math.ceil(2*math.sqrt(3))) = min(3, 4) = 3.
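
The same computation, spelled out in Python using the math.ceil and math.sqrt calls from the formula above:

```python
import math

M = 3  # number of variables, as given in the analysis
m = min(M, math.ceil(2 * math.sqrt(M)))
# 2 * sqrt(3) is approximately 3.464, so ceil(...) = 4,
# and min(3, 4) gives m = 3.
print(m)  # prints 3
```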

We are given the following features:

[['None', 'Cold', 'No'], ['None', 'Warm', 'No'], ['Small', 'Cold', 'No'], ['Small', 'Warm', 'No'], ['Good', 'Cold', 'No'], ['Good', 'Warm', 'Yes']]

When constructing a random decision tree as part of a random forest, we will choose only a subset of these features, at random and with replacement.
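
A minimal sketch of this sampling with replacement, assuming Python's standard random module; drawing as many rows as the dataset contains is an assumption, chosen to mirror the usual bootstrap:

```python
import random

features = [['None', 'Cold', 'No'], ['None', 'Warm', 'No'],
            ['Small', 'Cold', 'No'], ['Small', 'Warm', 'No'],
            ['Good', 'Cold', 'No'], ['Good', 'Warm', 'Yes']]

def bootstrap_sample(data):
    # Draw len(data) rows uniformly at random, with replacement,
    # so some rows may repeat and others may be left out.
    return [random.choice(data) for _ in range(len(data))]

print(bootstrap_sample(features))
```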
