Decision tree working methodology from first principles

In the following example, the response variable has only two classes: whether to play tennis or not. The following table has been compiled from the conditions recorded on 14 days. Our task is to find out which input variable separates the two output classes, YES and NO, most significantly.

  1. This example comes under the classification tree, as the response is categorical:

| Day | Outlook  | Temperature | Humidity | Wind   | Play tennis |
|-----|----------|-------------|----------|--------|-------------|
| D1  | Sunny    | Hot         | High     | Weak   | No          |
| D2  | Sunny    | Hot         | High     | Strong | No          |
| D3  | Overcast | Hot         | High     | Weak   | Yes         |
| D4  | Rain     | Mild        | High     | Weak   | Yes         |
| D5  | Rain     | Cool        | Normal   | Weak   | Yes         |
| D6  | Rain     | Cool        | Normal   | Strong | No          |
| D7  | Overcast | Cool        | Normal   | Strong | Yes         |
| D8  | Sunny    | Mild        | High     | Weak   | No          |
| D9  | Sunny    | Cool        | Normal   | Weak   | Yes         |
| D10 | Rain     | Mild        | Normal   | Weak   | Yes         |
| D11 | Sunny    | Mild        | Normal   | Strong | Yes         |
| D12 | Overcast | Mild        | High     | Strong | Yes         |
| D13 | Overcast | Hot         | Normal   | Weak   | Yes         |
| D14 | Rain     | Mild        | High     | Strong | No          |

  1. Taking the Humidity variable as an example to classify the Play tennis field:
    • CHAID: Humidity has two categories (High and Normal). To measure how distinguishing the variable is, the expected counts of each class are distributed evenly across both categories:

Calculating the χ² (chi-square) value:

From the table, the High category has 4 No and 3 Yes days, and the Normal category has 1 No and 6 Yes days. Splitting the overall totals (5 No, 9 Yes) evenly, the expected counts are 2.5 No and 4.5 Yes per category, which gives χ² = 2.8.

Calculating degrees of freedom = (r - 1) * (c - 1)

Where r = number of categories of the predictor variable (rows) and c = number of classes of the response variable (columns).

Here, there are two row categories (High and Normal) and two column categories (No and Yes).

Hence, degrees of freedom = (2 - 1) * (2 - 1) = 1

The p-value for a chi-square value of 2.8 with 1 d.f. is 0.0942.

The p-value can be obtained with the following Excel formula: =CHIDIST(2.8, 1) = 0.0942

In a similar way, we will calculate the p-value for all variables and select the variable with the lowest p-value as the best splitter.
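The CHAID step above can be sketched in Python using only the standard library. The counts are read off the data table, the expected counts follow the chapter's "evenly distributed" convention (each class total split equally across the categories), and the 1-d.f. chi-square tail probability is obtained via the `erfc` identity, matching Excel's CHIDIST:

```python
import math

# Observed (No, Yes) counts per Humidity category, read off the data table
observed = {"High": (4, 3), "Normal": (1, 6)}

# Expected counts per the chapter's convention: each class total
# (5 No, 9 Yes overall) is split evenly across the two categories.
expected_no, expected_yes = 5 / 2, 9 / 2

chi2 = sum(
    (no - expected_no) ** 2 / expected_no + (yes - expected_yes) ** 2 / expected_yes
    for no, yes in observed.values()
)

# Upper-tail probability of a chi-square with 1 d.f. via the erfc identity
p_value = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 1), round(p_value, 4))  # chi-square = 2.8, p ≈ 0.094
```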

  • ENTROPY:

Entropy = - Σ p_i * log2(p_i)

Information gain is the entropy of the response before the split minus the weighted average entropy of the categories after the split. In a similar way, we will calculate the information gain for all variables and select the variable with the highest information gain.
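The entropy step can be sketched as follows; the class counts are taken from the Play tennis table (5 No / 9 Yes overall, split by Humidity into High with 4 No / 3 Yes and Normal with 1 No / 6 Yes):

```python
import math

def entropy(counts):
    """Entropy = -sum(p * log2(p)) over the class proportions."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

base = entropy([5, 9])  # entropy of Play tennis before any split
split = {"High": [4, 3], "Normal": [1, 6]}  # (No, Yes) per Humidity category
n = 14

# Information gain = base entropy - weighted entropy after the split
weighted = sum(sum(c) / n * entropy(c) for c in split.values())
info_gain = base - weighted
print(round(info_gain, 4))  # ≈ 0.1518, matching the comparison table
```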

  • GINI:

Gini = 1 - Σ p_i^2

The expected Gini is the weighted average of the Gini values of the categories after the split. In a similar way, we will calculate the expected Gini for all variables and select the variable with the lowest expected value.
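The Gini step can be sketched the same way, again with the (No, Yes) counts per Humidity category taken from the data table; the result agrees with the chapter's 0.3669 up to intermediate rounding:

```python
def gini(counts):
    """Gini = 1 - sum(p^2) over the class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

split = {"High": [4, 3], "Normal": [1, 6]}  # (No, Yes) per Humidity category
n = 14

# Expected Gini = weighted average of the per-category Gini values
expected_gini = sum(sum(c) / n * gini(c) for c in split.values())
print(round(expected_gini, 4))  # ≈ 0.3673 (the chapter rounds to 0.3669)
```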

For the purpose of a better understanding, we will also do the same calculations for the Wind variable:

  • CHAID: Wind has two categories (Weak and Strong); with the expected class counts again distributed evenly, the chi-square calculation gives a p-value of 0.2733.
  • ENTROPY: the information gain for Wind works out to 0.0482.
  • GINI: the expected Gini for Wind works out to 0.4285.

Now we will compare both variables for all three metrics so that we can understand them better.

| Variables | CHAID (p-value) | Entropy (information gain) | Gini (expected value) |
|-----------|-----------------|----------------------------|-----------------------|
| Humidity  | 0.0942          | 0.1518                     | 0.3669                |
| Wind      | 0.2733          | 0.0482                     | 0.4285                |
| Better    | Low value       | High value                 | Low value             |

On all three measures, Humidity proves to be a better classifier than Wind; hence, we can confirm that all three methods tell a similar story.
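The whole comparison can be reproduced end to end. This sketch recomputes all three metrics for Humidity and Wind from the counts in the data table (using the chapter's evenly distributed expected counts for CHAID) and confirms that Humidity wins on every metric:

```python
import math

# (No, Yes) counts per category, read off the Play tennis table
splits = {
    "Humidity": {"High": (4, 3), "Normal": (1, 6)},
    "Wind": {"Weak": (2, 6), "Strong": (3, 3)},
}
totals = (5, 9)  # 5 No, 9 Yes over all 14 days
n = sum(totals)

def entropy(counts):
    t = sum(counts)
    return -sum(c / t * math.log2(c / t) for c in counts if c)

def gini(counts):
    t = sum(counts)
    return 1 - sum((c / t) ** 2 for c in counts)

results = {}
for var, cats in splits.items():
    # CHAID: expected class counts split evenly across the categories
    exp = [t / len(cats) for t in totals]
    chi2 = sum((o - e) ** 2 / e for obs in cats.values() for o, e in zip(obs, exp))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # tail probability, 1 d.f.
    info_gain = entropy(totals) - sum(sum(c) / n * entropy(c) for c in cats.values())
    exp_gini = sum(sum(c) / n * gini(c) for c in cats.values())
    results[var] = (p_value, info_gain, exp_gini)

for var, (p, ig, g) in results.items():
    print(f"{var:8s}  p-value={p:.4f}  gain={ig:.4f}  gini={g:.4f}")
```

Humidity should come out with the lower p-value, the higher information gain, and the lower expected Gini, in line with the comparison table above.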
