Decision tree working methodology from first principles

In the following example, the response variable has only two classes: whether to play tennis or not. The following table has been compiled from the conditions recorded on 14 days. Our task is to find out which input variable separates the two output classes, YES and NO, most significantly.

  1. This example comes under the classification tree, as the response is categorical:

| Day | Outlook  | Temperature | Humidity | Wind   | Play tennis |
|-----|----------|-------------|----------|--------|-------------|
| D1  | Sunny    | Hot         | High     | Weak   | No          |
| D2  | Sunny    | Hot         | High     | Strong | No          |
| D3  | Overcast | Hot         | High     | Weak   | Yes         |
| D4  | Rain     | Mild        | High     | Weak   | Yes         |
| D5  | Rain     | Cool        | Normal   | Weak   | Yes         |
| D6  | Rain     | Cool        | Normal   | Strong | No          |
| D7  | Overcast | Cool        | Normal   | Strong | Yes         |
| D8  | Sunny    | Mild        | High     | Weak   | No          |
| D9  | Sunny    | Cool        | Normal   | Weak   | Yes         |
| D10 | Rain     | Mild        | Normal   | Weak   | Yes         |
| D11 | Sunny    | Mild        | Normal   | Strong | Yes         |
| D12 | Overcast | Mild        | High     | Strong | Yes         |
| D13 | Overcast | Hot         | Normal   | Weak   | Yes         |
| D14 | Rain     | Mild        | High     | Strong | No          |

  1. Taking the Humidity variable as an example to classify the Play tennis field:
    • CHAID: Humidity has two categories (High and Normal). To measure how distinguishing the variable is, the expected counts of each class are distributed evenly across both categories:

Calculating the χ² (chi-square) value:

From the table, the High category has 4 No and 3 Yes days, and the Normal category has 1 No and 6 Yes days. Splitting the overall totals (5 No, 9 Yes) evenly, the expected counts are 2.5 No and 4.5 Yes per category, which gives χ² = 2.8.

Calculating degrees of freedom = (r - 1) * (c - 1)

Where r = number of categories of the predictor variable (rows) and c = number of classes of the response variable (columns).

Here, there are two row categories (High and Normal) and two column categories (No and Yes).

Hence, degrees of freedom = (2 - 1) * (2 - 1) = 1

The p-value for a chi-square value of 2.8 with 1 d.f. is 0.0942.

The p-value can be obtained with the following Excel formula: =CHIDIST(2.8, 1) = 0.0942

In a similar way, we will calculate the p-value for all variables and select the variable with the lowest p-value as the best splitter.
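The CHAID step above can be sketched in Python using only the standard library. The counts are read off the data table, the expected counts follow the chapter's "evenly distributed" convention (each class total split equally across the categories), and the 1-d.f. chi-square tail probability is obtained via the `erfc` identity, matching Excel's CHIDIST:

```python
import math

# Observed (No, Yes) counts per Humidity category, read off the data table
observed = {"High": (4, 3), "Normal": (1, 6)}

# Expected counts per the chapter's convention: each class total
# (5 No, 9 Yes overall) is split evenly across the two categories.
expected_no, expected_yes = 5 / 2, 9 / 2

chi2 = sum(
    (no - expected_no) ** 2 / expected_no + (yes - expected_yes) ** 2 / expected_yes
    for no, yes in observed.values()
)

# Upper-tail probability of a chi-square with 1 d.f. via the erfc identity
p_value = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 1), round(p_value, 4))  # chi-square = 2.8, p ≈ 0.094
```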

  • ENTROPY:

Entropy = - Σ p_i * log2(p_i)

Information gain is the entropy of the response before the split minus the weighted average entropy of the categories after the split. In a similar way, we will calculate the information gain for all variables and select the variable with the highest information gain.
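The entropy step can be sketched as follows; the class counts are taken from the Play tennis table (5 No / 9 Yes overall, split by Humidity into High with 4 No / 3 Yes and Normal with 1 No / 6 Yes):

```python
import math

def entropy(counts):
    """Entropy = -sum(p * log2(p)) over the class proportions."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

base = entropy([5, 9])  # entropy of Play tennis before any split
split = {"High": [4, 3], "Normal": [1, 6]}  # (No, Yes) per Humidity category
n = 14

# Information gain = base entropy - weighted entropy after the split
weighted = sum(sum(c) / n * entropy(c) for c in split.values())
info_gain = base - weighted
print(round(info_gain, 4))  # ≈ 0.1518, matching the comparison table
```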

  • GINI:

Gini = 1 - Σ p_i^2

The expected Gini is the weighted average of the Gini values of the categories after the split. In a similar way, we will calculate the expected Gini for all variables and select the variable with the lowest expected value.
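The Gini step can be sketched the same way, again with the (No, Yes) counts per Humidity category taken from the data table; the result agrees with the chapter's 0.3669 up to intermediate rounding:

```python
def gini(counts):
    """Gini = 1 - sum(p^2) over the class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

split = {"High": [4, 3], "Normal": [1, 6]}  # (No, Yes) per Humidity category
n = 14

# Expected Gini = weighted average of the per-category Gini values
expected_gini = sum(sum(c) / n * gini(c) for c in split.values())
print(round(expected_gini, 4))  # ≈ 0.3673 (the chapter rounds to 0.3669)
```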

For the purpose of a better understanding, we will also do the same calculations for the Wind variable:

  • CHAID: Wind has two categories (Weak and Strong); with the expected class counts again distributed evenly, the chi-square calculation gives a p-value of 0.2733.
  • ENTROPY: the information gain for Wind works out to 0.0482.
  • GINI: the expected Gini for Wind works out to 0.4285.

Now we will compare both variables for all three metrics so that we can understand them better.

| Variables | CHAID (p-value) | Entropy (information gain) | Gini (expected value) |
|-----------|-----------------|----------------------------|-----------------------|
| Humidity  | 0.0942          | 0.1518                     | 0.3669                |
| Wind      | 0.2733          | 0.0482                     | 0.4285                |
| Better    | Low value       | High value                 | Low value             |

On all three measures, Humidity proves to be a better classifier than Wind; hence, we can confirm that all three methods tell a similar story.
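The whole comparison can be reproduced end to end. This sketch recomputes all three metrics for Humidity and Wind from the counts in the data table (using the chapter's evenly distributed expected counts for CHAID) and confirms that Humidity wins on every metric:

```python
import math

# (No, Yes) counts per category, read off the Play tennis table
splits = {
    "Humidity": {"High": (4, 3), "Normal": (1, 6)},
    "Wind": {"Weak": (2, 6), "Strong": (3, 3)},
}
totals = (5, 9)  # 5 No, 9 Yes over all 14 days
n = sum(totals)

def entropy(counts):
    t = sum(counts)
    return -sum(c / t * math.log2(c / t) for c in counts if c)

def gini(counts):
    t = sum(counts)
    return 1 - sum((c / t) ** 2 for c in counts)

results = {}
for var, cats in splits.items():
    # CHAID: expected class counts split evenly across the categories
    exp = [t / len(cats) for t in totals]
    chi2 = sum((o - e) ** 2 / e for obs in cats.values() for o, e in zip(obs, exp))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # tail probability, 1 d.f.
    info_gain = entropy(totals) - sum(sum(c) / n * entropy(c) for c in cats.values())
    exp_gini = sum(sum(c) / n * gini(c) for c in cats.values())
    results[var] = (p_value, info_gain, exp_gini)

for var, (p, ig, g) in results.items():
    print(f"{var:8s}  p-value={p:.4f}  gain={ig:.4f}  gini={g:.4f}")
```

Humidity should come out with the lower p-value, the higher information gain, and the lower expected Gini, in line with the comparison table above.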
