- What is the information entropy of the following multisets?
a) {1,2}, b) {1,2,3}, c) {1,2,3,4}, d) {1,1,2,2}, e) {1,1,2,3}
- What is the information entropy of the probability space induced by a biased coin that shows heads with probability 10% and tails with probability 90%?
- Let us revisit the chess-playing example from Chapter 2, Naive Bayes:
- a) What is the information gain for each of the non-classifying attributes in the table?
- b) What is the decision tree constructed from the given table?
- c) How would you classify a data sample (warm,strong,spring,?) according to the constructed decision tree?
| Temperature | Wind | Season | Play |
|-------------|--------|--------|------|
| Cold | Strong | Winter | No |
| Warm | Strong | Autumn | No |
| Warm | None | Summer | Yes |
| Hot | None | Spring | No |
| Hot | Breeze | Autumn | Yes |
| Warm | Breeze | Spring | Yes |
| Cold | Breeze | Winter | No |
| Cold | None | Spring | Yes |
| Hot | Strong | Summer | Yes |
| Warm | None | Autumn | Yes |
| Warm | Strong | Spring | ? |
- Mary and her temperature preferences. Let us take the example from Chapter 1, Classification Using K Nearest Neighbors, about Mary's temperature preferences.
| Temperature in degrees Celsius | Wind speed in km/h | Mary's perception |
|--------------------------------|--------------------|-------------------|
| 10 | 0 | Cold |
| 25 | 0 | Warm |
| 15 | 5 | Cold |
| 20 | 3 | Warm |
| 18 | 7 | Cold |
| 20 | 10 | Cold |
| 22 | 5 | Warm |
| 24 | 6 | Warm |
We would like to use decision trees to decide whether our friend Mary would feel warm or cold in a room with a temperature of 16 degrees Celsius and a fan producing a wind speed of 3 km/h.
Can you explain how a decision tree algorithm could be used here and how well it would perform on this example?
Analysis:
- Here are the entropies of the multisets:
a) E({1,2})=-(1/2)*log2(1/2)-(1/2)*log2(1/2)=1
b) E({1,2,3})=-(1/3)*log2(1/3)-(1/3)*log2(1/3)-(1/3)*log2(1/3)=1.5849625
c) E({1,2,3,4})=-(1/4)*log2(1/4)-(1/4)*log2(1/4)-(1/4)*log2(1/4)-(1/4)*log2(1/4)=2
d) E({1,1,2,2})=-(2/4)*log2(2/4)-(2/4)*log2(2/4)=1
e) E({1,1,2,3})=-(2/4)*log2(2/4)-(1/4)*log2(1/4)-(1/4)*log2(1/4)=1.5
Note that the information entropy of these multisets with more than two classes is greater than 1, so we need more than one bit of information to represent an outcome. But is this true for every multiset that has more than two classes of elements?
- E({10% of heads, 90% of tails})=-0.1*log2(0.1)-(0.9)*log2(0.9)=0.46899559358
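As a quick check of these values, here is a minimal Python sketch (the helper names entropy and entropy_probs are ours) that computes information entropy from element frequencies; its last line also probes the question above with a heavily skewed three-class multiset:

```python
from collections import Counter
from math import log2

def entropy(multiset):
    """Information entropy of a multiset, in bits."""
    counts = Counter(multiset)
    total = len(multiset)
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(entropy([1, 2]))        # 1.0
print(entropy([1, 2, 3]))     # 1.5849625...
print(entropy([1, 2, 3, 4]))  # 2.0
print(entropy([1, 1, 2, 2]))  # 1.0
print(entropy([1, 1, 2, 3]))  # 1.5

def entropy_probs(probs):
    """Entropy of a probability distribution given directly."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy_probs([0.1, 0.9]))   # 0.4689955935... (the biased coin)
print(entropy([1] * 14 + [2, 3]))  # 0.6686...: three classes, below 1 bit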
- a) The information gains for the three attributes are as follows:
IG(S,temperature)=0.0954618442383
IG(S,wind)=0.0954618442383
IG(S,season)=0.419973094022
b) Therefore, we would choose the attribute season to branch on from the root node, as it has the highest information gain. Alternatively, we can put all the input data into the program to construct the decision tree:
├── [Season=Autumn]
│   ├── [Wind=Breeze]
│   │   └── [Play=Yes]
│   ├── [Wind=Strong]
│   │   └── [Play=No]
│   └── [Wind=None]
│       └── [Play=Yes]
├── [Season=Summer]
│   ├── [Temperature=Hot]
│   │   └── [Play=Yes]
│   └── [Temperature=Warm]
│       └── [Play=Yes]
├── [Season=Winter]
│   └── [Play=No]
└── [Season=Spring]
    ├── [Temperature=Hot]
    │   └── [Play=No]
    ├── [Temperature=Warm]
    │   └── [Play=Yes]
    └── [Temperature=Cold]
        └── [Play=Yes]
c) According to the constructed decision tree, we would classify the data sample (warm,strong,spring,?) into the class Play=Yes by taking the Season=Spring branch from the root node and then the Temperature=Warm branch to the leaf node.
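As a sanity check on the information gains in part a), here is a short Python sketch (the function names and the row encoding are ours) that recomputes them from the ten classified rows of the table:

```python
from collections import Counter
from math import log2

# (temperature, wind, season, play) - the ten classified rows of the table
rows = [
    ("Cold", "Strong", "Winter", "No"),
    ("Warm", "Strong", "Autumn", "No"),
    ("Warm", "None",   "Summer", "Yes"),
    ("Hot",  "None",   "Spring", "No"),
    ("Hot",  "Breeze", "Autumn", "Yes"),
    ("Warm", "Breeze", "Spring", "Yes"),
    ("Cold", "Breeze", "Winter", "No"),
    ("Cold", "None",   "Spring", "Yes"),
    ("Hot",  "Strong", "Summer", "Yes"),
    ("Warm", "None",   "Autumn", "Yes"),
]

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, attr_index):
    labels = [r[-1] for r in rows]
    gain = entropy(labels)
    for value in set(r[attr_index] for r in rows):
        subset = [r[-1] for r in rows if r[attr_index] == value]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain

for i, name in enumerate(["temperature", "wind", "season"]):
    print(f"IG(S,{name}) = {information_gain(rows, i):.10f}")
# IG(S,temperature) = 0.0954618442
# IG(S,wind)        = 0.0954618442
# IG(S,season)      = 0.4199730940
```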
- Here the decision tree algorithm may not perform well without any preprocessing of the data. If we treated every distinct temperature value as its own class, then 16 degrees Celsius would not occur in the decision tree, as it is not in the input data, so we would not be able to classify how Mary would feel at 16 degrees Celsius with a 3 km/h wind speed.
We could instead divide the temperature and the wind speed into intervals in order to reduce the number of classes, so that the resulting decision tree could classify the input instance. But this division, deciding into which intervals 16 degrees Celsius and 3 km/h should fall, is the fundamental part of the analysis for this type of problem. Thus decision trees without serious modification could not analyze the problem well.
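To illustrate the interval approach, here is a minimal Python sketch; the cut-off values (17 degrees for temperature, 8 km/h for wind speed) and the helper names are illustrative assumptions of ours, not values given in the chapter:

```python
from collections import Counter, defaultdict

# Discretize the numeric attributes into intervals so that unseen
# values such as 16 degrees Celsius or 3 km/h fall into known classes.
# The cut-offs (17 degrees, 8 km/h) are illustrative assumptions only.
def temperature_class(t):
    return "low" if t < 17 else "high"

def wind_class(w):
    return "weak" if w < 8 else "strong"

# (temperature in Celsius, wind speed in km/h, Mary's perception)
data = [
    (10, 0, "Cold"), (25, 0, "Warm"), (15, 5, "Cold"), (20, 3, "Warm"),
    (18, 7, "Cold"), (20, 10, "Cold"), (22, 5, "Warm"), (24, 6, "Warm"),
]

# Group Mary's perceptions by the discretized (temperature, wind) pair.
groups = defaultdict(Counter)
for t, w, label in data:
    groups[(temperature_class(t), wind_class(w))][label] += 1

# Classify 16 degrees Celsius with a 3 km/h wind by majority vote within
# its interval class - a stand-in for the leaf that a decision tree
# built on the discretized attributes would reach.
query = (temperature_class(16), wind_class(3))
print(query, groups[query].most_common(1)[0][0])  # ('low', 'weak') Cold
```

With these particular cut-offs the sample is classified as Cold, but moving the temperature boundary can flip the answer, which is exactly the point above: the choice of intervals carries most of the analytical weight.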