We have the following data about the shopping preferences of our friend, Jane:
Temperature | Rain   | Shopping
----------- | ------ | --------
Cold        | None   | Yes
Warm        | None   | No
Cold        | Strong | Yes
Cold        | None   | No
Warm        | Strong | No
Warm        | None   | Yes
Cold        | None   | ?
We would like to find out, using a decision tree, whether Jane would go shopping if the outside temperature were cold with no rain.
Analysis:
Here we should be careful, as there are instances in the data that have the same attribute values but different classes, namely (Cold, None, Yes) and (Cold, None, No). The program we made would form the following decision tree:
Root
├── [Temperature=Cold]
│   ├── [Rain=None]
│   │   └── [Shopping=Yes]
│   └── [Rain=Strong]
│       └── [Shopping=Yes]
└── [Temperature=Warm]
    ├── [Rain=None]
    │   └── [Shopping=No]
    └── [Rain=Strong]
        └── [Shopping=No]
But the leaf node [Rain=None] with the parent [Temperature=Cold] contains two data samples, one of each class, Yes and No. We therefore cannot classify the instance (Cold, None, ?) accurately. For the decision tree algorithm to work better, we would have to either assign to such a leaf node the class with the greatest weight, that is, the majority class (here the two classes are tied with one sample each, so even a majority vote would be inconclusive), or, even better, collect values for more attributes for the data samples so that we can make the decision more accurately.
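The class conflict at that leaf can be checked programmatically. The following is a minimal sketch (the function name `leaf_distribution` and the in-code data representation are illustrative, not from the original program): it counts the classes of all training samples that would reach a given leaf, which is also the basis for a majority-class decision.

```python
from collections import Counter

# Jane's shopping data from the table above, as (Temperature, Rain, Shopping).
data = [
    ("Cold", "None", "Yes"),
    ("Warm", "None", "No"),
    ("Cold", "Strong", "Yes"),
    ("Cold", "None", "No"),
    ("Warm", "Strong", "No"),
    ("Warm", "None", "Yes"),
]

def leaf_distribution(temperature, rain):
    """Count the classes of the samples reaching the leaf (temperature, rain)."""
    return Counter(label for t, r, label in data
                   if t == temperature and r == rain)

# The leaf relevant to the question (Cold, None, ?):
dist = leaf_distribution("Cold", "None")
print(dist)  # one Yes and one No: the classes are tied

# A majority vote would pick the most common class at the leaf,
# but with a 1-1 tie the choice is arbitrary.
majority_class, count = dist.most_common(1)[0]
print(majority_class, count)
```

Running this confirms the analysis: the (Cold, None) leaf holds one Yes and one No sample, so no class has the greatest weight there.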
Therefore, given the available data, we are uncertain whether Jane would go shopping or not.