Getting ready

Suppose someone takes tennis lessons in the morning. The night before, the instructor checks the weather report and decides whether the next morning will be suitable for tennis. This recipe uses that decision as an example to build a decision tree.

Let's pick the weather features that affect the decision to play tennis in the morning:

  • Rain
  • Wind speed
  • Temperature

Let's build a table using different combinations of these features:

Rain  Windy  Temperature  Play tennis?
Yes   Yes    Hot          No
Yes   Yes    Normal       No
Yes   Yes    Cool         No
No    Yes    Hot          No
No    Yes    Cool         No
No    No     Hot          Yes
No    No     Normal       Yes
No    No     Cool         No

Now how do we build a decision tree? We can start with any one of the three features: rain, wind speed, or temperature. The rule is to start with the feature that yields the maximum information gain.

Information gain means identifying the fastest way to reach a decision. In fact, it is a concept that we use in everyday life. As a busy executive, I get hundreds of e-mails every day, with spam and important client inquiries mixed in the same stream, so no e-mail can be ignored without being processed. My goal while processing e-mails is to maximize information gain. The first step for me is to determine whether I'm looking for information or an outcome. I start with the subject and then proceed to the first few lines. By this time, I have an idea about what needs to be done with the e-mail: for example, send a quick reply, forward or delegate it, star it for later processing, mark it as spam, block the sender, and so on. Every now and then, I get e-mails that are complete essays; because they yield so little information for the time they take to read, they get ignored.

On a rainy day, as the table shows, the other features do not matter and there is no play. The same is true on a windy day.
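To make the idea concrete, here is a minimal sketch that scores each feature of the table by information gain. It assumes the common entropy-based definition (entropy of the labels minus the weighted entropy after the split); the text does not say which impurity measure is intended, and the names `ROWS`, `entropy`, and `information_gain` are illustrative only.

```python
from collections import Counter
from math import log2

# Rows of the table above: (rain, windy, temperature, play)
ROWS = [
    ("Yes", "Yes", "Hot", "No"),
    ("Yes", "Yes", "Normal", "No"),
    ("Yes", "Yes", "Cool", "No"),
    ("No", "Yes", "Hot", "No"),
    ("No", "Yes", "Cool", "No"),
    ("No", "No", "Hot", "Yes"),
    ("No", "No", "Normal", "Yes"),
    ("No", "No", "Cool", "No"),
]
FEATURES = ["rain", "windy", "temperature"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Entropy of the labels minus the weighted entropy after splitting on one column."""
    labels = [row[-1] for row in rows]
    groups = {}
    for row in rows:
        # Collect the labels that fall under each value of the chosen column.
        groups.setdefault(row[column], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(FEATURES)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("best first split:", best)
```

On these eight rows, the windy column comes out with the highest gain under this measure, so it is a natural candidate for the first split.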

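Decision trees, like most other algorithms, take feature values only as doubles. So let's map each value to a double:

  • Rain: Yes = 1.0, No = 0.0
  • Windy: Yes = 1.0, No = 0.0
  • Temperature: Hot = 2.0, Normal = 1.0, Cool = 0.0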
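The mapping can be sketched in Python as follows. The numeric codes are inferred from the rows of tennis.csv, and the names `YES_NO`, `TEMPERATURE`, and `encode` are hypothetical helpers for illustration, not part of any library.

```python
# Numeric codes inferred from the rows of tennis.csv; the names are illustrative.
YES_NO = {"No": 0.0, "Yes": 1.0}
TEMPERATURE = {"Cool": 0.0, "Normal": 1.0, "Hot": 2.0}


def encode(rain, windy, temperature, play):
    """Return one CSV line: label first, then rain, windy, temperature."""
    values = [YES_NO[play], YES_NO[rain], YES_NO[windy], TEMPERATURE[temperature]]
    return ",".join(str(v) for v in values)


print(encode("Yes", "Yes", "Hot", "No"))    # first row of the table
print(encode("No", "No", "Normal", "Yes"))  # seventh row of the table
```

Encoding the first table row this way gives `0.0,1.0,1.0,2.0`, which matches the first line of the CSV file.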

For the label, the positive class (play) is 1.0 and the negative class (no play) is 0.0. Let's store the data in CSV format, with the first value on each line as the label, followed by rain, windy, and temperature:

$ vi tennis.csv
0.0,1.0,1.0,2.0
0.0,1.0,1.0,1.0
0.0,1.0,1.0,0.0
0.0,0.0,1.0,2.0
0.0,0.0,1.0,0.0
1.0,0.0,0.0,2.0
1.0,0.0,0.0,1.0
0.0,0.0,0.0,0.0
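As a quick check of the label-first layout, the following sketch parses the rows into (label, features) pairs using only the Python standard library. The file contents are inlined so the example runs without tennis.csv on disk, and `parse_labeled_rows` is a hypothetical helper name.

```python
import csv
import io

# Contents of tennis.csv as listed above, inlined so the sketch runs without the file.
TENNIS_CSV = """\
0.0,1.0,1.0,2.0
0.0,1.0,1.0,1.0
0.0,1.0,1.0,0.0
0.0,0.0,1.0,2.0
0.0,0.0,1.0,0.0
1.0,0.0,0.0,2.0
1.0,0.0,0.0,1.0
0.0,0.0,0.0,0.0
"""


def parse_labeled_rows(text):
    """Split each CSV line into (label, [features]), treating the first value as the label."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        values = [float(v) for v in record]
        rows.append((values[0], values[1:]))
    return rows


data = parse_labeled_rows(TENNIS_CSV)
positives = sum(1 for label, _ in data if label == 1.0)
print(len(data), "rows,", positives, "positive")
```

To read the real file instead, the same function could be called with the text returned by `open("tennis.csv").read()`.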