Gini index

The Gini index is described alternatively as a measure of variance and as a measure of purity. Let's start with its definition to understand what it means:

$$G = \sum_{k=1}^{K} p_k (1 - p_k)$$

where the sum runs over the $K$ classes.

Here $p_k$ is the proportion of records belonging to class $k$ within the given group. Take, for instance, a decision tree with a first split on variable A. You take one of the two halves and compute what share of its records belongs to class $k$ (for instance, default). If you think about it, this proportion is a measure of purity within that half: a value of $p_k$ close to one means that the group is composed mainly of records from a single class, that is, the chosen variable and cutoff discriminate our population well.

Once you understand this, it is easy to understand the meaning of the whole index. The sum of the terms $p_k (1 - p_k)$ will be low if all, or most, of the $p_k$ are either close to zero or close to one, because each term vanishes at both extremes. This is reasonable: if a decision tree discriminates well, its halves will each have one class with a $p_k$ close to one and the other classes with a $p_k$ close to zero.
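To make the behavior of the index concrete, here is a minimal Python sketch that computes the sum of $p_k (1 - p_k)$ for one group. The function name `gini` and the example labels are illustrative assumptions, not taken from the text.

```python
from collections import Counter

def gini(labels):
    """Gini index of one group: sum over classes k of p_k * (1 - p_k)."""
    n = len(labels)
    if n == 0:
        raise ValueError("cannot compute the Gini index of an empty group")
    # p_k = share of the records in the group that belong to class k
    proportions = [count / n for count in Counter(labels).values()]
    return sum(p * (1 - p) for p in proportions)

# Hypothetical halves produced by a split (labels invented for illustration)
pure_half = ["default"] * 10
mixed_half = ["default"] * 5 + ["no default"] * 5
mostly_pure_half = ["default"] * 9 + ["no default"]

print(gini(pure_half))         # 0.0: a single class, perfectly pure
print(gini(mixed_half))        # 0.5: two classes in equal shares
print(gini(mostly_pure_half))  # about 0.18: close to pure
```

The pure half scores zero, the evenly mixed half scores the two-class maximum of 0.5, and the mostly pure half falls in between, close to zero.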

The final takeaway is: the smaller the Gini index, the purer the groups produced by the split, and the better the model.
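As a rough illustration of this takeaway, the sketch below compares two candidate splits by averaging the Gini index of their halves, weighted by the size of each half. The size weighting is a common convention that is assumed here, not stated in the text; the names `split_gini`, `split_on_a`, and `split_on_b`, and the labels, are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini index of one group: sum over classes k of p_k * (1 - p_k)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def split_gini(left, right):
    """Size-weighted average of the Gini index of two halves (assumed convention)."""
    total = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / total

# Hypothetical labels: "D" = default, "N" = no default
split_on_a = (["D"] * 8 + ["N"] * 2, ["D"] * 2 + ["N"] * 8)  # halves mostly pure
split_on_b = (["D"] * 5 + ["N"] * 5, ["D"] * 5 + ["N"] * 5)  # halves fully mixed

score_a = split_gini(*split_on_a)
score_b = split_gini(*split_on_b)
print(score_a < score_b)  # True: the split on A separates the classes better
```

Under these assumptions, the split whose halves are closer to a single class gets the lower score and would be preferred.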
