Mutual information

For feature selection, we should not focus on the type of relationship, as we did in the previous section with linear relationships. Instead, we should think in terms of how much information one feature provides, given that we already have another.

To understand this, let's pretend that we want to use the features house_size, number_of_levels, and avg_rent_price to train a classifier that determines whether a house has an elevator or not. In this example, we can intuitively see that, once we know house_size, we don't need number_of_levels anymore, as it contains somewhat redundant information. With avg_rent_price, it's different, because we cannot infer the average rental price simply from the size of the house or the number of levels it has. Hence, it would be wise to keep only one of house_size and number_of_levels, in addition to the average rental price.

Mutual information formalizes the aforementioned reasoning by calculating how much information the two features have in common. However, unlike correlation, it does not rely on the sequence of data, but on the distribution. To understand how it works, we have to dive into information entropy.

Let's assume we have a fair coin. Before we flip it, we will have maximum uncertainty as to whether it will show heads or tails, as each outcome has an equal probability of 50 percent. This uncertainty can be measured by means of Claude Shannon's information entropy:

H(X) = -Σi p(Xi) · log2(p(Xi))

In our fair coin case, we have two outcomes: let X0 be the case of heads and X1 the case of tails, with p(X0) = p(X1) = 0.5.

Hence, this results in the following:

H(X) = -0.5 · log2(0.5) - 0.5 · log2(0.5) = 1

For convenience, we can also use scipy.stats.entropy([0.5, 0.5], base=2). We set the base parameter to 2 to get the same result as earlier. Otherwise, the function will use the natural logarithm via np.log(). In general, the base does not matter (as long as you use it consistently).
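As a quick check, here is a minimal snippet (assuming NumPy and SciPy are available) that reproduces the fair-coin entropy both by hand and with SciPy:

import numpy as np
from scipy.stats import entropy

# Entropy of a fair coin, computed directly with base-2 logarithms
h_manual = -(0.5 * np.log2(0.5) + 0.5 * np.log2(0.5))
print(h_manual)                      # 1.0

# The same value via SciPy; base=2 gives the result in bits
print(entropy([0.5, 0.5], base=2))   # 1.0

# Without base, SciPy uses the natural logarithm (nats)
print(entropy([0.5, 0.5]))           # 0.693... = ln(2)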

Now, imagine that we knew upfront that the coin is actually not that fair, with heads having a 60 percent chance of showing up after flipping:

H(X) = -0.6 · log2(0.6) - 0.4 · log2(0.4) ≈ 0.97

We can see that this situation is less uncertain. The uncertainty will decrease the further away we get from 0.5, reaching the extreme value of 0 for either 0 percent or 100 percent probability of heads showing up, as we can see in the following graph:
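To see this effect numerically, a small sketch like the following (again using scipy.stats.entropy) evaluates the entropy for a range of head probabilities:

from scipy.stats import entropy

# Entropy of a coin as a function of the probability of heads
for p_heads in [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    h = entropy([p_heads, 1 - p_heads], base=2)
    print("p(heads)=%.2f  H(X)=%.2f" % (p_heads, h))

# p(heads)=0.00  H(X)=0.00
# p(heads)=0.10  H(X)=0.47
# p(heads)=0.25  H(X)=0.81
# p(heads)=0.50  H(X)=1.00
# p(heads)=0.75  H(X)=0.81
# p(heads)=0.90  H(X)=0.47
# p(heads)=1.00  H(X)=0.00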

We will now modify the entropy, H(X), by applying it to two features instead of one, in such a way that it measures how much uncertainty is removed from X when we learn about Y. This way, we can capture how much one feature reduces the uncertainty of another.

For example, without having any further information about the weather, we are totally uncertain whether or not it's raining outside. If we now learn that the grass outside is wet, the uncertainty has been reduced (we will still have to check whether the sprinkler had been turned on).

More formally, mutual information is defined as follows:

I(X;Y) = Σi Σj p(Xi, Yj) · log2( p(Xi, Yj) / (p(Xi) · p(Yj)) )

This looks a bit intimidating, but it is really nothing more than sums and products. For instance, the calculation of p(Xi), p(Yj), and p(Xi, Yj) can be done by binning the feature values and then calculating the fraction of values in each bin. In the following plots, we have set the number of bins to ten.

In order to restrict mutual information to the interval [0, 1], we have to divide it by the sum of the individual entropies, which gives us the following normalized mutual information:

NI(X;Y) = I(X;Y) / (H(X) + H(Y))

The code is as follows:

import numpy as np
from scipy.stats import entropy

def normalized_mutual_info(x, y, bins=10):
    # Joint and marginal bin counts (the bin edges are not needed further)
    counts_xy, _, _ = np.histogram2d(x, y, bins=(bins, bins))
    counts_x, _ = np.histogram(x, bins=bins)
    counts_y, _ = np.histogram(y, bins=bins)

    # Add-one smoothing, as we have seen in the previous chapters,
    # so that no bin has a zero probability inside the logarithm
    counts_xy += 1
    counts_x += 1
    counts_y += 1

    # Turn counts into joint and marginal probability estimates
    P_xy = counts_xy / np.sum(counts_xy)
    P_x = counts_x / np.sum(counts_x)
    P_y = counts_y / np.sum(counts_y)

    # I(X;Y) = sum of P_xy * log2(P_xy / (P_x * P_y)) over all bin pairs
    I_xy = np.sum(P_xy * np.log2(P_xy / (P_x.reshape(-1, 1) * P_y)))

    # Normalize by the sum of the individual entropies (base 2 for consistency)
    return I_xy / (entropy(counts_x, base=2) + entropy(counts_y, base=2))

The nice thing about mutual information is that, unlike correlation, it does not only look at linear relationships, as we can see in the following graphs:

As we can see, mutual information is able to indicate the strength of a linear relationship. The following diagram shows that it also works for squared relationships:
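As a quick sanity check, a minimal sketch along the following lines (the synthetic data, noise level, and seed are illustrative choices, not taken from the plots above) feeds normalized_mutual_info() a linear, a squared, and an independent relationship:

import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, 1000)
noise = rng.normal(0, 0.1, 1000)

# Linear relationship: also visible to correlation
print(normalized_mutual_info(x, x + noise))

# Squared relationship: invisible to (linear) correlation,
# but still picked up by mutual information
print(normalized_mutual_info(x, x ** 2 + noise))

# Two independent features should score much lower
print(normalized_mutual_info(x, rng.uniform(-1, 1, 1000)))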

So, what we would have to do is calculate the normalized mutual information for all feature pairs. For every pair with too high a value (we would have to determine what this means), we would then drop one of them. In the case of regression, we could drop the feature that has too little mutual information with the desired result value.
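A minimal sketch of such a pairwise filter, assuming the features are the columns of a NumPy array X (the threshold of 0.5 is an arbitrary placeholder, not a recommendation), could look like this:

from itertools import combinations

def find_redundant_pairs(X, threshold=0.5):
    # Compare every pair of feature columns exactly once
    redundant = []
    for i, j in combinations(range(X.shape[1]), 2):
        nmi = normalized_mutual_info(X[:, i], X[:, j])
        if nmi > threshold:
            # Candidates for dropping one of the two features
            redundant.append((i, j, nmi))
    return redundant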

This might work for a smallish set of features. At some point, however, this procedure can be really expensive, because the amount of calculation grows quadratically with the number of features.

Another huge disadvantage of filters is that they drop features that don't seem to be useful in isolation. More often than not, there are a handful of features that seem to be totally independent of the target variable, yet when combined, they rock. To keep these, we need wrappers.
