Mutual information between two random variables tells us the amount of information we can obtain about one random variable by observing the other. The mutual information between two random variables x and y can be given as follows:

$$I(x; y) = H(y) - H(y \mid x)$$
It is basically the difference between the entropy of y and the conditional entropy of y given x.
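To make this concrete, here is a small NumPy sketch (the joint distribution `p_xy` is an arbitrary made-up example, not taken from the text) that computes the mutual information of two discrete random variables directly as the difference $H(y) - H(y \mid x)$:

```python
import numpy as np

# A made-up joint distribution p(x, y) over two binary variables,
# rows indexed by x and columns indexed by y.
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])

p_x = p_xy.sum(axis=1)  # marginal p(x)
p_y = p_xy.sum(axis=0)  # marginal p(y)

# Entropy H(y) = -sum_y p(y) log p(y)
H_y = -np.sum(p_y * np.log(p_y))

# Conditional entropy H(y|x) = -sum_{x,y} p(x, y) log p(y|x)
p_y_given_x = p_xy / p_x[:, None]
H_y_given_x = -np.sum(p_xy * np.log(p_y_given_x))

# Mutual information I(x; y) = H(y) - H(y|x)
mutual_info = H_y - H_y_given_x
print(f"I(x; y) = {mutual_info:.4f} nats")
```

For this example the result is positive, meaning that observing x reduces our uncertainty about y.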
Mutual information between the code c and the generator output $G(z, c)$ tells us how much information we can obtain about c through $G(z, c)$. If the mutual information between c and $G(z, c)$ is high, then we can say that knowing the generator output helps us to infer c. But if the mutual information is low, then we cannot infer c from the generator output. Our goal is to maximize the mutual information.

The mutual information between the code c and the generator output $G(z, c)$, written $I(c; G(z, c))$, can be given as follows:

$$I(c; G(z, c)) = H(c) - H(c \mid G(z, c))$$
Let's look at the elements of the formula:
- $H(c)$ is the entropy of the code c
- $H(c \mid G(z, c))$ is the conditional entropy of the code c given the generator output $G(z, c)$
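One point worth making explicit, since it follows directly from the formula above: for a fixed prior over the code, $H(c)$ is a constant, so maximizing the mutual information is the same as minimizing the conditional entropy of the code given the generator output:

$$\max_{G} \; I(c; G(z, c)) = H(c) - \min_{G} \; H(c \mid G(z, c))$$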
But the problem is, how do we compute $H(c \mid G(z, c))$? To compute this value, we need to know the posterior, $p(c \mid x)$, which we don't know yet. So, we estimate the posterior with an auxiliary distribution, $Q(c \mid x)$:

$$Q(c \mid x) \approx p(c \mid x)$$
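In InfoGAN, $Q(c \mid x)$ is typically parameterized as a small neural network that looks at a generated sample and predicts a distribution over the code. The following is only an illustrative PyTorch-style sketch (the class name `QNetwork` and the layer sizes are assumptions; in practice, Q usually shares most of its layers with the discriminator rather than standing alone):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Auxiliary network approximating the posterior Q(c|x) for a categorical code."""
    def __init__(self, x_dim=784, code_dim=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim, 128),
            nn.ReLU(),
            nn.Linear(128, code_dim),  # logits over the categorical code c
        )

    def forward(self, x):
        # Returns logits; applying softmax gives the probabilities Q(c|x).
        return self.net(x)
```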
Let's say $x = G(z, c)$; then, replacing the true posterior $p(c \mid x)$ with the auxiliary distribution $Q(c \mid x)$, we can deduce a lower bound on the mutual information as follows:

$$
\begin{aligned}
I(c; G(z, c)) &= H(c) - H(c \mid G(z, c)) \\
&= \mathbb{E}_{x \sim G(z, c)}\big[\mathbb{E}_{c' \sim p(c \mid x)}[\log p(c' \mid x)]\big] + H(c) \\
&\geq \mathbb{E}_{x \sim G(z, c)}\big[\mathbb{E}_{c' \sim p(c \mid x)}[\log Q(c' \mid x)]\big] + H(c)
\end{aligned}
$$

The inequality holds because we have only dropped the KL divergence between $p(c \mid x)$ and $Q(c \mid x)$, which is always non-negative. Thus, we can say:

$$I(c; G(z, c)) \geq \mathbb{E}_{c \sim p(c),\, x \sim G(z, c)}[\log Q(c \mid x)] + H(c) = L_I(G, Q)$$

The right-hand side, $L_I(G, Q)$, is a variational lower bound on the mutual information that we can compute and maximize directly.
Maximizing the mutual information (in practice, maximizing the lower bound $L_I(G, Q)$) basically means that we are maximizing our knowledge about c given the generated output, that is, knowing about one variable through another.
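To show how this lower bound is typically turned into a training objective, here is a minimal PyTorch-style sketch (the `generator` and `q_network` arguments are hypothetical placeholders, and the dimensions are arbitrary). For a categorical code drawn from a fixed uniform prior, $H(c)$ is a constant, so maximizing $\mathbb{E}[\log Q(c \mid x)]$ reduces to minimizing an ordinary cross-entropy loss between the sampled code and Q's prediction:

```python
import torch
import torch.nn.functional as F

def mutual_info_loss(generator, q_network, batch_size=64, noise_dim=62, code_dim=10):
    """Negative of E[log Q(c|x)], i.e. minus the lower bound up to the constant H(c)."""
    # Sample noise z and a categorical code c from its (uniform) prior p(c).
    z = torch.randn(batch_size, noise_dim)
    c = torch.randint(0, code_dim, (batch_size,))
    c_onehot = F.one_hot(c, code_dim).float()

    # Generate samples x = G(z, c) by feeding the concatenated noise and code.
    x = generator(torch.cat([z, c_onehot], dim=1))

    # Q(c|x): predicted logits over the code for each generated sample.
    q_logits = q_network(x)

    # Cross-entropy equals -E[log Q(c|x)], so minimizing this loss
    # maximizes the mutual information lower bound.
    return F.cross_entropy(q_logits, c)
```

In InfoGAN, this term is added, with a weighting coefficient, to the objectives of both the generator and Q, so that the generator learns to produce outputs from which the code can be recovered.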