Before going ahead, let's get familiar with the notations:
- Let's represent the distribution of the input dataset by p_θ(x), where θ represents the parameters of the network that will be learned during training
- We represent the latent variable by z, which encodes all the properties of the input x by sampling from the distribution p_θ(z)
- p_θ(x, z) denotes the joint distribution of the input x with its properties z
- p_θ(z) represents the distribution of the latent variable
Using Bayes' theorem, we can write the following:

p_θ(z|x) = p_θ(x|z) p_θ(z) / p_θ(x)

The preceding equation helps us to compute the probability distribution of the input dataset. But the problem lies in computing the posterior p_θ(z|x), because the denominator p_θ(x) = ∫ p_θ(x|z) p_θ(z) dz requires integrating over all configurations of the latent variable, which is intractable. Thus, we need to find a tractable way to estimate p_θ(z|x). Here, we introduce a concept called variational inference.
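To see concretely what the marginal p_θ(x) = ∫ p_θ(x|z) p_θ(z) dz involves, here is a minimal sketch with a hypothetical one-dimensional toy model (a standard normal prior and a Gaussian likelihood, chosen only for illustration; these functions are not from the text). In one dimension the integral can be approximated on a grid, but with high-dimensional latent variables this marginalization becomes intractable, which is exactly the problem variational inference addresses:

```python
import numpy as np

# Hypothetical toy model: prior p(z) = N(0, 1), likelihood p(x|z) = N(z, 0.5^2)
def p_z(z):
    return np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def p_x_given_z(x, z, sigma=0.5):
    return np.exp(-(x - z)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def p_x(x):
    """Marginal likelihood p(x) = integral of p(x|z) p(z) dz, on a grid."""
    grid = np.linspace(-10, 10, 10001)
    dz = grid[1] - grid[0]
    return np.sum(p_x_given_z(x, grid) * p_z(grid)) * dz

# For this Gaussian pair the integral happens to have a closed form,
# p(x) = N(0, 1 + 0.5^2), so we can sanity-check the numerical result.
closed_form = np.exp(-1.0**2 / (2 * 1.25)) / np.sqrt(2 * np.pi * 1.25)
print(p_x(1.0), closed_form)
```

Grid integration works only because z is one-dimensional here; a realistic latent space has many dimensions, and the cost of such a grid grows exponentially with each added dimension.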
Instead of inferring the distribution p_θ(z|x) directly, we approximate it using another distribution, say a Gaussian distribution q_φ(z|x). That is, we use q_φ(z|x), which is basically a neural network parameterized by φ, to estimate the value of p_θ(z|x):

q_φ(z|x) ≈ p_θ(z|x)

- q_φ(z|x) is basically our probabilistic encoder; that is, it tries to create a latent vector z given the input x
- p_θ(x|z) is the probabilistic decoder; that is, it tries to reconstruct the input x given the latent vector z
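The encoder/decoder pairing above can be sketched in a few lines of numpy. This is a minimal illustration, not a trained model: the weight matrices stand in for the parameters φ and θ, the layer sizes are arbitrary, and the encoder outputs the mean and log-variance of the Gaussian q_φ(z|x), from which z is sampled via the reparameterization z = μ + σ·ε:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for this sketch
x_dim, h_dim, z_dim = 8, 16, 2

# Randomly initialized weights stand in for the learned parameters phi, theta
W_enc = rng.normal(0, 0.1, (x_dim, h_dim))
W_mu = rng.normal(0, 0.1, (h_dim, z_dim))
W_logvar = rng.normal(0, 0.1, (h_dim, z_dim))
W_dec1 = rng.normal(0, 0.1, (z_dim, h_dim))
W_dec2 = rng.normal(0, 0.1, (h_dim, x_dim))

def encoder(x):
    """q_phi(z|x): map input x to the mean and log-variance of a Gaussian."""
    h = np.tanh(x @ W_enc)
    return h @ W_mu, h @ W_logvar

def sample_z(mu, logvar):
    """Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decoder(z):
    """p_theta(x|z): try to reconstruct the input x from the latent vector z."""
    h = np.tanh(z @ W_dec1)
    return h @ W_dec2

x = rng.normal(size=x_dim)       # a dummy input vector
mu, logvar = encoder(x)
z = sample_z(mu, logvar)         # latent vector, shape (2,)
x_hat = decoder(z)               # reconstruction, shape (8,)
```

In a real VAE, these weights would be trained by maximizing the evidence lower bound, which jointly fits the encoder's approximation q_φ(z|x) and the decoder's reconstruction p_θ(x|z).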
The following diagram clarifies the notations and what we have seen so far: