Mixture of experts

The idea behind mixture of experts is to fit a separate linear regression on each sub-space of the original data space and to combine them with weighting functions that determine how much each regression contributes at any given point.

Consider the following example data set, generated with this toy code:

# covariates for the two sub-regions of the data space
x1 = runif(40, 0, 10)
x2 = runif(40, 10, 20)

# Gaussian noise, one vector per sub-region
e1 = rnorm(40, 0, 2)
e2 = rnorm(40, 0, 3)

# two different linear relationships
y1 = 1 + 2.5 * x1 + e1
y2 = 35 - 1.5 * x2 + e2

# the full data set
xx = c(x1, x2)
yy = c(y1, y2)

Plotting the data and fitting a simple linear regression on it gives the following:

(Figure: the data set with a single linear regression line)
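As a quick aside, this plot and the single regression line can be reproduced with a sketch along these lines, using the xx and yy vectors defined above:

# fit a single linear regression on the whole data set
fit.all = lm(yy ~ xx)

# scatter plot of the data with the fitted line
plot(xx, yy, pch = 19, xlab = "x", ylab = "y")
abline(fit.all, col = "blue", lwd = 2)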

Obviously, a single linear regression does not capture the behavior of the data at all: it only picks up a general trend that more or less averages the data set.

The idea of mixture of experts is to have several sub-models within a bigger model, for example several regression lines, as in the following graph:

(Figure: the data set with two candidate regression lines, in red and green)

In this graph, the red and green lines seem to represent the data set much better. However, the model needs to decide when to use each one. Again, a mixture model could be a solution, except that, in this case, we want the mixture weights to depend on the data points. So the model will be a bit different:
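To make this concrete, the two lines in the previous graph could be obtained by fitting a separate regression on each half of the toy data. The split is known here only because we generated the data; learning it is precisely the job of the gating function introduced below:

# one regression per sub-region (x1/y1 and x2/y2 come from the toy code above)
fit.1 = lm(y1 ~ x1)
fit.2 = lm(y2 ~ x2)

plot(xx, yy, pch = 19, xlab = "x", ylab = "y")
abline(fit.1, col = "red", lwd = 2)
abline(fit.2, col = "green", lwd = 2)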

p(y_i | x_i, z_i = k, θ) = N(y_i | w_k^T x_i, σ_k^2)

This is the linear model as we know it. Next, we introduce the dependence of the latent variable on the data points with:

p(z_i = k | x_i, θ) = S(v_k^T x_i)

Here, S(.) is, for example, a sigmoid function. The function p(z_i | x_i, θ) is usually called the gating function.
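For a two-expert model, such a sigmoid gate can be written directly in R. This is only a sketch, and the parameter names v0 and v1 are ours:

# sigmoid gating function for two experts:
# gate1 gives p(z_i = 1 | x_i), and expert 2 gets the complement
sigmoid = function(t) 1 / (1 + exp(-t))
gate1 = function(x, v0, v1) sigmoid(v0 + v1 * x)
gate2 = function(x, v0, v1) 1 - gate1(x, v0, v1)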

The graphical model associated with such a model is quite different now because it introduces a dependency between the latent variable and the observations:

(Figure: graphical model of the mixture of experts, with the latent variable depending on the observation)

In general, mixture of experts models use a softmax gating function, such that:

p(z_i = k | x_i, θ) = exp(v_k^T x_i) / Σ_j exp(v_j^T x_i)
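A direct transcription of this softmax gate in R, for K experts with linear score functions, could look like the following sketch; the matrix V of gating parameters, one row per expert with an intercept and a slope, is an assumption of ours:

# softmax gating: returns K probabilities that sum to one at point x
softmax.gate = function(x, V) {
  scores = V[, 1] + V[, 2] * x            # linear score of each expert
  exp(scores - max(scores)) / sum(exp(scores - max(scores)))
}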

The EM algorithm is usually a good way to fit such a model. For example, the mixtools package includes the function hmeEM to fit mixture of experts models. At the time of writing, this function is limited to two clusters.
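As an illustration, fitting the toy data with hmeEM might look like the sketch below; the argument order is assumed from the package documentation, so check ?hmeEM before relying on it:

library(mixtools)

# two-expert mixture of experts fitted by EM on the toy data
# (argument order assumed; see ?hmeEM for the exact interface)
fit.moe = hmeEM(yy, xx, k = 2)
str(fit.moe)   # inspect the estimated expert and gating parameters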

The gating functions must sum to one at each point; in our example we could, for instance, use two sigmoids with the following effect:

(Figure: two complementary sigmoid gating functions summing to one at every point)
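For instance, the previous figure can be mimicked by instantiating gate1 and gate2 from the earlier sketch with v0 = 20 and v1 = -2, an arbitrary choice that puts the transition around x = 10:

# two complementary gates, reusing gate1 and gate2 from the sketch above
curve(gate1(x, 20, -2), from = 0, to = 20, col = "red", ylab = "gating weight")
curve(gate2(x, 20, -2), from = 0, to = 20, col = "green", add = TRUE)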

Such a combination can give a final model that explains the initial data set much better, as in this graph:

(Figure: the final mixture of experts fit on the data set)
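Combining the pieces from the previous sketches (fit.1, fit.2, gate1, and gate2), the final prediction is simply the gate-weighted sum of the two expert regressions; again, this is only an illustrative sketch:

# final mixture of experts prediction: gate-weighted sum of the two experts
moe.pred = function(x) {
  y1.hat = predict(fit.1, newdata = data.frame(x1 = x))
  y2.hat = predict(fit.2, newdata = data.frame(x2 = x))
  gate1(x, 20, -2) * y1.hat + gate2(x, 20, -2) * y2.hat
}

plot(xx, yy, pch = 19, xlab = "x", ylab = "y")
curve(moe.pred, from = 0, to = 20, col = "purple", lwd = 2, add = TRUE)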

We recommend the reader develop his or her own EM algorithm to fit such models and try different types of gating functions.
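As a starting point for this exercise, a rough EM skeleton for a two-expert model with a sigmoid gate could look like the following; all names are ours, and many practical details (convergence checks, numerical safeguards, multiple restarts) are omitted:

sigmoid = function(t) 1 / (1 + exp(-t))

moe.em = function(x, y, n.iter = 50) {
  # initial guess: split the data at the median of x
  left = x < median(x)
  f1 = lm(y ~ x, subset = left)
  f2 = lm(y ~ x, subset = !left)
  s1 = sd(residuals(f1)); s2 = sd(residuals(f2))
  v = c(0, 0)                              # gating parameters (intercept, slope)

  for (it in 1:n.iter) {
    # E-step: responsibility of expert 1 for each point
    g1 = sigmoid(v[1] + v[2] * x)
    d1 = g1 * dnorm(y, predict(f1, data.frame(x = x)), s1)
    d2 = (1 - g1) * dnorm(y, predict(f2, data.frame(x = x)), s2)
    r1 = d1 / (d1 + d2)

    # M-step: weighted regression for each expert...
    f1 = lm(y ~ x, weights = r1)
    f2 = lm(y ~ x, weights = 1 - r1)
    s1 = sqrt(sum(r1 * residuals(f1)^2) / sum(r1))
    s2 = sqrt(sum((1 - r1) * residuals(f2)^2) / sum(1 - r1))
    # ...and a logistic fit of the responsibilities for the gate
    v = coef(glm(r1 ~ x, family = quasibinomial))
  }
  list(expert1 = f1, expert2 = f2, gate = v)
}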

Techniques such as shrinkage, or a Bayesian treatment of the parameters, can also help avoid over-fitting, which becomes a problem when the number of sub-models grows quickly.
