Learning as an optimization

In the previous sections, we saw various ways of evaluating our models and of defining the loss functions that we want to minimize. This suggests that a learning task can be viewed as an optimization problem. In an optimization problem, we are provided with a hypothesis space, which in this case is the set of all possible models, along with an objective function, on the basis of which we select the best-fitting model from the hypothesis space. In this section, we will discuss the various choices of objective functions and how they affect our learning task.

Empirical risk and overfitting

Let's consider the task of selecting a model, M, that minimizes the expectation of some loss function, E_{P*}[loss(ξ, M)], where P* is the original data-generating distribution. As we don't know P*, we generally use the dataset, D, that we have to get an empirical estimate of the expectation. Using D, we can define an empirical distribution, P̂_D, as follows:

P̂_D(A) = (1 / |D|) Σ_{ξ ∈ D} 1{ξ ∈ A}

Putting this in simple words, for some event, A, we assign it a probability equal to the fraction of our samples in which we have seen this event. Therefore, as we get more and more samples from the original distribution, P*, the empirical distribution, P̂_D, keeps getting closer and closer to the original distribution.
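The definition above can be sketched in a few lines of Python. This is a minimal illustration, not library code; the function name and the coin-flip samples are hypothetical:

```python
from collections import Counter

def empirical_distribution(samples):
    """Estimate P_hat_D(x) = count(x) / |D| from a list of observed assignments."""
    counts = Counter(samples)
    total = len(samples)
    return {x: c / total for x, c in counts.items()}

# Hypothetical dataset D: four samples of a single binary variable
samples = ["heads", "tails", "heads", "heads"]
p_hat = empirical_distribution(samples)
# p_hat assigns 0.75 to "heads" and 0.25 to "tails"
```

Note that any assignment not present in `samples` is simply absent from the returned dictionary, which is the same as assigning it probability 0.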

However, there are a few drawbacks to this approach that we need to keep in mind to avoid getting poor results. Think of a case when we have a lot of variables in the network, let's say n. Considering that each variable can only take two different states, our joint distribution over these variables will have 2^n different assignments. Now, let's say that we are provided with 1000 distinct samples from the original distribution. If we try to find the empirical distribution using this data, we will be assigning a probability of 0.001 to each of the 1000 assignments that were given to us and will assign 0 to the remaining (2^n - 1000) assignments. In real life, we want to predict over new data using our learned model, and it is highly possible that our training data doesn't contain all the possible events. In such cases, our trained model will overfit to the training data, as it assigns 0 probability to all the events that are not present in the training data.
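A quick sketch makes the scale of the problem concrete. Here we draw 1000 random joint assignments over n = 50 binary variables as a stand-in for the dataset, D (the sampling scheme is purely illustrative), and check what fraction of the 2^n possible assignments they can cover:

```python
import random

random.seed(0)
n = 50             # number of binary variables
num_samples = 1000

# Stand-in for the dataset D: 1000 joint assignments drawn at random
dataset = {tuple(random.randint(0, 1) for _ in range(n))
           for _ in range(num_samples)}

total_assignments = 2 ** n
coverage = len(dataset) / total_assignments
print(f"Seen at most {len(dataset)} of {total_assignments} assignments "
      f"({coverage:.2e} of the joint space)")

# Any assignment outside the dataset gets empirical probability 0
unseen = tuple(1 for _ in range(n))
p_unseen = 0.0 if unseen not in dataset else 1 / num_samples
```

With n = 50, the 1000 samples cover roughly one part in 10^12 of the joint space; every other assignment, including most of the events we will see at prediction time, receives probability 0.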

So, to avoid overfitting, we can limit our hypothesis space to simpler models. This leads to yet another problem: with a limited hypothesis space, we might not be able to find a model that fits the original distribution, even if we are provided with infinite data. This type of limitation in learning introduces an inherent error in the learned model, which is known as bias. Conversely, if we have a hypothesis space with more complex models, we can, in principle, correctly learn the actual distribution, P*. In that case, though, if we have too little data, our predictions will fluctuate wildly between datasets. As a result, we will have a learned model with high variance.
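The two failure modes can be seen side by side in a small sketch. The "true" joint distribution below is hypothetical, chosen so that the two binary variables are dependent: an unrestricted full-joint estimate is unbiased but fluctuates with small samples (variance), while a restricted model that wrongly assumes independence stays smooth but never converges to the truth (bias):

```python
import random

random.seed(1)

# Hypothetical "true" joint over two dependent binary variables X, Y
true_joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def sample(n):
    outcomes, weights = zip(*true_joint.items())
    return random.choices(outcomes, weights=weights, k=n)

def full_joint_estimate(data):
    # Unrestricted model: raw empirical frequencies (low bias, high variance)
    return {xy: sum(d == xy for d in data) / len(data) for xy in true_joint}

def independent_estimate(data):
    # Restricted model: assumes P(X, Y) = P(X)P(Y), which is false here (bias)
    px = sum(x for x, _ in data) / len(data)
    py = sum(y for _, y in data) / len(data)
    return {(x, y): (px if x else 1 - px) * (py if y else 1 - py)
            for (x, y) in true_joint}
```

Even with a very large sample, `independent_estimate` converges to about 0.25 per cell instead of the true (0.4, 0.1, 0.1, 0.4); no amount of data removes that bias, whereas `full_joint_estimate` converges to the truth but is unreliable when the sample is tiny.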

In conclusion, we will always face a trade-off between bias and variance in our learned models. However, with very limited data, variance turns out to be the more dangerous of the two, as a high-variance model fails to learn anything close to the actual distribution, P*, at all.
