Chapter 4

Validating Machine Learning

IN THIS CHAPTER

check Explaining how correct sampling is critical in machine learning

check Highlighting errors dictated by bias and variance

check Proposing different approaches to validation and testing

check Warning against biased samples, overfitting, underfitting, and snooping

“I’m not running around looking for love and validation …”

— SOPHIE B. HAWKINS

Having examples (in the form of data sets) and a machine learning algorithm at hand doesn’t assure that solving a learning problem is possible or that the results will provide the desired solution. For example, if you want your computer to distinguish a photo of a dog from a photo of a cat, you can provide it with good examples of dogs and cats. You then train a dog versus cat classifier based on some machine learning algorithm that could output the probability that a given photo is a dog or a cat. Of course, the output is a probability — not an absolute assurance that the photo is a dog or cat.

Based on the probability that the classifier reports, you can decide on the class (dog or cat) of a photo. When the estimated probability is higher for a dog, you minimize the risk of a wrong assessment by choosing the dog. The greater the difference between the likelihood of a dog and that of a cat, the higher the confidence you can have in your choice. A close call likely occurs because of some ambiguity in the photo (the photo is not clear, or the dog actually looks a bit cattish). For that matter, it might not even be a dog, and the algorithm doesn't know anything about raccoons, which is what the picture actually shows.

Such is the power of training a classifier: You pose the problem; you offer the examples, with each one carefully marked with the label or class that the algorithm should learn; your computer trains the algorithm for a while; and finally you get a resulting model, which provides you with an answer or probability. (Labeling is a challenging activity in itself, as you discover in Book 9.) In the end, a probability is just an opportunity (or a risk, from another perspective) to propose a solution and get a correct answer. At this point, you may seem to have addressed every issue and believe that the work is finished, but you must still validate the results. This chapter helps you discover why machine learning isn’t just a push-the-button-and-forget-it activity.

Checking Out-of-Sample Errors

When you first receive the data used to train the algorithm, the data is just a data sample. Unless the circumstances are quite rare, the data you receive won’t be all the data that you could possibly get. For instance, if you receive sales data from your marketing department, the data you receive is not all the possible sales data because unless sales are stopped, there will always be new data representing new sales in the future.

If your data is not all the data possible, you must call it a sample. A sample is a selection, and as with all selections, the data could reflect different motivations as to why someone selected it in such a way. Therefore, when you receive data, the first question you have to consider is how someone has selected it. If someone selected it randomly, without any specific criteria, you can expect that, if things do not change from the past, future data won’t differ too much from the data you have at hand.

remember Statistics expects that the future won’t differ too much from the past. Thus you can base future predictions on past data by employing random sampling theory. If you select examples randomly without a criterion, you do have a good chance of choosing a selection of examples that won’t differ much from future examples, or in statistical terms, you can expect that the distribution of your present sample will closely resemble the distribution of future samples.

However, when the sample you receive is somehow special, it could present a problem when training the algorithm. In fact, the special data could force your algorithm to learn a different mapping to the response than the mapping it might have created by using random data. As an example, if you receive sales data from just one shop or only the shops in a single region (which is actually a specific sample), the algorithm may not learn how to forecast the future sales of all the shops in all the regions. The specific sample causes problems because other shops may be different and follow different rules from the ones you’re observing.

remember Ensuring that your algorithm is learning correctly from data is the reason you should always check what the algorithm has learned from in-sample data (the data used for training) by testing your hypothesis on some out-of-sample data. Out-of-sample data is data you didn’t have at learning time, and it should represent the kind of data you need to create forecasts.

Looking for generalization

Generalization is the capability to learn from data at hand the general rules that you can apply to all other data. Out-of-sample data therefore becomes essential to figuring out whether learning from data is possible, and to what extent.

No matter how big your in-sample data set is, selection bias still makes it unlikely that the examples you see frequently and systematically in the sample will appear with the same frequency in reality. Statistics offers a famous anecdote about inferring from biased samples: the 1936 US presidential election between Alfred Landon and Franklin D. Roosevelt, in which the Literary Digest used biased poll information to predict the winner.

At that time, the Literary Digest, a respectable and popular magazine, polled its readers to determine the next president of the United States, a practice that it had performed successfully since 1916. The response of the poll was strikingly in favor of Landon, with more than a 57 percent consensus on the candidate. The magazine also used such a huge sample — more than 10 million people (with only 2.4 million responding) — that the result seemed unassailable: A large sample coupled with a large difference between the winner and the loser tends not to raise many doubts. Yet the poll was completely unsuccessful. In the end, the margin of error was 19 percent, with Landon getting only 38 percent of the vote and Roosevelt getting 62 percent. This margin is the largest error ever for a public opinion poll.

What happened? Well, simply put, the magazine questioned people whose names were pulled from every telephone directory in the United States, as well as from the magazine’s subscription list and from rosters of clubs and associations, gathering more than ten million names. Impressive, but at the end of the Great Depression, having a telephone, subscribing to a magazine, or being part of a club meant that you were rich, so the sample consisted only of affluent voters and completely ignored lower-income voters, who happened to represent the majority (thereby resulting in a selection bias). In addition, the poll suffered from a nonresponse bias because only 2.4 million people responded, and people who respond to polls tend to differ from those who don’t. (You can read more about the faulty Literary Digest poll at www.math.upenn.edu/~deturck/m170/wk4/lecture/case1.html.) The magnitude of the error for this particular incident ushered in a more scientific approach to sampling.

remember Such classical examples of selection bias point out that if the selection process biases a sample, the learning process will have the same bias. However, sometimes bias is unavoidable and difficult to spot. As an example, when you go fishing with a net, you can see only the fish you catch, that is, the ones that didn’t pass through the net’s mesh.

Another example comes from World War II. At that time, designers constantly improved US war planes by adding extra armor plating to the parts that took the most hits upon returning from bombing runs. It took the reasoning of the mathematician Abraham Wald to point out that designers actually needed to reinforce the places that didn’t have bullet holes on returning planes. These locations were likely so critical that a plane hit there didn’t return home, and consequently no one could observe its damage (a kind of survivorship bias where the survivors skew the data). Survivorship bias is still a problem today. In fact, you can read about how this story has shaped the design of Facebook at www.fastcodesign.com/1671172/how-a-story-from-world-war-ii-shapes-facebook-today.

Preliminary reasoning on your data and testing results with out-of-sample examples can help you spot, or at least gain an intuition about, possible sampling problems. However, receiving new out-of-sample data is often difficult and costly, and it requires an investment of time. In the sales example discussed earlier, you might have to wait a long time to test your sales forecasting model, maybe an entire year, in order to find out whether your hypothesis works. In addition, making the data ready for use can consume a great deal of time. For example, when you label photos of dogs and cats, you need to spend time labeling a large number of photos taken from the web or from a database.

A possible shortcut that avoids this additional effort is getting out-of-sample examples from your available data sample. You reserve a part of the data sample based on a separation between training and testing data dictated by time or by random sampling. If time is an important component in your problem (as it is in forecasting sales), you look for a time label to use as a separator. Data before a certain date appears as in-sample data; data after that date appears as out-of-sample data. The same happens when you choose data randomly: What you extract as in-sample data is just for training; what is left is devoted to testing purposes and serves as your out-of-sample data.
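The following Python snippet sketches both separation strategies on a tiny, made-up sales table; the column names (date, units_sold), the synthetic values, and the cutoff date are purely illustrative assumptions, not part of any real data set.

import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
# A made-up year of daily sales, used only to illustrate the two splits
sales = pd.DataFrame({
    'date': pd.date_range('2015-01-01', periods=365),
    'units_sold': rng.poisson(lam=20, size=365)})

# Time-based split: data before the cutoff is in-sample (training),
# data from the cutoff onward is out-of-sample (testing)
cutoff = pd.Timestamp('2015-10-01')
train_time = sales[sales['date'] < cutoff]
test_time = sales[sales['date'] >= cutoff]

# Random split: shuffle the rows and reserve 30 percent for testing
shuffled = sales.sample(frac=1.0, random_state=42)
n_test = int(len(shuffled) * 0.3)
test_random = shuffled.iloc[:n_test]
train_random = shuffled.iloc[n_test:]

print(len(train_time), len(test_time), len(train_random), len(test_random))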

Getting to Know the Limits of Bias

Now that you know more about the in-sample and out-of-sample portions of your data, you also know that learning depends a lot on the in-sample data. This portion of your data is important because you want to discover a point of view of the world, and as with all points of view, it can be wrong, distorted, or just merely partial. You also know that you need an out-of-sample example to check whether the learning process is working. However, these aspects form only part of the picture. When you make a machine learning algorithm work on data in order to guess a certain response, you are effectively taking a gamble, and that gamble is not just because of the sample you use for learning. There’s more. For the moment, imagine that you freely have access to suitable, unbiased, in-sample data, so data is not the problem. Instead you need to concentrate on the method for learning and predicting.

First, you must consider that you’re betting that the algorithm can reasonably guess the response. You can’t always make this assumption because figuring out certain answers isn’t possible no matter what you know in advance. For instance, you can’t fully determine the behavior of human beings by knowing their previous history and behavior. Maybe a random effect is involved in the generative process of our behavior (the irrational part of us, for instance), or maybe the issue comes down to free will (the problem is also a philosophical/religious one, and there are many discordant opinions). Consequently, you can guess only some types of responses, and for many others, such as when you try to predict people’s behavior, you have to accept a certain degree of uncertainty which, with luck, is acceptable for your purposes.

Second, you must consider that you’re betting that the relationship between the information you have and the response you want to predict can be expressed as a mathematical formula of some kind, and that your machine learning algorithm is actually capable of guessing that formula. The capacity of your algorithm to guess the mathematical formula behind a response is intrinsically embedded in the nuts and bolts of the algorithm. Some algorithms can guess almost everything; others actually have a limited set of options. The range of possible mathematical formulations that an algorithm can guess is the set of its possible hypotheses. Consequently, a hypothesis is a single algorithm, specified in all its parameters and therefore capable of a single, specific formulation.

Mathematics is fantastic. It can describe much of the real world by using some simple notation, and it’s the core of machine learning because any learning algorithm has a certain capability to represent a mathematical formulation. Some algorithms, such as linear regression, explicitly use a specific mathematical formulation for representing how a response (for instance, the price of a house) relates to a set of predictive information (such as market information, house location, surface of the estate, and so on).

Some formulations are so complex and intricate that even though representing them on paper is possible, doing so is too difficult in practical terms. Some other sophisticated algorithms, such as decision trees (a topic of Book 9, Chapter 4), don’t have an explicit mathematical formulation, but are so adaptable that they can be set to approximate a large range of formulations easily. As an example, consider a simple and easily explained formulation: Linear regression is just a line in a space of coordinates given by the response and all the predictors. In the easiest case, you can have a response, y, and a single predictor, x, with a formulation of

y = β₀ + β₁x

In a simple situation of a response predicted by a single feature, such a model is perfect when your data arranges itself as a line. However, what happens if it doesn’t and instead shapes itself like a curve? To represent the situation, observe the bidimensional representations shown in Figure 4-1.


FIGURE 4-1: Example of a linear model struggling to map a curve function.

When the points resemble a line or a cloud scattered around a line, fitting a straight line produces some error, so the mapping provided by the preceding formulation is somewhat imprecise. However, the error doesn’t appear systematically but rather randomly, because some points lie above the mapped line and others lie below it. The situation with the curve-shaped cloud of points is different: this time, the line is sometimes exact but systematically wrong elsewhere, with whole stretches of points always above the line and other stretches always below it.

remember Given the simplicity of its mapping of the response, your algorithm tends to systematically overestimate or underestimate the real rules behind the data, representing its bias. The bias is characteristic of simpler algorithms that can’t express complex mathematical formulations.
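As a minimal sketch of this effect, the following snippet fits a straight line to synthetic data generated from a curve; the quadratic data and the chosen ranges are assumptions made only to expose the systematic error.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(scale=0.5, size=100)  # a curve, not a line

line = LinearRegression().fit(x, y)
residuals = y - line.predict(x)

# With a true curve, the errors are systematic rather than random: the line
# underestimates the extremes of the range and overestimates its middle
print("Mean residual at the extremes: %.2f" %
      residuals[np.abs(x.ravel()) > 2].mean())
print("Mean residual in the middle: %.2f" %
      residuals[np.abs(x.ravel()) < 1].mean())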

Keeping Model Complexity in Mind

Just as simplicity of formulations is a problem, automatically resorting to mapping very intricate formulations doesn’t always provide a solution. In fact, you don’t know the true complexity of the required response mapping (such as whether it fits in a straight line or in a curved one). Therefore, just as simplicity may create an unsuitable response (refer to Figure 4-1), it’s also possible to represent the complexity in data with an overly complex mapping. In such cases, the problem with a complex mapping is that it has many terms and parameters — and in some extreme cases, your algorithm may have more parameters than your data has examples. Because you must specify all the parameters, the algorithm then starts memorizing everything in the data — not just the signals but also the random noise, the errors, and all the slightly specific characteristics of your sample.

In some cases, it can even just memorize the examples as they are. However, unless you’re working on a problem with a limited number of simple features with few distinct values (basically a toy data set, that is, a data set with few examples and features, thus simple to deal with and ideal for examples), you’re highly unlikely to encounter the same example twice, given the enormous number of possible combinations of all the available features in the data set.

When memorization happens, you may have the illusion that everything is working well because your machine learning algorithm seems to have fitted the in-sample data so well. Instead, problems can quickly become evident when you start having it work with out-of-sample data and you notice that it produces errors in its predictions as well as errors that actually change a lot when you relearn from the same data with a slightly different approach. Overfitting occurs when your algorithm has learned too much from your data, up to the point of mapping curve shapes and rules that do not exist, as shown in Figure 4-2. Any slight change in the procedure or in the training data produces erratic predictions.


FIGURE 4-2: Example of a linear model going right and becoming too complex while trying to map a curve function.
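The following sketch shows the same effect numerically: a high-degree polynomial expansion drives the in-sample error toward zero while the out-of-sample error usually grows. The synthetic data and the choice of degrees are illustrative assumptions, not a prescription.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(1)
x = rng.uniform(-3, 3, size=30).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(scale=1.0, size=30)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=1)

# Degree 1 underfits (bias), degree 2 matches the true curve, and a very
# high degree memorizes the training points but predicts erratically on
# the held-out points (variance)
for degree in (1, 2, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    print("degree %2d  train MSE %8.2f  test MSE %8.2f" % (
        degree,
        mean_squared_error(y_train, model.predict(x_train)),
        mean_squared_error(y_test, model.predict(x_test))))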

Keeping Solutions Balanced

To create great solutions, machine learning models trade off between simplicity (implying a higher bias) and complexity (generating a higher variance of estimates). If you intend to achieve the best predictive performance, you do need to find a solution in the middle by understanding what works better, which you do by using trial and error on your data. Because data is what dictates the most suitable solution for the prediction problem, you have neither a panacea nor an easy recurrent solution for solving all your machine learning dilemmas.

remember A commonly referred to theorem in the mathematical folklore is the no-free-lunch theorem by David Wolpert and William Macready, which states that “any two optimization algorithms are equivalent when their performance is averaged across all possible problems” (see https://en.wikipedia.org/wiki/No_free_lunch_theorem for details). If the algorithms are equivalent in the abstract, no one is superior to the other unless proved in a specific, practical problem. (See the discussion at www.no-free-lunch.org for more details about no-free-lunch theorems; two of them are actually used for machine learning.)

In particular, in his article “The Lack of A Priori Distinctions Between Learning Algorithms,” Wolpert discussed the fact that there are no a priori distinctions between algorithms, no matter how simple or complex they are (you can obtain the article at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.51.9734). Data dictates what works and how well it works. In the end, you cannot always rely on a single machine learning algorithm, but you have to test many and find the best one for your problem.

Besides being led into machine learning experimentation by the try-everything principle of the no-free-lunch theorem, you have another rule of thumb to consider: Occam’s razor, which is attributed to William of Occam, a scholastic philosopher and theologian who lived in the fourteenth century (see http://math.ucr.edu/home/baez/physics/General/occam.html for details). The Occam’s razor principle states that theories should be cut down to the minimum in order to plausibly represent the truth (hence the razor). The principle doesn’t state that simpler solutions are better but that, between a simple solution and a more complex solution offering the same result, the simpler solution is always preferred. The principle is at the very foundations of our modern scientific methodology, and even Albert Einstein seems to have often referred to it, stating that “everything should be as simple as it can be, but not simpler” (see http://quoteinvestigator.com/2011/05/13/einstein-simple for details). Summarizing the evidence so far:

  • To get the best machine learning solution, try everything you can on your data and represent your data’s performance with learning curves.
  • Start with simpler models, such as linear models, and always prefer a simpler solution when it performs nearly as well as a complex solution. You benefit from the choice when working on out-of-sample data from the real world.
  • Always check the performance of your solution using out-of-sample examples, as discussed in the preceding sections.

Depicting learning curves

To visualize the degree to which a machine learning algorithm is suffering from bias or variance with respect to a data problem, you can take advantage of a chart type named learning curve. Learning curves are displays in which you plot the performance of one or more machine learning algorithms with respect to the quantity of data they use for training. The plotted values are the prediction error measurements, and the metric is measured both as in-sample and cross-validated or out-of-sample performance.

remember If the chart depicts performance with respect to the quantity of data, it’s a learning curve chart. When it depicts performance with respect to different hyper-parameters or a set of learned features picked by the model, it’s a validation curve chart instead. To create a learning curve chart, you must do the following:

  1. Divide your data into in-sample and out-of-sample sets (a train/test split of 70/30 works fine, or you can use cross-validation).
  2. Create portions of your training data of growing size. Depending on the size of the data that you have available for training, you can use 10 percent portions or, if you have a lot of data, grow the number of examples on a power scale such as 10³, 10⁴, 10⁵, and so on.
  3. Train models on the different subsets of the data. Test and record their performance on the same training data and on the out-of-sample set.
  4. Plot the recorded results on two curves, one for the in-sample results and the other for the out-of-sample results (see Figure 4-3). If instead of a train/test split you use cross-validation, you can also draw boundaries expressing the stability of the result across multiple validations (confidence intervals) based on the standard deviation of the results themselves.

FIGURE 4-3: Examples of learning curves affected by bias (left) and variance (right).

Ideally, you should obtain two curves with different starting error points: higher for the out-of-sample curve, lower for the in-sample one. As the size of the training set increases, the gap between the two curves should narrow until, at a certain number of observations, they become close to a common error value.

Noticeably, after you plot your chart, problems arise when

  • The two curves tend to converge, but you can’t see on the chart that they get near each other because you have too few examples. This situation gives you a strong hint to increase the size of your data set if you want to successfully learn with the tested machine learning algorithm.
  • The final convergence point between the two curves has a high error, so consequently your algorithm has too much bias. Adding more examples doesn’t help because the curves have already converged at the amount of data you have. To solve the problem, increase the number of features or use a more complex learning algorithm.
  • The two curves do not tend to converge because the out-of-sample curve starts to behave erratically. Such a situation is clearly a sign of high variance of the estimates, which you can reduce by increasing the number of examples (at a certain number, the out-of-sample error will start to decrease again), reducing the number of features, or, sometimes, just fixing some key parameters of the learning algorithm.

Python provides learning curves as part of the scikit-learn package using the learning_curve function that prepares all the computations for you (see the details at http://scikit-learn.org/stable/modules/generated/sklearn.learning_curve.learning_curve.html).
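The snippet below sketches that workflow, assuming a recent scikit-learn release in which the function lives in the sklearn.model_selection module; the digits data set and the logistic regression classifier are just illustrative choices.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)
# Train on growing portions of the data and score each portion with 5-fold CV
sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5)

plt.plot(sizes, train_scores.mean(axis=1), 'o-', label='in-sample (training)')
plt.plot(sizes, test_scores.mean(axis=1), 'o-', label='cross-validated')
plt.xlabel('Number of training examples')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.show()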

Training, Validating, and Testing

In a perfect world, you could perform a test on data that your machine learning algorithm has never learned from before. However, waiting for fresh data isn’t always feasible in terms of time and costs. As a first simple remedy, you can randomly split your data into training and test sets. The common split is from 25 to 30 percent for testing and the remaining 70 to 75 percent for training. You split your data consisting of your response and features at the same time, keeping correspondence between each response and its features.

The second remedy is needed when you have to tune your learning algorithm. In this case, relying on the test split for tuning isn’t good practice because it causes another kind of overfitting called snooping (see more on this topic later in the chapter). To overcome snooping, you need a third split, called a validation set. A suggested split is to partition your examples into three sets: 70 percent for training, 20 percent for validation, and 10 percent for testing.
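As a minimal sketch, the following snippet obtains such a 70/20/10 partition by calling scikit-learn’s train_test_split twice; the X and y arrays are placeholders for whatever features and response you are working with.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.RandomState(0).rand(1000, 5)   # illustrative features
y = (X[:, 0] + X[:, 1] > 1).astype(int)      # illustrative response

# First reserve 10 percent for the final test set...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42)
# ...then carve the validation set out of what is left: 20 percent of the
# total is 2/9 of the remaining 90 percent
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=2/9, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # roughly 700, 200, 100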

You should perform the split randomly, that is, regardless of the initial ordering of the data. Otherwise, your test won’t be reliable, because ordering could cause overestimation (when there is some meaningful ordering) or underestimation (when distributions differ too much). As a solution, you must ensure that the test set distribution isn’t very different from the training distribution and that no sequential ordering sneaks into the split data. For example, check whether identification numbers, when available, are continuous in your sets; contiguous blocks of identifiers signal a sequential, nonrandom split. Sometimes, even if you strictly abide by random sampling, you can’t obtain similar distributions among sets, especially when your number of examples is small.

tip When your number of examples n is high, such as n>10,000, you can quite confidently create a randomly split data set. When the data set is smaller, comparing basic statistics such as mean, mode, median, and variance across the response and features in the training and test sets will help you understand whether the test set is unsuitable. When you aren’t sure that the split is right, just recalculate a new one.

Resorting to Cross-Validation

A noticeable problem with the train/test set split is that you’re actually introducing bias into your testing because you’re reducing the size of your in-sample training data. When you split your data, you may be actually keeping some useful examples out of training. Moreover, sometimes your data is so complex that a test set, though apparently similar to the training set, is not really similar because combinations of values are different (which is typical of highly dimensional data sets). These issues add to the instability of sampling results when you don’t have many examples. The risk of splitting your data in an unfavorable way also explains why the train/test split isn’t the favored solution by machine learning practitioners when you have to evaluate and tune a machine learning solution.

Cross-validation based on k-folds is actually the answer. It relies on random splitting, but this time it splits your data into a number k of folds (portions of your data) of equal size. Then, each fold is held out in turn as a test set while the remaining folds are used for training, producing one error estimate per fold. The process continues until all k folds have been used once as a test set, leaving you with k error estimates that you can summarize into a mean error estimate (the cross-validation score) and a standard error of the estimates. Figure 4-4 shows how this process works.


FIGURE 4-4: A graphical representation of how cross-validation works.

This procedure provides the following advantages:

  • It works well regardless of the number of examples, because by increasing the number of used folds, you are actually increasing the size of your training set (larger k, larger training set, reduced bias) and decreasing the size of the test set.
  • Differences in distribution for individual folds don’t matter as much. When a fold has a different distribution compared to the others, it’s used just once as a test set and is blended with others as part of the training set during the remaining tests.
  • You are actually testing all the observations, so you are fully testing your machine learning hypothesis using all the data you have.
  • By taking the mean of the results, you can estimate the expected predictive performance. In addition, the standard deviation of the results can tell you how much variation you can expect in real out-of-sample data. Higher variation in the cross-validated performances tells you that the data is extremely varied and that the algorithm is incapable of properly capturing its structure.

remember Using k-fold cross-validation is always the optimal choice unless the data you’re using has some kind of order that matters. For instance, it could involve a time series, such as sales. In that case, you shouldn’t use a random sampling method but instead rely on a train/test split based on the original sequence so that the order is preserved and you can test on the last examples of that ordered series.
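Here is a minimal sketch of the procedure using scikit-learn; the iris data set and the logistic regression classifier are illustrative, and the TimeSeriesSplit mention in the comments is just a pointer for the ordered-data case described above.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold

X, y = load_iris(return_X_y=True)
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)

# The mean is the cross-validation score; the standard deviation hints at how
# much performance may vary on real out-of-sample data. For time-ordered data,
# scikit-learn also offers TimeSeriesSplit instead of randomly shuffled folds.
print("Accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))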

Looking for Alternatives in Validation

You have a few alternatives to cross-validation, all of which are derived from statistics. The first one to consider, but only when your in-sample is made of few examples, is leave-one-out cross-validation (LOOCV). It is analogous to k-fold cross-validation, with the only difference being that k, the number of folds, is exactly n, the number of examples. Therefore, in LOOCV, you build n models (which may turn into a huge number when you have many observations) and test each one on a single out-of-sample observation. Apart from being computationally intensive and requiring that you build many models to test your hypothesis, the problem with LOOCV is that it tends to be pessimistic (making your error estimate higher). It’s also unstable when n is small, and the variance of the error is much higher. All these drawbacks make comparing models difficult.
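For completeness, here is a minimal LOOCV sketch with scikit-learn; the small synthetic regression data set and the ridge regressor are assumptions chosen only to keep the n model fits quick.

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=50, n_features=5, noise=10.0, random_state=0)
# One model per example: 50 fits, each tested on the single held-out case
scores = cross_val_score(Ridge(), X, y, cv=LeaveOneOut(),
                         scoring='neg_mean_absolute_error')
print("LOOCV mean absolute error: %.2f" % -scores.mean())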

Another alternative from statistics is bootstrapping, a method long used to estimate the sampling distribution of statistics, which are presumed not to follow a previously assumed distribution. Bootstrapping works by building a number (the more the better) of samples of size n (the original in-sample size) drawn with repetition. To draw with repetition means that the process could draw an example multiple times to use it as part of the bootstrapping resampling. Bootstrapping has the advantage of offering a simple and effective way to estimate the true error measure. In fact, bootstrapped error measurements usually have much less variance than cross-validation ones. On the other hand, validation becomes more complicated due to the sampling with replacement, so your validation sample comes from the out-of-bootstrap examples. Moreover, using some training samples repeatedly can lead to a certain bias in the models built with bootstrapping.

remember If you use out-of-bootstrap examples for your test, you’ll notice that the test sample can be of various sizes, depending on the number of unique examples drawn into the in-sample, and it likely accounts for about a third of your original in-sample size. This simple Python code snippet demonstrates the effect by randomly simulating a number of bootstraps:

from random import randint
import numpy as np

n = 1000  # number of examples
# your original set of examples
examples = set(range(n))
results = list()
for j in range(10000):
    # your bootstrapped sample, drawn with repetition from range(n)
    chosen = [randint(0, n - 1) for k in range(n)]
    # fraction of original examples not drawn into the bootstrap
    results.append((n - len(set(chosen) & examples)) / float(n))
print("Out-of-bootstrap: %0.1f %%" % (np.mean(results) * 100))

Out-of-bootstrap: 36.8 %

Running the experiment may require some time, and your results may be different due to the random nature of the experiment. However, you should see an output of around 36.8 percent.

Optimizing Cross-Validation Choices

Being able to validate a machine learning hypothesis effectively allows further optimization of your chosen algorithm. As discussed in the previous sections, the algorithm provides most of the predictive performance on your data, given its ability to detect signals from the data and fit the true functional form of the predictive function without overfitting or generating too much variance of the estimates. Not every machine learning algorithm is a good fit for your data, and no single algorithm can suit every problem. It’s up to you to find the right one for a specific problem.

A second source of predictive performance is the data itself when appropriately transformed and selected to enhance the learning capabilities of the chosen algorithm.

The final source of performance derives from fine-tuning the algorithm’s hyper-parameters, which are the parameters that you decide before learning happens and that aren’t learned from data. Their role is in defining a priori a hypothesis, whereas other parameters specify it a posteriori, after the algorithm interacts with the data and, by using an optimization process, finds that certain parameter values work better in obtaining good predictions. Not all machine learning algorithms require much hyper-parameter tuning, but some of the most complex ones do, and though such algorithms still work out of the box, pulling the right levers may make a large difference in the correctness of the predictions. Even when the hyper-parameters aren’t learned from data, you should consider the data you’re working on when deciding hyper-parameters, and you should make the choice based on cross-validation and careful evaluation of possibilities.

remember Complex machine learning algorithms, the ones most exposed to variance of estimates, present many choices expressed in a large number of parameters. Twiddling with them makes them adapt more or less to the data they are learning from. Sometimes too much hyper-parameter twiddling may even make the algorithm detect false signals from the data. That makes hyper-parameters themselves an undetected source of variance if you start manipulating them too much based on some fixed reference like a test set or a repeated cross-validation schema.

tip Python offers slicing functionalities that slice your input matrix into train, test, and validation parts. In particular, for more complex testing procedures, such as cross-validation or bootstrapping, the Scikit-learn package offers an entire module (http://scikit-learn.org/stable/modules/classes.html#module-sklearn.cross_validation). In Book 9, you discover how to apply machine learning to real problems, including some practical examples using both these packages.

Exploring the space of hyper-parameters

The possible combinations of values that hyper-parameters may form make deciding where to look for optimizations hard. As described when discussing gradient descent, an optimization space may contain value combinations that perform better or worse. Even after you find a good combination, you’re not assured that it’s the best option. (This is the problem of getting stuck in local minima when minimizing the error, an issue described in Book 8, Chapter 3 when talking about gradient descent’s problems.)

As a practical way of solving this problem, the best way to verify hyper-parameters for an algorithm applied to specific data is to test them all by cross-validation and to pick the best combination. This simple approach, called grid-search, offers indisputable advantages: it lets you sample the range of possible values to input into the algorithm systematically and spot when the general minimum happens. On the other hand, grid-search has serious drawbacks because it’s computationally intensive (though you can easily perform this task in parallel on modern multicore computers) and quite time-consuming. Moreover, systematic and intensive testing increases the possibility of error, because some good but fake validation results can be caused by noise present in the data set.

Some alternatives to grid-search are available. Instead of testing everything, you can try exploring the space of possible hyper-parameter values guided by computationally heavy and mathematically complex nonlinear optimization techniques (like the Nelder-Mead method), using a Bayesian approach (where the number of tests is minimized by taking advantage of previous results), or using random search.

Surprisingly, random search works incredibly well, is simple to understand, and isn’t just based on blind luck, though it may initially appear to be. In fact, the main point of the technique is that if you pick enough random tests, you have a good chance of spotting the right parameters without wasting energy on testing slightly different combinations that perform similarly.

The graphical representation shown in Figure 4-5 explains why random search works well. A systematic exploration, though useful, tends to test every combination, which turns into a waste of energy if some parameters don’t influence the result. A random search actually tests fewer combinations but more in the range of each hyper-parameter, a strategy that proves winning if, as often happens, certain parameters are more important than others.


FIGURE 4-5: Comparing grid-search to random search.

tip For randomized search to perform well, you should make from 15 to a maximum of 60 tests. It does make sense to resort to random search if a grid-search requires a larger number of experiments.
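The following sketch contrasts the two approaches on scikit-learn’s digits data; the support vector classifier, the parameter ranges, and the budget of 20 random tests are illustrative assumptions, and the continuous distributions require a recent SciPy release that provides loguniform.

from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Grid-search tries every combination on the grid (16 combinations here)
grid = GridSearchCV(SVC(), cv=5,
                    param_grid={'C': [0.1, 1, 10, 100],
                                'gamma': [0.0001, 0.001, 0.01, 0.1]})
grid.fit(X, y)
print("Grid search best:", grid.best_params_, "%.3f" % grid.best_score_)

# Random search samples the same space continuously with a fixed budget of
# 20 tests instead of trying every combination
random_search = RandomizedSearchCV(
    SVC(), cv=5, n_iter=20, random_state=0,
    param_distributions={'C': loguniform(0.1, 100),
                         'gamma': loguniform(0.0001, 0.1)})
random_search.fit(X, y)
print("Random search best:", random_search.best_params_,
      "%.3f" % random_search.best_score_)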

Avoiding Sample Bias and Leakage Traps

On a final note, it’s important to mention a possible remedy to in-sampling bias. In-sampling bias can happen to your data before machine learning is put into action, and it causes high variance of the following estimates. In addition, this section provides a warning about leakage traps that can occur when some information from the out-of-sample passes to in-sample data. This issue can arise when you prepare the data or after your machine learning model is ready and working.

The remedy, which is called ensembling of predictors, works well when your training sample is not completely distorted: its distribution differs from the out-of-sample one, but not in an irremediable way, for example, when all your classes are present but not in the right proportion. In such cases, your results are affected by a certain variance of the estimates that you can possibly stabilize in one of several ways: by resampling, as in bootstrapping; by subsampling (taking a sample of the sample); or by using smaller samples (which increases bias).

To understand how ensembling works so effectively, visualize the image of a bull’s eye. If your sample is affecting the predictions, some predictions will be exact and others will be wrong in a random way. If you change your sample, the right predictions will keep on being right, but the wrong ones will start being variations between different values. Some values will be the exact prediction you are looking for; others will just oscillate around the right one.

By comparing the results, you can guess that what is recurring is the right answer. You can also take an average of the answers and guess that the right answer should be in the middle of the values. With the bull’s-eye game, you can visualize superimposing photos of different games: If the problem is variance, ultimately you will guess that the target is in the most frequently hit area or at least at the center of all the shots.

In most cases, such an approach proves to be correct and improves your machine learning predictions a lot. When your problem is bias and not variance, using ensembling really doesn’t cause harm unless you subsample too few samples. A good rule of thumb for subsampling is to take a sample of 70 to 90 percent of the original in-sample data.

tip If you want to make ensembling work, you should do the following:

  1. Iterate a large number of times through your data and models (from a minimum of three iterations to, ideally, hundreds of them).
  2. Every time you iterate, subsample (or else bootstrap) your in-sample data.
  3. Use machine learning for the model on the resampled data, and predict the out-of-sample results. Store those results away for later use.
  4. At the end of the iterations, for every out-of-sample case you want to predict, take all its predictions and average them if you are doing a regression. Take the most frequent class if you are doing a classification.
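Here is a minimal sketch of those four steps for a regression problem; the synthetic data, the decision tree regressor, the 100 iterations, and the 80 percent subsample size are all illustrative assumptions.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=0)
X_in, X_out, y_in, y_out = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.RandomState(0)
predictions = []
for i in range(100):                                 # step 1: iterate many times
    # step 2: subsample about 80 percent of the in-sample data
    idx = rng.choice(len(X_in), size=int(len(X_in) * 0.8), replace=False)
    model = DecisionTreeRegressor().fit(X_in[idx], y_in[idx])   # step 3
    predictions.append(model.predict(X_out))         # store for later use

# step 4: average the stored predictions for each out-of-sample case
# (for classification, you would take the most frequent class instead)
ensemble = np.mean(predictions, axis=0)
single = DecisionTreeRegressor(random_state=0).fit(X_in, y_in).predict(X_out)
print("Single tree MSE: %.1f" % mean_squared_error(y_out, single))
print("Ensembled MSE:   %.1f" % mean_squared_error(y_out, ensemble))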

Watching out for snooping

Leakage traps can surprise you because they can prove to be an unknown and undetected source of problems with your machine learning processes. The problem is snooping, or otherwise observing the out-of-sample data too much and adapting to it too often. In short, snooping is a kind of overfitting — and not just on the training data but also on the test data, making the overfitting problem itself harder to detect until you get fresh data. Usually you realize that the problem is snooping when you already have applied the machine learning algorithm to your business or to a service for the public, making the problem an issue that everyone can see.

You can avoid snooping in two ways. First, when operating on the data, take care to neatly separate the training, validation, and test data. Also, when processing, never take any information from the validation or test sets, not even the most simple and innocent-looking examples. Worse still is to apply a complex transformation using all the data. In finance, for instance, it is well known that calculating the mean and the standard deviation (which can actually tell you a lot about market conditions and risk) from all training and testing data can leak precious information about your models. When such leakage happens, machine learning algorithms perform well on the test set but poorly on fresh out-of-sample data from the markets, which means that they don’t work at all in practice.
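A minimal sketch of the safe practice follows, assuming scikit-learn: the leaky version fits the scaler on all the data before cross-validation, whereas the pipeline version refits the scaler on each training fold only, so no information from the test folds seeps into training.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Leaky approach (don't do this): scaling all the data first lets the global
# mean and variance carry information from the test folds into training
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=10)

# Safe approach: the pipeline refits the scaler on each training fold only
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
safe_scores = cross_val_score(pipeline, X, y, cv=10)

print("Leaky: %.3f  Safe: %.3f" % (leaky_scores.mean(), safe_scores.mean()))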

Second, be careful when checking the performance of your out-of-sample examples. You may be tempted to bring back some information from your snooping on the test results to help you decide that certain parameters are better than others, or to choose one machine learning algorithm instead of another. For every model or parameter, apply your choice based on cross-validation results or on the validation sample. Never act on takeaways from your out-of-sample data, or you’ll regret it later.
