Introduction

The following is Wikipedia's definition of supervised learning:

"Supervised learning is the machine learning task of inferring a function from labeled training data."

There are two types of supervised learning algorithms:

  • Regression: This predicts a continuous-valued output, such as a house price.
  • Classification: This predicts a discrete-valued output (0 or 1) called a label, such as whether an e-mail is spam or not. Classification is not limited to two values (binomial); it can have multiple values (multinomial), such as marking an e-mail as important, unimportant, urgent, and so on (0, 1, 2, ...).
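
The following is a minimal sketch of the two kinds of output in Python with scikit-learn (both the library and the toy numbers are assumptions used purely for illustration):

    from sklearn.linear_model import LinearRegression, LogisticRegression

    # Regression: the label is continuous (a sale price in dollars)
    X_houses = [[1500], [2000], [2500]]   # feature: square footage
    y_prices = [600000, 750000, 900000]   # continuous labels
    reg = LinearRegression().fit(X_houses, y_prices)
    print(reg.predict([[1800]]))          # predicts a continuous value

    # Classification: the label is discrete (spam = 1, not spam = 0)
    X_mails = [[0], [3], [1], [8]]        # feature: count of suspicious words
    y_spam = [0, 1, 0, 1]                 # discrete labels
    clf = LogisticRegression().fit(X_mails, y_spam)
    print(clf.predict([[5]]))             # predicts 0 or 1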

We are going to cover regression in this chapter and classification in the next.

We will use recently sold house data from the City of Saratoga, CA, as an example to illustrate the steps of supervised learning in the case of regression:

  1. Get the labeled data:
    • How labeled data is gathered differs by use case. For example, to convert paper documents into a digital format, the documents can be sent to Amazon Mechanical Turk for labeling.
    • The size of the labeled dataset needs to be sufficiently larger than the number of features in the feature vector; if it is small relative to the number of features, the model can overfit.
  2. Split the labeled data into two parts:
    • Randomly split the data based on a certain ratio, for example, 70:30.
    • This split needs to be done randomly every time to avoid bias.
    • The first set is called the training dataset, which will be used to train the model.
    • The second set is called the test dataset or validation set, which will be used to measure the accuracy of the model.
    • Sometimes, the data is divided into three sets: training, cross-validation, and test. The cross-validation set is used for decisions such as hyperparameter tuning, while the test dataset is used only for measuring accuracy and is kept out of training entirely.
  3. Train the algorithm with the training dataset. Once an algorithm is trained, it is called a model:
    • Model training/creation also involves tuning other parameters called hyperparameters. One easy way to understand hyperparameters is to think of them as configuration parameters. Traditionally, hyperparameters were set by hand (trial and error), but nowadays there are whole sets of algorithms and methodologies designed specifically for hyperparameter tuning (a small tuning sketch follows this list).
  4. Use the test dataset to ask the trained model a different set of questions, that is, to check how well it predicts on examples it has never seen. A minimal end-to-end sketch of steps 2 to 4 follows this list.
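
The following sketch walks through steps 2 to 4 in Python with scikit-learn (the library, the file name saratoga_houses.csv, and its column names are all assumptions for illustration, not this chapter's actual recipe):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score

    # Step 1 output: labeled data (hypothetical file and columns)
    data = pd.read_csv("saratoga_houses.csv")
    X = data[["sqft", "bedrooms", "age"]]   # feature vectors
    y = data["price"]                       # labels: sale prices

    # Step 2: random 70:30 split (random_state is fixed here only for
    # reproducibility; drop it to get a fresh random split each run)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    # Step 3: train the algorithm; the fitted object is the model
    model = LinearRegression().fit(X_train, y_train)

    # Step 4: question the model with data it has never seen
    predictions = model.predict(X_test)
    print("R^2 on the test set:", r2_score(y_test, predictions))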
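
Step 3 mentioned automated hyperparameter tuning. As one example of such methodologies, the following sketch uses scikit-learn's grid search over the regularization strength of ridge regression (the grid values are arbitrary, and X_train/y_train come from the sketch above):

    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV

    # alpha is a hyperparameter: a configuration value for the
    # algorithm, not something learned from the training data
    param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
    search = GridSearchCV(Ridge(), param_grid, cv=5)
    search.fit(X_train, y_train)
    print("best alpha:", search.best_params_)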

The following figure shows a model getting trained by a training dataset:

[Figure: a training dataset fed to a learning algorithm, producing a hypothesis]

Hypothesis may sound like a misnomer for what this function actually does, and you may think that prediction function would be a better name, but the word hypothesis is used for historical reasons.
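
To make this concrete, for a single feature x the hypothesis in linear regression is commonly written as h(x) = θ₀ + θ₁x, where the parameters θ₀ and θ₁ are the values learned during training; predicting a house price then amounts to evaluating h at the new house's x.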

If we use only one feature to predict the outcome, it is called bivariate analysis. When we have multiple features, it is called multivariate analysis. In fact, we can have as many features as we like; one algorithm, support vector machines (SVM), which we will cover in the next chapter, effectively allows an infinite number of features via the kernel trick.

This chapter covers how to do supervised learning.

Mathematical explanations have been provided in as simple a way as possible, but feel free to skip the math and go directly to the How to do it... section.