In this chapter, we will look at supervised learning techniques. In the previous chapter, we covered unsupervised techniques, including clustering and learning vector quantization. We will start with a classification problem and then proceed to regression. The input to a classification problem is a set of records or instances.
Each record or instance can be written as a tuple (X, y), where X is the set of attributes and y is the corresponding class label.
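As a toy illustration of this representation (the attribute names below are made up for the sketch and are not from the book's datasets), each record pairs an attribute set X with a label y:

```python
# Each record is a pair (X, y): X holds the attributes, y is the class label.
# The attribute names here are illustrative, not from any real dataset.
records = [
    ({"height_cm": 180.0, "weight_kg": 80.0}, "adult"),
    ({"height_cm": 120.0, "weight_kg": 30.0}, "child"),
]

for X, y in records:
    print(sorted(X), "->", y)
```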
The job of a classification algorithm is to learn a target function, F, that maps each record's attribute set, X, to one of the predefined class labels, y.
The general steps for a classification algorithm are as follows:
1. Identify the right classification algorithm. There is no prescribed way of choosing it; the choice comes from repeated trial and error.
2. Create a training set and a test set from the input data.
3. Provide the training set to the algorithm to learn a model, that is, a target function F, as defined previously.
4. Use the test set to validate the model, usually with a confusion matrix.

We will discuss confusion matrices in more detail in the recipe Finding the nearest neighbors.
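The steps above can be sketched end to end. This is a minimal illustration, assuming scikit-learn is available; the iris dataset and K-Nearest Neighbor stand in for whichever data and algorithm you pick in step 1:

```python
# A minimal sketch of the general workflow, assuming scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)

# Step 2: split the records into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Step 3: learn a model from the training set, i.e., the target function F.
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Step 4: validate the model on the test set with a confusion matrix;
# rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_test, model.predict(X_test))
print(cm)
```

Each off-diagonal entry of the matrix counts test records of one class that the model assigned to another, which is why the confusion matrix is the usual validation tool here.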
We will begin with a recipe that shows how to divide our input dataset into training and test sets. We will follow this with a lazy learner algorithm for classification called K-Nearest Neighbor. We will then look at Naïve Bayes classifiers, and finally venture into a recipe that tackles multiclass problems using decision trees. Our choice of algorithms in this chapter is not random: all three algorithms are capable of handling multiclass problems in addition to binary problems. In a multiclass problem, the instances can belong to more than two class labels.
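The multiclass claim is easy to check in code. This hedged sketch, again assuming scikit-learn, fits each of the three classifier families from this chapter on the three-class iris problem without any special handling:

```python
# A sketch, assuming scikit-learn: all three classifier families covered
# in this chapter accept a three-class problem directly.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # y takes three class labels: 0, 1, 2

classifiers = (
    KNeighborsClassifier(),
    GaussianNB(),
    DecisionTreeClassifier(random_state=0),
)
for clf in classifiers:
    clf.fit(X, y)
    print(type(clf).__name__, "learned classes:", list(clf.classes_))
```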