Chapter 6. Machine Learning 1

In this chapter, we will cover the following topics:

  • Preparing data for model building
  • Finding the nearest neighbors
  • Classifying documents using Naïve Bayes
  • Building decision trees to solve multiclass problems

Introduction

In this chapter, we will look at supervised learning techniques. In the previous chapter, we covered unsupervised techniques, including clustering and learning vector quantization. We will start with classification problems in this chapter and proceed to regression in the next chapter. The input for a classification problem is a set of records or instances.

Each record or instance can be written as a tuple (X, y), where X is a set of attributes and y is the corresponding class label.

Learning a target function, F, that maps each record's attribute set, X, to one of the predefined class labels, y, is the job of a classification algorithm.
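To make this concrete, the following is a minimal sketch in Python of a single record; the attribute names and values are hypothetical, loosely modeled on the Iris dataset:

# A single record (X, y): X is the attribute set, y is the class label.
# The attribute names and values here are hypothetical.
X = {'sepal_length': 5.1, 'sepal_width': 3.5}  # attribute set
y = 'setosa'                                   # class label
record = (X, y)

# A classification algorithm learns a target function F
# such that F(X) predicts y for unseen records.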

The general steps for a classification algorithm are as follows:

  1. Find an appropriate algorithm
  2. Learn a model using a training set, and validate the model using a test set
  3. Apply the model to predict any unseen instance or record

The first step is to identify the right classification algorithm. There is no prescribed way of choosing the right algorithm; the choice comes from repeated trial and error. After choosing the algorithm, training and test sets are created. The training set is provided to the algorithm to learn a model, that is, a target function, F, as defined previously. After creating the model using the training set, the test set is used to validate the model. Usually, we use a confusion matrix to validate the model. We will discuss confusion matrices in more detail in our recipe, Finding the nearest neighbors.
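As a preview of these steps, here is a minimal sketch using scikit-learn; the Iris dataset and the K-Nearest Neighbor classifier are stand-ins for the data and algorithms covered in the recipes, and it assumes a recent scikit-learn release where train_test_split lives in sklearn.model_selection:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# Load a sample multiclass dataset.
data = load_iris()
X, y = data.data, data.target

# Step 2: split the records into training and test sets,
# then learn a model from the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Step 3: apply the model to unseen records and validate it
# with a confusion matrix.
predictions = model.predict(X_test)
print(confusion_matrix(y_test, predictions))

Each row of the confusion matrix counts the instances of a true class and each column counts the predictions, so the diagonal holds the correctly classified instances.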

We will begin with a recipe that shows how to divide our input dataset into training and test sets. We will follow this with a lazy learner algorithm for classification, called K-Nearest Neighbor. We will then look at the Naïve Bayes classifier. Finally, we will venture into a recipe that deals with multiclass problems using decision trees. Our choice of algorithms in this chapter is not random: all three algorithms that we cover are capable of handling multiclass problems, in addition to binary problems. In multiclass problems, there are more than two class labels to which the instances can belong.
