Chapter 1. Machine Learning – A Gentle Introduction

"I was into data before it was big"—@ml_hipster

You have probably heard recently about big data. The Internet, the explosion of electronic devices with tremendous computational power, and the fact that almost every process in our world uses some kind of software, are giving us huge amounts of data every minute.

Think about social networks, where we store information about people, their interests, and their interactions. Think about process-control devices, ranging from web servers to cars and pacemakers, which permanently leave logs of data about their performance. Think about scientific research initiatives, such as the genome project, which have to analyze huge amounts of data about our DNA.

There are many things you can do with this data: examine it, summarize it, and even visualize it in several beautiful ways. However, this book deals with another use for data: as a source of experience to improve our algorithms' performance. These algorithms, which can learn from previous data, belong to the field of Machine Learning, a subfield of Artificial Intelligence.

Any machine learning problem can be represented with the following three concepts:

  • We will have to learn to solve a task T. For example, build a spam filter that learns to classify e-mails as spam or ham.
  • We will need some experience E to learn to perform the task. Usually, experience is represented through a dataset. For the spam filter, experience comes as a set of e-mails, manually classified by a human as spam or ham.
  • We will need a measure of performance P to know how well we are solving the task and also to know whether after doing some modifications, our results are improving or getting worse. The percentage of e-mails that our spam filtering is correctly classifying as spam or ham could be P for our spam-filtering task.
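For the spam-filtering task, the performance measure P can be sketched very simply: compare predicted labels against the true labels and compute the fraction that match. The e-mails and labels below are made up purely for illustration; they are not part of any real dataset.

```python
# Hypothetical true labels (assigned by a human) and predictions (from our filter)
true_labels = ['spam', 'ham', 'spam', 'ham', 'ham']
predicted = ['spam', 'ham', 'ham', 'ham', 'ham']

# P: the percentage of e-mails classified correctly
correct = sum(1 for t, p in zip(true_labels, predicted) if t == p)
accuracy = correct / float(len(true_labels))
print(accuracy)  # 4 of the 5 predictions match, so this prints 0.8
```

After a modification to the filter, recomputing P on the same e-mails tells us whether our results are improving or getting worse.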

Scikit-learn is an open source Python library of popular machine learning algorithms that will allow us to build these types of systems. The project was started in 2007 as a Google Summer of Code project by David Cournapeau. Later that year, Matthieu Brucher started working on this project as part of his thesis. In 2010, Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, and Vincent Michel of INRIA took the project leadership and produced the first public release. Nowadays, the project is being developed very actively by an enthusiastic community of contributors. It is built upon NumPy (http://www.numpy.org/) and SciPy (http://scipy.org/), the standard Python libraries for scientific computation. Throughout this book, we will use it to show you how incorporating previous data as a source of experience can help solve several common programming tasks in a more efficient, and probably more effective, way.

In the following sections of this chapter, we will first see how to install scikit-learn and prepare your working environment. After that, we will give a brief, practical introduction to machine learning, presenting key machine learning concepts while solving a simple task.

Installing scikit-learn

Installation instructions for scikit-learn are available at http://scikit-learn.org/stable/install.html. Several examples in this book include visualizations, so you should also install the matplotlib package from http://matplotlib.org/. We also recommend installing IPython Notebook, a very useful tool that includes a web-based console to edit and run code snippets, and render the results. The source code that comes with this book is provided through IPython notebooks.

An easy way to install all packages is to download and install the Anaconda distribution for scientific computing from https://store.continuum.io/, which provides all the necessary packages for Linux, Mac, and Windows platforms. Or, if you prefer, the following sections give some suggestions on how to install every package on each particular platform.

Linux

Probably the easiest way to set up our environment is through the operating system's packages. In the case of Debian-based operating systems, such as Ubuntu, you can install the packages by running the following commands:

  • First, to install the base packages and build dependencies, enter the following command:
    sudo apt-get install build-essential python-dev python-numpy python-setuptools python-scipy libatlas-dev python-pip
    
  • Then, to install matplotlib, run the following command:
    sudo apt-get install python-matplotlib
    
  • After that, we should be ready to install scikit-learn by issuing this command:
    sudo pip install scikit-learn
    
  • To install IPython Notebook, run the following command:
    sudo apt-get install ipython-notebook
    
  • If you want to install from source, let's say to install all the libraries within a virtual environment, you should issue the following commands:
    pip install numpy
    pip install scipy
    pip install scikit-learn
    
  • To install matplotlib from source, you should first install its system-level build dependencies with apt-get, and then install the package with pip:
    sudo apt-get install libpng-dev libjpeg8-dev libfreetype6-dev
    pip install matplotlib
    
  • To install IPython Notebook and its dependencies, you should run the following commands:
    pip install ipython
    pip install tornado
    pip install pyzmq
    

Mac

You can similarly use tools such as MacPorts and Homebrew, which contain precompiled versions of these packages.

Windows

To install scikit-learn on Windows, you can download a Windows installer from the downloads section of the project web page: http://sourceforge.net/projects/scikit-learn/files/

Checking your installation

To check that everything is ready to run, just open your Python (or probably better, IPython) console and type the following:

>>> import sklearn as sk
>>> import numpy as np
>>> import matplotlib.pyplot as plt

We have decided to precede Python code with >>> to separate the code from its results. Python will silently import the scikit-learn, NumPy, and matplotlib packages, which we will use throughout the rest of this book's examples.
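If the imports succeed but you want to confirm exactly which versions were picked up, each of these packages exposes a standard __version__ attribute. The version numbers printed will, of course, vary with your installation:

```python
# Print the installed versions of the three packages used in this book
import sklearn
import numpy
import matplotlib

print(sklearn.__version__)
print(numpy.__version__)
print(matplotlib.__version__)
```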

If you want to execute the code presented in this book, you should run IPython Notebook:

ipython notebook

This will allow you to open the corresponding notebooks right in your browser.

Datasets

As we have said, machine learning methods rely on previous experience, usually represented by a dataset. Every method implemented in scikit-learn assumes that data comes in a dataset, a certain form of input data representation that makes it easier for the programmer to try different methods on the same data. Scikit-learn includes a few well-known datasets. In this chapter, we will use one of them, the Iris flower dataset, introduced in 1936 by Sir Ronald Fisher to show how a statistical method (discriminant analysis) worked (yes, they were into data before it was big).

You can find a description of this dataset on its own Wikipedia page but, essentially, it includes information about 150 elements (or, in machine learning terminology, instances) from three different Iris flower species, including sepal and petal length and width. The natural task to solve using this dataset is to learn to guess the Iris species knowing the sepal and petal measures. It has been widely used in machine learning because it is a very easy dataset, in a sense that we will see later. Let's import the dataset and show the values for the first instance:

>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> X_iris, y_iris = iris.data, iris.target
>>> print(X_iris.shape, y_iris.shape)
  (150, 4) (150,)
>>> print(X_iris[0], y_iris[0])
  [5.1 3.5 1.4 0.2] 0

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

We can see that the iris dataset is an object (similar to a dictionary) that has two main components:

  • A data array, where, for each instance, we have the real values for sepal length, sepal width, petal length, and petal width, in that order (note that for efficiency reasons, scikit-learn methods work on NumPy ndarrays instead of the more descriptive but much less efficient Python dictionaries or lists). The shape of this array is (150, 4), meaning that we have 150 rows (one for each instance) and four columns (one for each feature).
  • A target array, with values in the range of 0 to 2, corresponding to the Iris species of each instance (0: setosa, 1: versicolor, and 2: virginica), as you can verify by printing the iris.target_names value.

While it's not necessary for every dataset we want to use with scikit-learn to have this exact structure, we will see that every method requires a data array, where each instance is represented as a list of features or attributes, and a target array holding the value we want our learning method to learn to predict. In our example, the petal and sepal measures are our real-valued attributes, while the flower species is the class, one of three discrete values, that we want to predict.
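The verification mentioned above is straightforward: iris.target_names maps the integer target values back to the species names, and iris.feature_names documents what each of the four columns in the data array means.

```python
from sklearn import datasets

iris = datasets.load_iris()

# The species name corresponding to each target value (0, 1, and 2)
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']

# The meaning of each of the four columns in the data array
print(iris.feature_names)

# Map the first instance's target value back to its species name
print(iris.target_names[iris.target[0]])  # setosa
```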
