Creating sample data for toy analysis

I will again implore you to use some of your own data while working through this book, but in the event that you cannot, we'll learn how to use scikit-learn to create toy data.

Getting ready

As with getting the built-in datasets and fetching the new datasets, the functions that create sample datasets follow the naming convention make_<dataset name>. Just to be clear, this data is purely artificial:

>>> datasets.make_*?
datasets.make_biclusters
datasets.make_blobs
datasets.make_checkerboard
datasets.make_circles
datasets.make_classification
...

To save typing, import the datasets module as d and NumPy as np:

>>> import sklearn.datasets as d
>>> import numpy as np

How to do it...

This section will walk you through the creation of several datasets; the following How it works... section will confirm the purported characteristics of the datasets. In addition to the sample datasets, these functions will be used throughout the book to create data with the characteristics needed by the algorithms on display.

First, the stalwart—regression:

>>> reg_data = d.make_regression()

By default, this will generate a tuple whose first member is a 100 x 100 matrix, that is, 100 samples by 100 features. However, only 10 of those features are, by default, responsible for generating the target. The second member of the tuple is the target variable.
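We can confirm those shapes directly; the second element of the tuple is the 100-element target vector:

>>> reg_data[0].shape
(100, 100)
>>> reg_data[1].shape
(100,)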

It is also possible to get more involved. For example, to generate a 1000 x 10 matrix with five features responsible for the target creation, an underlying bias factor of 1.0, and 2 targets, run the following command:

>>> complex_reg_data = d.make_regression(1000, 10, 5, 2, 1.0)
>>> complex_reg_data[0].shape
(1000, 10)
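The positional form mirrors the order of the function's parameters, but the same call is easier to read with keyword arguments, and recent scikit-learn releases may require most of these arguments to be passed by keyword:

>>> complex_reg_data = d.make_regression(n_samples=1000, n_features=10, 
    n_informative=5, n_targets=2, bias=1.0)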

Classification datasets are also very simple to create. The basic case takes a single line, but that basic case is rarely experienced in practice: most users don't convert, most transactions aren't fraudulent, and so on. Therefore, it's useful to explore classification on unbalanced datasets:

>>> classification_set = d.make_classification(weights=[0.1])
>>> np.bincount(classification_set[1])
array([10, 90])
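The same idea scales up to larger samples. The class proportions are governed by the weights argument, although the flip_y label-noise parameter (1 percent by default) can shift the exact counts by a few samples. The sample size used here is just an illustration:

>>> bigger_set = d.make_classification(n_samples=1000, weights=[0.05, 0.95])
>>> np.bincount(bigger_set[1])   # roughly 5% versus 95%; exact counts will vary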

Clusters will also be covered. There are several functions that create datasets suitable for different clustering algorithms. For example, blobs are very easy to create and can be modeled by K-Means:

>>> blobs = d.make_blobs()

Plotted, the blobs appear as a few well-separated clusters of points in two dimensions.
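A quick way to see the blobs yourself, assuming matplotlib is available (this plotting snippet is not part of the original recipe):

>>> import matplotlib.pyplot as plt
>>> f, ax = plt.subplots(figsize=(7, 5))
>>> ax.scatter(blobs[0][:, 0], blobs[0][:, 1], c=blobs[1])
>>> plt.show()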

How it works...

Let's walk through how scikit-learn produces the regression dataset by taking a look at the source code (with some modifications for clarity). Any undefined variables are assumed to have their default values from make_regression.

It's actually surprisingly simple to follow.

First, a random array is generated with the size specified when the function is called:

>>> X = np.random.randn(n_samples, n_features)

Given the basic dataset, the target dataset is then generated:

>>> ground_truth = np.zeros((n_features, n_targets))
>>> ground_truth[:n_informative, :] = 100*np.random.rand(n_informative, 
    n_targets)

The dot product of X and ground_truth is taken to get the final target values. The bias, if any, is added at this time:

>>> y = np.dot(X, ground_truth) + bias

Note

The dot product is simply a matrix multiplication. So, our final target array will have n_samples rows, which is the number of samples in the dataset, and n_targets columns, which is the number of target variables.

Due to NumPy's broadcasting, bias can be a scalar value, and this value will be added to every sample.
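A tiny example makes both points concrete; the array sizes here are made up purely for illustration:

>>> X_small = np.random.randn(3, 2)          # 3 samples, 2 features
>>> truth_small = np.random.rand(2, 1)       # 2 features, 1 target
>>> (np.dot(X_small, truth_small) + 1.0).shape   # the scalar bias is broadcast to every row
(3, 1)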

Finally, it's a simple matter of adding any noise and shuffling the dataset. Voilà, we have a dataset that's perfect for testing regression.
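Putting the pieces together, a simplified sketch of the generator might look like the following. This is not scikit-learn's actual implementation; the real make_regression also handles details such as the effective rank, returning the coefficients, and a seeded random state:

def simple_make_regression(n_samples=100, n_features=100, n_informative=10,
                           n_targets=1, bias=0.0, noise=0.0, shuffle=True):
    # random design matrix: one row per sample, one column per feature
    X = np.random.randn(n_samples, n_features)

    # only the first n_informative features contribute to the target
    ground_truth = np.zeros((n_features, n_targets))
    ground_truth[:n_informative, :] = 100 * np.random.rand(n_informative, n_targets)

    # linear combination of the informative features, plus the bias
    y = np.dot(X, ground_truth) + bias

    # optional Gaussian noise on the targets
    if noise > 0.0:
        y += np.random.normal(scale=noise, size=y.shape)

    # shuffle both the samples and the feature columns so the informative
    # features do not always sit in the first n_informative positions
    if shuffle:
        row_order = np.random.permutation(n_samples)
        col_order = np.random.permutation(n_features)
        X, y = X[row_order][:, col_order], y[row_order]

    return X, np.squeeze(y)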
