Preparing data for model building

In this recipe, we will look at how to create train and test datasets from a given dataset for a classification problem. A test dataset is never shown to the model. In real-world scenarios, we typically build another dataset called the dev set. Dev stands for development dataset: a dataset that we can use to continuously tune our model over successive runs. The model is trained using the train set, and model performance metrics such as accuracy are measured on the dev set. Based on these results, the model is further tuned if improvements are required. In later chapters, we will cover recipes that perform more sophisticated data splitting than a simple train/test split.

Getting ready

We will use the Iris dataset for this recipe. It is easy to demonstrate the concept with this dataset, as we are already familiar with it from many of our previous recipes.

How to do it…

# Load the necessary libraries
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.datasets import load_iris
import numpy as np

def get_iris_data():
    """
    Returns Iris dataset
    """
    # Load iris dataset
    data = load_iris()
    
    # Extract the dependent and independent variables
    # y is our class label
    # x is our instances/records
    x    = data['data']
    y    = data['target']
    
    # For ease we merge them
    # column merge
    input_dataset = np.column_stack([x,y])

    # Let us shuffle the dataset
    # We want records distributed randomly
    # between our test and train set
    
    np.random.shuffle(input_dataset)

    return input_dataset

# We need an 80/20 split:
# 80% of our records for the training set,
# the remaining 20% for our test set
train_size = 0.8
test_size  = 1-train_size

# get the data
input_dataset = get_iris_data()
# Split the data
train,test = train_test_split(input_dataset,test_size=test_size)

# Print the size of original dataset
print "Dataset size ",input_dataset.shape
# Print the train/test split
print "Train size ",train.shape
print "Test  size",test.shape

This was pretty simple. Let's see whether the class labels are proportionately distributed between the training and the test sets; an uneven split would leave us with a typical class imbalance problem:

def get_class_distribution(y):
    """
    Given an array of class labels,
    return the class distribution
    """
    distribution = {}
    set_y = set(y)
    for y_label in set_y:
        no_elements = len(np.where(y == y_label)[0])
        distribution[y_label] = no_elements
    total = 1.0 * sum(distribution.values())
    dist_percentage = {class_label: count / total
                       for class_label, count in distribution.items()}
    return dist_percentage

def print_class_label_split(train, test):
    """
    Print the class distribution
    in the train and test datasets
    """
    y_train = train[:, -1]

    train_distribution = get_class_distribution(y_train)
    print "\nTrain data set class label distribution"
    print "=========================================\n"
    for k, v in train_distribution.items():
        print "Class label =%d, percentage records =%.2f" % (k, v)

    y_test = test[:, -1]

    test_distribution = get_class_distribution(y_test)

    print "\nTest data set class label distribution"
    print "=========================================\n"
    for k, v in test_distribution.items():
        print "Class label =%d, percentage records =%.2f" % (k, v)

print_class_label_split(train,test)

Let's see how we distribute the class labels uniformly between the train and the test sets:

# Perform a stratified split of the data
stratified_split = StratifiedShuffleSplit(input_dataset[:,-1],test_size=test_size,n_iter=1)

for train_indx,test_indx in stratified_split:
    train = input_dataset[train_indx]
    test =  input_dataset[test_indx]
    print_class_label_split(train,test)

How it works…

After importing the necessary library modules, we write a convenience function, get_iris_data(), which returns the Iris dataset. We column concatenate the x and y arrays into a single array called input_dataset, and then shuffle the dataset so that the records are distributed randomly between the test and the train datasets. The function returns a single array containing both the instances and the class labels.

We want to include 80 percent of the records in our training dataset and use the remaining 20 percent as our test dataset. The train_size and test_size variables hold the fractions of records that should go into the training and the test datasets, respectively.

We call the get_iris_data() function to get the input data, and then leverage the train_test_split function from scikit-learn's cross_validation module to split the input dataset into two.

Finally, we print the size of the original dataset, followed by the sizes of the train and the test datasets:

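With the Iris data, the printed sizes should look like the following; the row order changes with the shuffle, but the counts are fixed by the 80/20 split:

Dataset size  (150, 5)
Train size  (120, 5)
Test  size (30, 5)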

Our original dataset has 150 rows and five columns. Remember that there are only four attributes; the fifth column is the class label, as we had column concatenated x and y.

As you can see, 80 percent of the 150 rows, that is, 120 records, have been assigned to our training set. We have shown how we can easily split our input data into the train and the test sets.

Remember that this is a classification problem. The algorithm should be trained to predict the correct class label for a given unknown instance or record. For this, we need to provide the algorithm with an equal distribution of all the classes during training. The Iris dataset is a three-class problem, so we should have equal representation from all three classes. Let's see if our method has taken care of this.

We define a function called get_class_distribution, which takes a single parameter, y, an array of class labels. This function returns a dictionary, where the key is the class label and the value is the percentage of records belonging to that class. Thus, this dictionary gives us the distribution of the class labels. We call this function from the following function in order to find out the class distribution in the train and the test datasets.
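As a quick illustration (a toy example, not part of the original recipe), calling the function on a small hand-made label array behaves as follows:

# Toy sanity check for get_class_distribution
toy_labels = np.array([0, 0, 1, 2, 2, 2])
print get_class_distribution(toy_labels)
# Approximately {0: 0.33, 1: 0.17, 2: 0.50}; key order may vary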

The print_class_label_split function is self-explanatory. We pass the train and the test datasets as arguments. As we have concatenated our x and y, the last column is our class label. We extract the train and test class labels into y_train and y_test, pass them to get_class_distribution to get dictionaries of the class labels and their distributions, and finally print them.

Finally, we invoke print_class_label_split, and our output should look as follows:

(The output lists the percentage of records per class label for the train and the test sets.)

Let's now examine the output. As you can see, our training set has a different distribution of the class labels compared with the test set: exactly 40 percent of the instances in the test set belong to class label 1. This is not the right way to do the split; we should have an equal distribution in both the training and the test datasets.

In the final piece of code, we leverage StratifiedShuffleSplit from scikit-learn in order to achieve equal class distribution in the training and the test sets. Let's examine the parameters of StratifiedShuffleSplit:

stratified_split = StratifiedShuffleSplit(input_dataset[:,-1],test_size=test_size,n_iter=1)

The first parameter is the array of class labels on which to stratify the split; we pass the last column of all the rows. Our test size is defined by the test_size variable, which we declared earlier, and we request a single split by setting n_iter to 1. We then proceed to invoke print_class_label_split to print the class label distribution. Let's examine the output:

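Because the Iris classes are perfectly balanced (50 records per class), both sets should now report roughly a third of the records per class, along these lines:

Train data set class label distribution
=========================================

Class label =0, percentage records =0.33
Class label =1, percentage records =0.33
Class label =2, percentage records =0.33

Test data set class label distribution
=========================================

Class label =0, percentage records =0.33
Class label =1, percentage records =0.33
Class label =2, percentage records =0.33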

Now, we have the class labels distributed uniformly between the test and train sets.
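As a side note, and not something this recipe relies on: in newer scikit-learn releases (0.18 and later), the splitting utilities live in sklearn.model_selection, and train_test_split accepts a stratify argument that produces the same balanced split in a single call. A minimal sketch, assuming such a version is installed:

# Stratified split in one call (newer scikit-learn versions)
from sklearn.model_selection import train_test_split

train, test = train_test_split(input_dataset,
                               test_size=test_size,
                               stratify=input_dataset[:, -1])
print_class_label_split(train, test)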

There's more...

We need to prepare the data carefully before its use in a machine learning algorithm. Providing a uniform class distribution to both the train and the test sets is key to building a successful classification model.

In practical machine learning scenarios, we create another dataset, called a dev set, in addition to the train and test sets. We may not get our model right in the first iteration, and we don't want to show our test dataset to the model, as this may bias our next iteration of model building. Hence, we create the dev set, which we can use as we iterate through our model building exercise, as sketched below.
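A minimal sketch of such a three-way split, reusing train_test_split twice; the 60/20/20 proportions are only an illustrative choice:

# Carve out 40% of the records first, then split that portion
# half-and-half into dev and test (60/20/20 overall)
train, rest = train_test_split(input_dataset, test_size=0.4)
dev, test   = train_test_split(rest, test_size=0.5)

print "Train size ", train.shape
print "Dev   size ", dev.shape
print "Test  size ", test.shape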

The 80/20 rule of thumb that we used in this recipe is an ideal scenario. However, in many practical applications, we may not have enough data to leave out that many instances for a test set. A few practical techniques, such as cross-validation, come into play in such scenarios. In the next chapter, we will look at the various cross-validation techniques.
