Performing random sampling

In this recipe, we will learn how to perform random sampling of data.

Getting ready

Typically, in scenarios where it is very expensive to access the whole dataset, sampling can be used to extract a portion of the dataset for analysis. Sampling is also useful in EDA. A sample should be a good representative of the underlying dataset; that is, it should have approximately the same characteristics. For example, the sample mean should be as close to the original data's mean as possible. There are several sampling techniques; we will cover one of them here.

In simple random sampling, every record has an equal chance of being selected. For our example, we want to sample ten records randomly from the Iris dataset.

How to do it…

We will begin by loading the necessary libraries and importing the Iris dataset:

# Load libraries
from sklearn.datasets import load_iris
import numpy as np

# 1.	Load the Iris data set
data = load_iris()
x = data['data']

Let's demonstrate how sampling is performed:

# 2.	Randomly sample 10 records from the loaded dataset
no_records = 10
x_sample_indx = np.random.choice(range(x.shape[0]), no_records)
print(x[x_sample_indx, :])
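Note that rerunning this step returns a different set of rows each time, because choice draws from NumPy's global random state. If you want a reproducible sample, you can seed the generator first; this seeding step is our addition and not part of the original recipe:

# Optional (our addition): seed the global random state so that
# the sampled rows are reproducible across runs
np.random.seed(42)
x_sample_indx = np.random.choice(range(x.shape[0]), no_records)
print(x[x_sample_indx, :])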

How it works…

In step 1, we load the Iris dataset. In step 2, we make a random selection using the choice function from numpy.random.

We pass two parameters to the choice function: a range covering the total number of rows in the original dataset, and the sample size that we require. From zero to the total number of rows, choice randomly picks n integers, where n is the sample size, dictated by no_records in our case.

Another important parameter of the choice function is replace, which is set to True by default; it specifies whether we sample with or without replacement. Sampling without replacement removes each sampled item from the candidate pool, so it cannot be picked again. Sampling with replacement does the opposite: every element keeps an equal chance of being sampled again, even if it has already been picked.
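To see the difference in practice, here is a minimal sketch of sampling without replacement; we simply pass replace=False so that no row index can be picked twice:

# Sample 10 distinct rows: replace=False guarantees no duplicate indices
x_sample_indx = np.random.choice(range(x.shape[0]), no_records, replace=False)
print(x[x_sample_indx, :])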

There's more…

Stratified sampling

If the underlying dataset consists of different groups, simple random sampling may fail to capture enough samples from each group to represent the data well. For example, consider a two-class classification problem in which 10% of the data belongs to the positive class and 90% to the negative class; this is called a class imbalance problem in machine learning. When we sample from such an imbalanced dataset, the sample should reflect the preceding percentages. This kind of sampling is called stratified sampling. We will look more into stratified sampling in future chapters on machine learning.
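As a preview, the following minimal sketch shows one way to draw a stratified sample of the Iris data using scikit-learn's train_test_split and its stratify parameter, which preserves the class proportions in the sample. This is an illustration only, not the approach covered in the later chapters:

# A minimal stratified-sampling sketch (illustrative only)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()
x, y = data['data'], data['target']

# Ask for a 10-record sample whose class proportions mirror those of y
_, x_sample, _, y_sample = train_test_split(x, y, test_size=10, stratify=y)
print(x_sample)
print(y_sample)   # roughly equal counts of each Iris class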

Progressive sampling

How do we determine the correct sample size for a given problem? We have discussed several sampling techniques, but none of them tells us how large the sample should be, and there is no simple answer. One heuristic is progressive sampling: select a sample size, draw a sample using any of the sampling techniques, apply the desired operation on the data, and record the results; then increase the sample size and repeat the steps. This iterative process is called progressive sampling.
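As a rough illustration, the sketch below grows the sample until the sample mean of the first feature stops changing appreciably; the step size and stopping tolerance are arbitrary choices made for this example, not prescribed values:

# A minimal progressive-sampling sketch: grow the sample until the
# sample mean of the first feature stabilizes. The step size (10)
# and tolerance (0.01) are arbitrary, illustrative choices.
prev_mean = None
for sample_size in range(10, x.shape[0] + 1, 10):
    indx = np.random.choice(range(x.shape[0]), sample_size, replace=False)
    curr_mean = x[indx, 0].mean()
    if prev_mean is not None and abs(curr_mean - prev_mean) < 0.01:
        break
    prev_mean = curr_mean
print("sample size:", sample_size, "sample mean:", curr_mean)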
