Performing summary statistics and plots

The primary purpose of summary statistics is to get a good understanding of the location and dispersion of the data. By summary statistics, we mean quantities such as the mean, median, and standard deviation. These quantities are quite easy to calculate; however, one should be careful when using them. If the underlying data is not unimodal, that is, if it has multiple peaks, these quantities may not be of much use.

Note

If the given data is unimodal, that is, having only one peak, then the mean, which gives the location, and the standard deviation, which gives the dispersion, are valuable metrics.
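To see why multimodal data is a problem, here is a minimal sketch on a synthetic sample (generated purely for illustration): half of the points cluster around 2 and half around 10, so the mean lands near 6, a region where hardly any data lies.

```python
import numpy as np

# Synthetic bimodal sample: two clusters centred at 2 and 10
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(2, 0.5, 500),
                         rng.normal(10, 0.5, 500)])

# The mean falls between the two peaks, where hardly any data lies
print(np.mean(sample))
```

Neither cluster is anywhere near the reported mean, so the single location number misrepresents the data.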

Getting ready

Let's use our Iris dataset to explore some of these summary statistics. In this recipe, there is no single end-to-end program producing one output; instead, we will go through a series of steps, each demonstrating a different summary measure.

How to do it…

Let's begin by importing the necessary libraries. We will follow it up with the loading of the Iris dataset:

# Load libraries
from sklearn.datasets import load_iris
import numpy as np
from scipy.stats import trim_mean

# Load the Iris data
data = load_iris()
x = data['data']
y = data['target']
col_names = data['feature_names']

Let's now demonstrate how to calculate the mean, trimmed mean, and range values:

# 1. Calculate and print the mean value of each column in the Iris dataset
print("col name,mean value")
for i, col_name in enumerate(col_names):
    print("%s,%0.2f" % (col_name, np.mean(x[:, i])))
print()

# 2. Trimmed mean calculation
p = 0.1  # 10% trimmed mean
print()
print("col name,trimmed mean value")
for i, col_name in enumerate(col_names):
    print("%s,%0.2f" % (col_name, trim_mean(x[:, i], p)))
print()

# 3. Data dispersion: calculate and display the range values
print("col_names,max,min,range")
for i, col_name in enumerate(col_names):
    print("%s,%0.2f,%0.2f,%0.2f" % (col_name, max(x[:, i]), min(x[:, i]), max(x[:, i]) - min(x[:, i])))
print()

Finally, we will show the variance, standard deviation, mean absolute deviation, and median absolute deviation calculations:

# 4. Data dispersion: variance and standard deviation
print("col_names,variance,std-dev")
for i, col_name in enumerate(col_names):
    print("%s,%0.2f,%0.2f" % (col_name, np.var(x[:, i]), np.std(x[:, i])))
print()
    
# 5. Mean absolute deviation calculation
def mad(x, axis=None):
    mean = np.mean(x, axis=axis)
    return np.sum(np.abs(x - mean)) / (1.0 * len(x))

print("col_names,mad")
for i, col_name in enumerate(col_names):
    print("%s,%0.2f" % (col_name, mad(x[:, i])))
print()

# 6. Median absolute deviation calculation
def mdad(x, axis=None):
    median = np.median(x, axis=axis)
    return np.median(np.abs(x - median))

print("col_names,median,median abs dev,inter quartile range")
for i, col_name in enumerate(col_names):
    iqr = np.percentile(x[:, i], 75) - np.percentile(x[:, i], 25)
    print("%s,%0.2f,%0.2f,%0.2f" % (col_name, np.median(x[:, i]), mdad(x[:, i]), iqr))
print()

How it works…

The loading of the Iris dataset is not repeated here; the reader can refer to the previous recipe for it. Further, we assume that the x variable holds all the Iris records, with each record having four columns.

Step 1 prints the mean value of each column in the Iris dataset, using NumPy's mean function. The print statement lists each column name along with its mean value.

As you can see, we have the mean value for each column. The code to calculate the mean is as follows:

np.mean(x[:,i])

We pass all the rows of column i on each iteration of the loop; thus, we get the mean value column by column.
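As a side note, the same column-wise means can be obtained in a single call by passing axis=0 to np.mean instead of looping; a small sketch on a toy two-column array:

```python
import numpy as np

m = np.array([[1.0, 10.0],
              [3.0, 30.0]])

# axis=0 averages down each column in one call
print(np.mean(m, axis=0))   # [ 2. 20.]
```

The loop in step 1 makes the per-column logic explicit, while the axis argument is the more idiomatic NumPy form.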

Another interesting measure is the trimmed mean, which is more robust. The 10% trimmed mean of a given sample is computed by excluding the 10% largest and 10% smallest values from the sample and taking the arithmetic mean of the remaining 80% of the sample.

Note

Compared to the regular mean, a trimmed mean is less sensitive to outliers.

SciPy provides us with a trim_mean function. We demonstrate the trimmed mean calculation in step 2, printing the trimmed mean of each column.

With the Iris dataset, we don't see a lot of difference, but in real-world datasets, the trimmed mean is very handy as it gives a better picture of the location of the data.
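To see the robustness on data that does contain an outlier, here is a small sketch with a made-up sample:

```python
import numpy as np
from scipy.stats import trim_mean

# Made-up sample: four tight values plus one extreme outlier
sample = np.array([4.9, 5.0, 5.1, 5.2, 100.0])

print(np.mean(sample))          # dragged up toward 24 by the outlier
print(trim_mean(sample, 0.2))   # close to 5.1; the outlier is trimmed away
```

With 20% trimmed from each end, the extreme value never enters the calculation, so the trimmed mean stays close to where the bulk of the data lies.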

Until now, we looked at the location of the data: the mean and the trimmed mean both give a good inference about where the data is located. Another important aspect to look at is the dispersion of the data. The simplest measure of dispersion is the range: given a set of values, x, the range is the maximum value of x minus the minimum value of x. Step 3 calculates and prints the range of each column.
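The range computation can be checked by hand on a toy sample; NumPy also offers np.ptp ("peak to peak") as a shortcut:

```python
import numpy as np

v = np.array([4.3, 5.8, 7.9, 5.1])

print(v.max() - v.min())   # range computed directly from max and min
print(np.ptp(v))           # the same value via NumPy's peak-to-peak helper
```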

Note

If most of the values cluster around a single value and there are a few extreme values, the range may be misleading.

When the data falls in a very narrow range and clusters around a single value, variance is used as the typical measure of the dispersion, or spread, of the data. Variance is the sum of the squared differences between the individual values and the mean value, divided by the number of instances. Step 4 shows the variance calculation.

In the preceding code, in addition to variance, we also compute std-dev, that is, the standard deviation. Because the variance is built from squared differences, it is not on the same measurement scale as the original data. The standard deviation, which is the square root of the variance, brings the measure back to the data's original scale. The print statement lists both the variance and the standard deviation for each column.
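The definition can be verified by hand on a toy sample, computing the squared differences explicitly and comparing against np.var and np.std:

```python
import numpy as np

v = np.array([2.0, 4.0, 6.0, 8.0])

mean = np.mean(v)                             # 5.0
variance = np.sum((v - mean) ** 2) / len(v)   # population variance, as defined above

print(variance, np.var(v))   # both give 5.0
print(np.std(v))             # sqrt(5), back on the original scale
```

Note that np.var divides by the number of instances by default (the population variance, matching the definition in the text); pass ddof=1 if the sample variance is needed instead.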

As we mentioned earlier, the mean is very sensitive to outliers; the variance is computed from the mean and is therefore prone to the same issues. To avoid this trap, we can use other measures of spread. One such measure is the mean absolute deviation: instead of squaring the difference between the individual values and the mean and dividing by the number of instances, we take the absolute difference between the individual values and the mean and divide it by the number of instances. In step 5, we defined a function for this:

def mad(x, axis=None):
    mean = np.mean(x, axis=axis)
    return np.sum(np.abs(x - mean)) / (1.0 * len(x))

As you can see, the function returns the average absolute difference between the individual values and the mean. Step 5 prints this value for each column.
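A small sketch on a made-up sample shows why this helps: with an extreme outlier present, the squared differences inflate the standard deviation more than the absolute differences inflate the mean absolute deviation.

```python
import numpy as np

def mad(x, axis=None):
    # Mean absolute deviation, as defined in step 5
    mean = np.mean(x, axis=axis)
    return np.sum(np.abs(x - mean)) / (1.0 * len(x))

# Made-up sample: four tight values plus one extreme outlier
sample = np.array([4.9, 5.0, 5.1, 5.2, 100.0])

print(np.std(sample))   # squaring the differences inflates this estimate
print(mad(sample))      # absolute differences grow more slowly
```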

When the data has many outliers, another set of metrics comes in handy: the median and percentiles. We already saw percentiles in the previous section while plotting univariate data. Traditionally, the median is defined as the value in the dataset such that half of the points are smaller than it and the other half are larger.

Note

Percentiles are a generalization of the concept of median. The 50th percentile is the traditional median value.

We saw the 25th and 75th percentiles in the previous section. The 25th percentile is a value such that 25% of all the points in the dataset are smaller than this value:

>>> a = [8,9,10,11]
>>> np.median(a)
9.5
>>> np.percentile(a,50)
9.5

The median is a measure of the location of the data distribution. Using percentiles, we can also get a metric for the dispersion of the data: the interquartile range, which is the distance between the 75th percentile and the 25th percentile. Just as we have the mean absolute deviation, explained previously, we also have the median absolute deviation.

In step 6, we will calculate and display both the interquartile range and median absolute deviation. We will define the following function in order to calculate the median absolute deviation:

def mdad(x, axis=None):
    median = np.median(x, axis=axis)
    return np.median(np.abs(x - median))

The output lists the median, the median absolute deviation, and the interquartile range of each column.
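Both quantities can be checked by hand on a tiny made-up sample; note that none of them is affected by the extreme value:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Interquartile range: distance between the 75th and 25th percentiles
iqr = np.percentile(a, 75) - np.percentile(a, 25)

# Median absolute deviation: median of the distances from the median
med_abs_dev = np.median(np.abs(a - np.median(a)))

print(np.median(a), iqr, med_abs_dev)   # 3.0 2.0 1.0
```

Replacing 100.0 with any larger value leaves all three numbers unchanged, which is exactly the robustness these median-based measures buy us.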

See also

  • Grouping Data and Using Plots recipe in Chapter 3, Analyzing Data - Explore & Wrangle