Analyzing univariate data graphically

Datasets with only one variable or column are called univariate data. Univariate is a general term in mathematics that refers to any expression, equation, function, or polynomial involving only one variable. In our case, we will restrict the term to datasets. Let's say that we measure the heights of a group of people in feet; the data will look as follows:

5, 5.2, 6, 4.7,…

Our measurement concerns only a single attribute of each person: height. This is an example of univariate data.

Getting ready

Let's start our EDA recipe by looking at a sample univariate dataset through visualization. With the right visualization techniques, it is easy to analyze a dataset's characteristics. We will use pyplot to draw graphs in order to visualize the data. Pyplot is the state-machine interface to the matplotlib plotting library: figures and axes are implicitly and automatically created to achieve the desired plot. The following link is a good reference for pyplot:

http://matplotlib.org/users/pyplot_tutorial.html
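As a minimal illustration of this state-machine behavior, a single plotting call is enough; pyplot creates the figure and axes for us behind the scenes:

import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [2, 4, 1])  # a figure and axes are created implicitly
plt.show()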

For this example, we will use data on the number of Presidential Requests to Congress made in State of the Union Addresses. The data is available at the following URL:

http://www.presidency.ucsb.edu/data/sourequests.php

The following is a sample of the data:

1946, 41
1947, 23
1948, 16
1949, 28
1950, 20
1951, 11
1952, 19
1953, 14
1954, 39
1955, 32
1956, 
1957, 14
1958, 
1959, 16
1960, 6

We will inspect this data visually and identify any outliers present. We will follow a recursive approach with respect to the outliers: once we have identified them, we will remove them from the dataset and plot the remaining data in order to look for new ones.

Note

Recursively re-examining the data after removing the perceived outliers in each iteration is a common approach in outlier detection; a rough sketch of this idea follows.
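Here is a minimal sketch of such an iterative procedure. The two-standard-deviation cut-off is purely our own illustrative assumption; in this recipe, we will instead identify the outliers visually:

import numpy as np

def mask_outliers(y, n_std=2.0):
    # Repeatedly mask points lying more than n_std standard deviations
    # from the mean of the currently unmasked data, until none remain.
    y = np.ma.asarray(y, dtype=float)
    while True:
        mean, std = y.mean(), y.std()
        far = np.ma.filled(np.abs(y - mean) > n_std * std, False)
        if not far.any():
            return y
        y = np.ma.masked_where(far, y)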

How to do it…

We will load the data using NumPy's data loading utility. Then, we will address the data quality issues; in this case, we will handle the null values. As you can see in the data, the years 1956 and 1958 have null entries. Let's replace the null values with 0 using a lambda function.

Following this, let's plot the data to look for any trends:

# Load libraries
import numpy as np
import matplotlib.pyplot as plt

# 1. Load the data, replacing the blank entries in column 1 with 0
fill_data = lambda x: int(x.strip() or 0)
data = np.genfromtxt('president.txt', dtype=(int, int),
                     converters={1: fill_data}, delimiter=",")
x = data[:, 0]
y = data[:, 1]


# 2. Plot the data to look for any trends or patterns
plt.close('all')
plt.figure(1)
plt.title("All data")
plt.plot(x, y, 'ro')
plt.xlabel('year')
plt.ylabel('No. of Presidential Requests')

Let's calculate the percentile values and plot them as references in the plot that has been generated:

# 3. Calculate percentile values (25th, 50th, 75th) to understand the data distribution
perc_25 = np.percentile(y, 25)
perc_50 = np.percentile(y, 50)
perc_75 = np.percentile(y, 75)
print
print "25th Percentile    = %0.2f" % (perc_25)
print "50th Percentile    = %0.2f" % (perc_50)
print "75th Percentile    = %0.2f" % (perc_75)
print
# 4. Plot these percentile values as references in the plot generated in the previous step
# Draw horizontal lines at the 25th, 50th, and 75th percentiles
plt.axhline(perc_25, label='25th perc', c='r')
plt.axhline(perc_50, label='50th perc', c='g')
plt.axhline(perc_75, label='75th perc', c='m')
plt.legend(loc='best')

Finally, let's inspect the data visually for outliers and then hide them using NumPy's masking functions. Let's plot the data again without the outliers:

# 5. Look for any outliers in the data by visual inspection,
# then hide them using NumPy's masking functions
# Mask the zero values
y_masked = np.ma.masked_where(y == 0, y)
# Mask the value 54
y_masked = np.ma.masked_where(y_masked == 54, y_masked)

# 6. Plot the data again
plt.figure(2)
plt.title("Masked data")
plt.plot(x, y_masked, 'ro')
plt.xlabel('year')
plt.ylabel('No. of Presidential Requests')
plt.ylim(0, 60)


# Draw horizontal lines at the 25th, 50th, and 75th percentiles
plt.axhline(perc_25, label='25th perc', c='r')
plt.axhline(perc_50, label='50th perc', c='g')
plt.axhline(perc_75, label='75th perc', c='m')
plt.legend(loc='best')
plt.show()

How it works…

In the first step, we will put some of the data loading techniques that we learned in the previous chapter into action. You will have noticed that the years 1956 and 1958 are left blank. We will replace them with 0 using an anonymous function:

fill_data = lambda x : int(x.strip() or 0)
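An empty or whitespace-only string is falsy in Python, so the expression x.strip() or 0 falls back to 0 for blank entries; for example:

>>> fill_data("  ")
0
>>> fill_data(" 23 ")
23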

The fill_data lambda function will replace any null value in the dataset (in this case, on lines 11 and 13 of the data file, that is, the years 1956 and 1958) with 0:

data = np.genfromtxt('president.txt',dtype=(int,int),converters={1:fill_data},delimiter=",")

We will pass fill_data to the genfromtxt function's converters parameter. Note that converters takes a dictionary as its input. The key in the dictionary dictates which column the function should be applied to, and the value is the function itself. In this case, we set the key to 1 and the value to fill_data, indicating that fill_data should be applied to column 1. Now, let's look at the data in the console:

>>> data[7:15]
array([[1953,   14],
       [1954,   39],
       [1955,   32],
       [1956,    0],
       [1957,   14],
       [1958,    0],
       [1959,   16],
       [1960,    6]])
>>>

As we can see, the years 1956 and 1958 have a 0 value added to them. For ease of plotting, we will load the year data into x and the number of Presidential Requests to Congress in the State of the Union Address into y:

x = data[:,0]
y = data[:,1]

As you can see, the first column (the year) is loaded into x and the second column into y.

In step 2, we will plot the data, with the x axis representing the year and the y axis representing the number of requests:

plt.close('all')

We will first close any graphs that are still open from previous programs:

plt.figure(1)

We will give a number to our plot. This is very useful when we have a lot of graphs in a program:

plt.title("All data")

We will specify a title for our plot:

plt.plot(x,y,'ro')

Finally, we will plot x and y. The 'ro' parameter tells pyplot to plot x and y as circles (o) in the color red (r):

plt.xlabel('year')
plt.ylabel('No. of Presidential Requests')

Finally, the x and y axis labels are provided.

The output looks as follows:

[Figure: the "All data" scatter plot of the number of requests by year]

A casual look at this graph shows that the data is spread everywhere and no trend or pattern can be found at first glance. However, with a keen eye, you can notice three points: one at the top on the right-hand side, and two others just to the left of 1960 on the x axis. They are starkly different from all the other points in the sample, and hence, they are outliers.

Note

An outlier is an observation that lies outside the overall pattern of a distribution (Moore and McCabe 1999).

In order to understand these points further, we will take the help of percentiles.

Note

If we have a vector V of length N, the qth percentile of V is the qth ranked value in a sorted copy of V. If the normalized ranking does not exactly match q, the percentile is interpolated from the values and distances of the two nearest neighbors. This function is the same as the median if q=50, the same as the minimum if q=0, and the same as the maximum if q=100.

Refer to http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.percentile.html for more information.
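For instance, with NumPy's default linear interpolation on a toy vector:

>>> import numpy as np
>>> v = np.array([1, 2, 3, 4])
>>> np.percentile(v, 25)
1.75

Here, the 25th percentile falls between the first-ranked value 1 and the second-ranked value 2, so NumPy linearly interpolates to 1.75.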

Why don't we use averages? We will look into averages in the summary statistics section; however, looking at percentiles has its own advantages. Average values are typically skewed by outliers: an outlier such as the one at the top on the right-hand side can drag the average to a higher value, while the outliers near 1960 can do the opposite. Percentiles give us better clarity about the range of values in our dataset. We can calculate percentiles using NumPy.
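As a quick illustration of how a single extreme value skews the mean but not the median (the numbers here are made up):

>>> y_demo = np.array([10, 12, 14, 95])
>>> np.mean(y_demo)
32.75
>>> np.median(y_demo)
13.0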

In step 3, we will calculate the percentiles and print them.

The percentile values calculated and printed for this dataset are as follows:

25th Percentile    = 13.00
50th Percentile    = 18.50
75th Percentile    = 25.25

Note

Interpreting the percentiles:

25% of the points in the dataset are below 13.00 (25th percentile value).

50% of the points in the dataset are below 18.50 (50th percentile value).

75% of the points in the dataset are below 25.25 (75th percentile value).

A point to note is that the 50th percentile is the median. Percentiles give us a good idea of the range of our values.
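This equivalence is easy to verify in the console:

>>> v = np.array([1, 2, 3, 4])
>>> np.median(v)
2.5
>>> np.percentile(v, 50)
2.5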

In step 4, we will plot these percentile values as horizontal lines in our graph in order to enhance our visualization:

# Draw horizontal lines at the 25th, 50th, and 75th percentiles
plt.axhline(perc_25,label='25th perc',c='r')
plt.axhline(perc_50,label='50th perc',c='g')
plt.axhline(perc_75,label='75th perc',c='m')
plt.legend(loc='best')

We used the plt.axhline() function to draw these horizontal lines. This function draws a line at the given y value, spanning the full width of the axes by default. Using the label parameter, we gave each line a name, and we set the color of the line through the c parameter.
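The optional xmin and xmax parameters (given as fractions of the axes width, from 0 to 1) restrict the span. For instance, this draws a dashed median line across only the middle half of the axes:

plt.axhline(perc_50, xmin=0.25, xmax=0.75, c='g', ls='--')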

Tip

A good way to understand any function is to pass the function name to help() in the Python console. In this case, help(plt.axhline) in the Python console will give you the details.

Finally, we will place the legend using plt.legend(), and using the loc parameter, ask pyplot to determine the best location to put the legend so that it does not affect the plot readability.

Our graph is now as follows:

[Figure: the "All data" plot with horizontal lines marking the 25th, 50th, and 75th percentiles]

In step 5, we will move on to mask the outliers using NumPy's masking functions:

# Mask the zero values
y_masked = np.ma.masked_where(y == 0, y)
# Mask the value 54
y_masked = np.ma.masked_where(y_masked == 54, y_masked)

Masking is a convenient way to hide some of the values without removing them from our array. We used the ma.masked_where function, to which we passed a condition and an array. The function then masks the values in the array that meet the condition. Our first condition was to mask all the points in the y array where the value was 0. We stored the new masked array as y_masked. We then applied another condition on y_masked to mask the value 54.
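The following console sketch shows masking in action on a small array; masked entries print as -- and are ignored both in computations and by pyplot when plotting:

>>> a = np.array([3, 0, 7])
>>> a_masked = np.ma.masked_where(a == 0, a)
>>> print a_masked
[3 -- 7]
>>> a_masked.mean()
5.0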

Finally, in step 6, we will repeat the plotting steps. Our final plot looks as follows:

[Figure: the "Masked data" plot, with the outliers masked and the percentile reference lines retained]

See also

  • Creating Anonymous functions recipe in Chapter 1, Using Python for Data Science
  • Pre-processing columns recipe in Chapter 1, Using Python for Data Science
  • Acquiring data with Python recipe in Chapter 1, Using Python for Data Science
  • Outliers recipe in Chapter 4, Analyzing Data - Deep Dive