Grouping the data and using dot plots

EDA is about zooming in and out of the data from multiple angles to get a better grasp of it. Let's now look at the data from a different angle using dot plots. A dot plot is a simple plot where the data is grouped and plotted on a simple scale. It's up to us to decide how we want to group the data.

Note

Dot plots are best used for small to medium-sized datasets. For large datasets, a histogram is usually used instead.

Getting ready

For this exercise, we will use the same data as the previous section.

How to do it…

Let's load the necessary libraries. We will follow it up with the loading of our data and along the way, we will handle the missing values. Finally, we will group the data using a frequency counter:

# Load libraries
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from collections import OrderedDict
from matplotlib.pylab import frange

# 1.Load the data and handle missing values.
fill_data = lambda x : int(x.strip() or 0)
data = np.genfromtxt('president.txt',dtype=(int,int),converters={1:fill_data},delimiter=",")
x = data[:,0]
y = data[:,1]

# 2.Group data using frequency (count of individual data points).
# Given a set of points, Counter() returns a dictionary, where key is a data point,
# and value is the frequency of data point in the dataset.
x_freq = Counter(y)
x_ = np.array(list(x_freq.keys()))
y_ = np.array(list(x_freq.values()))

We will proceed to group the data by the year range and plot it:

# 3.Group data by range of years
x_group = OrderedDict()
group= 5
group_count=1
keys = []
values = []
for i,xx in enumerate(x):
    # Individual data point is appended to list keys
    keys.append(xx)
    values.append(y[i])
    # If we have processed five data points (i.e. five years)
    if group_count == group:
        # Convert the list of keys to a tuple and
        # use the new tuple as the key to the x_group dictionary
        x_group[tuple(keys)] = values
        keys= []
        values =[]
        # Reset to 0 because the counter is incremented right below
        group_count = 0
        
    group_count+=1
# Accommodate the last batch of keys and values
x_group[tuple(keys)] = values

print x_group
# 4.Plot the grouped data as dot plot.
plt.subplot(311)
plt.title("Dot Plot by Frequency")
# Plot the frequency
plt.plot(y_,x_,'ro')
plt.xlabel('Count')
plt.ylabel('# Presidential Requests')
# Set the min and max limits for x axis
plt.xlim(min(y_)-1,max(y_)+1)

plt.subplot(312)
plt.title("Simple dot plot")
plt.xlabel('# Presidential Requests')
plt.ylabel('Frequency')

Finally, we will prepare the data for a simple dot plot and proceed with plotting it:

# Prepare the data for simple dot plot
# For every (item, frequency) pair create a 
# new x and y
# where x is a list, created using the np.repeat
# function, where the item is repeated frequency times.
# y is a list between 0.1 and frequency/10, incremented
# by 0.1
for key,value in x_freq.items():
    x__ = np.repeat(key,value)
    y__ = frange(0.1,(value/10.0),0.1)
    try:
        plt.plot(x__,y__,'go')
    except ValueError:
        print x__.shape, y__.shape
    # Set the min and max limits of x and y axis
    plt.ylim(0.0,0.4)
    plt.xlim(xmin=-1) 

plt.xticks(x_freq.keys())

plt.subplot(313)
x_vals =[]
x_labels =[]
y_vals =[]
x_tick = 1
for k,v in x_group.items():
    for i in range(len(k)):
        x_vals.append(x_tick)
        x_label = '-'.join([str(kk) if not i else str(kk)[-2:] for i,kk in enumerate(k)])
        x_labels.append(x_label)
    y_vals.extend(list(v))
    x_tick+=1

plt.title("Dot Plot by Year Grouping")
plt.xlabel('Year Group')
plt.ylabel('No. of Presidential Requests')
try:
    plt.plot(x_vals,y_vals,'ro')
except ValueError:
    print len(x_vals),len(y_vals)
    
plt.xticks(x_vals,x_labels,rotation=-35)
plt.show()

How it works…

In step 1, we will load the data. This is the same as the data loading discussed in the previous recipe. Before we start plotting the data, we want to group it in order to see its overall characteristics.
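As a minimal sketch of how the fill_data converter from step 1 treats missing values, the following applies it to a few hypothetical field strings (the sample strings are assumptions for illustration, not rows from president.txt):

```python
# Same converter as in step 1: blank fields become 0
fill_data = lambda x: int(x.strip() or 0)

# Hypothetical raw field values, as they might appear in a CSV column
raw_fields = ["41", " 23 ", "", "   "]

cleaned = [fill_data(field) for field in raw_fields]
print(cleaned)  # [41, 23, 0, 0]
```

An empty or whitespace-only field strips down to an empty string, which is falsy, so the expression falls back to 0 before int() is applied.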

In steps 2 and 3, we will group the data using different criteria.

Let's look at step 2.

Here, we will use a function called Counter() from the collections package.

Note

Given a set of points, Counter() returns a dictionary where key is a data point and value is the frequency of the data points in the dataset.

We will pass our dataset to Counter() and extract the keys (the actual data points) and the values (their respective frequencies) from this dictionary into the numpy arrays x_ and y_ for ease of plotting. Thus, we have now grouped our data by frequency.
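To make the frequency grouping concrete, here is a small sketch that runs Counter() on a hypothetical list of values (the list is an assumption, standing in for the y column of the dataset):

```python
from collections import Counter

# Hypothetical sample values standing in for column y of the dataset
sample = [12, 7, 12, 30, 7, 12]

freq = Counter(sample)

# Keys are the distinct data points; values are how often each occurs
points = sorted(freq)
counts = [freq[p] for p in points]

print(points)  # [7, 12, 30]
print(counts)  # [2, 3, 1]
```

Sorting the keys is optional; it is done here only so that the two lists line up in a predictable order.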

Before we move on to plot this, we will perform another grouping with this data in step 3.

We know that the x axis holds the years, and our data is sorted by year in ascending order. In this step, we will group our data into ranges of years, five in this case; that is, the first group covers the first five years, the second group covers the next five years, and so on:

group= 5
group_count=1
keys = []
values = []

The group variable defines how many years we want in a single group; in this example, each group spans five years. keys and values are two empty lists. We will proceed to fill them with values from x and y until group_count reaches group, that is, 5:

for i,xx in enumerate(x):
    keys.append(xx)
    values.append(y[i])
    if group_count == group:
        x_group[tuple(keys)] = values
        keys= []
        values =[]
        group_count = 0
    group_count+=1
x_group[tuple(keys)] = values

x_group is the dictionary that now stores the grouped values: each key is a tuple of years and each value is the list of counts for those years. We need to preserve the order in which we insert our records, so we use OrderedDict in this case.

Note

OrderedDict preserves the order in which the keys are inserted.
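The grouping in step 3 can also be sketched with list slicing instead of a running counter. This is an alternative formulation, not the recipe's own loop; the function name group_by_chunks and the sample years and counts are hypothetical:

```python
from collections import OrderedDict

def group_by_chunks(xs, ys, size):
    """Group two parallel sequences into consecutive chunks of `size` items."""
    groups = OrderedDict()
    for start in range(0, len(xs), size):
        # The tuple of x values in this chunk becomes the dictionary key
        key = tuple(xs[start:start + size])
        groups[key] = list(ys[start:start + size])
    return groups

# Hypothetical years and counts, not taken from president.txt
years = [1946, 1947, 1948, 1949, 1950, 1951, 1952]
counts = [41, 23, 16, 28, 20, 11, 19]

grouped = group_by_chunks(years, counts, 5)
for key, vals in grouped.items():
    print(key, vals)
```

With seven data points and a group size of five, the first key holds five years and the last key holds the remaining two, in insertion order.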

Now let's proceed to plot these values.

We want to plot all our graphs in a single window; hence, we pass a three-digit code to plt.subplot(). The code gives the number of rows (3, the hundreds digit), the number of columns (1, the tens digit), and finally the plot number (1, the units digit). The resulting window stacks the three plots from top to bottom: "Dot Plot by Frequency", "Simple dot plot", and "Dot Plot by Year Grouping".
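To illustrate how the three-digit subplot code breaks down, the following sketch splits it into its parts. The helper decode_subplot_code is hypothetical and not part of Matplotlib; it only mirrors the digit-by-digit reading described above:

```python
def decode_subplot_code(code):
    """Split a three-digit subplot code into (rows, columns, plot number)."""
    rows, rest = divmod(code, 100)   # hundreds digit
    cols, index = divmod(rest, 10)   # tens digit and units digit
    return rows, cols, index

# The three codes used in this recipe
for code in (311, 312, 313):
    print(code, decode_subplot_code(code))
```

In Matplotlib itself, plt.subplot(311) can equally be written as plt.subplot(3, 1, 1), which some readers find easier to follow.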

In the top graph, the data is grouped by frequency. Here, our x axis is the count and y axis is the number of Presidential Requests. We can see that 30 or more Presidential Requests have occurred only once. As said before, the dot plot is good at analyzing the range of the data points under different groupings.

The middle graph can be viewed as a very simple histogram. As the title of the graph (in plt.title()) says, it's the simplest form of a dot plot, where the x axis is the actual values and y axis is the number of times this x value occurs in the dataset. In a histogram, the bin size has to be set carefully; if not, it can distort the complete picture about the data. However, this can be avoided in this simple dot plot.
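The coordinates behind the simple dot plot can be sketched without any plotting library. This is a pure-Python approximation of the np.repeat and frange step in the recipe, written this way because frange may not be available in newer Matplotlib releases; the function name dot_plot_points and the sample values are hypothetical:

```python
from collections import Counter

def dot_plot_points(values, step=0.1):
    """Return x and y coordinates that stack one dot per occurrence."""
    xs, ys = [], []
    for value, count in sorted(Counter(values).items()):
        for level in range(1, count + 1):
            xs.append(value)                     # same x for every occurrence
            ys.append(round(level * step, 10))   # dots stacked upwards in steps

    return xs, ys

# Hypothetical sample, not taken from president.txt
sample = [7, 12, 7, 12, 12]
xs, ys = dot_plot_points(sample)
print(xs)  # [7, 7, 12, 12, 12]
print(ys)  # [0.1, 0.2, 0.1, 0.2, 0.3]
```

Each value gets one dot per occurrence, so the height of a column of dots reads directly as its frequency, with no bin size to choose.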

In the bottom graph, we have grouped the data by years.

See also

  • Creating Anonymous functions recipe in Chapter 1, Using Python for Data Science
  • Pre-processing columns recipe in Chapter 1, Using Python for Data Science
  • Acquiring data with Python recipe in Chapter 1, Using Python for Data Science
  • Using Dictionary objects recipe in Chapter 1, Using Python for Data Science