© A.J. Henley and Dave Wolf 2018
A.J. Henley and Dave Wolf, Learn Data Analysis with Python, https://doi.org/10.1007/978-1-4842-3486-0_5

5. Visualizing Data

A. J. Henley (1) and Dave Wolf (2)
(1) Washington, D.C., District of Columbia, USA
(2) Sterling Business Advantage, LLC, Adamstown, Maryland, USA

Data Quality Report

When you have looked at enough datasets, you will develop a set of questions you want answered about the data to ascertain how good the dataset is. The following scripts combine to form a data quality report that I use to evaluate the datasets that I work with.

# import the data
import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
df.mode().transpose()
Listing 5-1

Load Dataset from CSV

data_types = pd.DataFrame(df.dtypes,
        columns=['Data Type'])
data_types
Listing 5-2

Finding Data Types of Each Column

missing_data_counts = pd.DataFrame(df.isnull().sum(),
        columns=['Missing Values'])
missing_data_counts
Listing 5-3

Counting Number of Missing Observations by Column

present_data_counts = pd.DataFrame(df.count(),
        columns=['Present Values'])
present_data_counts
Listing 5-4

Counting Number of Present Observations by Column

unique_value_counts = pd.DataFrame(
        columns=['Unique Values'])
for v in list(df.columns.values):
        unique_value_counts.loc[v] = [df[v].nunique()]
unique_value_counts
Listing 5-5

Counting Number of Unique Observations by Column

minimum_values = pd.DataFrame(columns=[
        'Minimum Values'])
for v in list(df.columns.values):
        minimum_values.loc[v] = [df[v].min()]
minimum_values
Listing 5-6

Finding the Minimum Value for Each Column

maximum_values = pd.DataFrame(
        columns=['Maximum Values'])
for v in list(df.columns.values):
        maximum_values.loc[v] = [df[v].max()]
maximum_values
Listing 5-7

Finding the Maximum Value for Each Column

pd.concat([present_data_counts,
        missing_data_counts,
        unique_value_counts,
        minimum_values,
        maximum_values],
        axis=1)
Listing 5-8

Joining All the Computed Lists into 1 Report
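The separate listings above can also be folded into a single helper. What follows is a minimal sketch rather than the chapter's own code: the function name data_quality_report and the small in-memory sample DataFrame are assumptions made for illustration, standing in for the CSV file. It relies on the vectorized count, isnull, and nunique methods instead of explicit loops, and it drops missing entries before taking the minimum and maximum so that text columns with gaps do not cause comparison errors.

```python
import pandas as pd


def data_quality_report(df):
    """Return one row per column of df with basic quality measures.

    Hypothetical helper, not part of pandas.
    """
    report = pd.DataFrame({
        "Data Type": df.dtypes.astype(str),
        "Present Values": df.count(),
        "Missing Values": df.isnull().sum(),
        "Unique Values": df.nunique(),
    })
    # Drop missing entries first so min/max only compare like with like.
    report["Minimum Values"] = [df[c].dropna().min() for c in df.columns]
    report["Maximum Values"] = [df[c].dropna().max() for c in df.columns]
    return report


# Illustrative sample data (an assumption, not the book's dataset).
sample = pd.DataFrame({
    "name": ["Bob", "Jessica", "Mary", "John"],
    "grade": [76.0, 83.0, None, 78.0],
})
report = data_quality_report(sample)
print(report)
```

To try it on the chapter's data, pass the DataFrame returned by pd.read_csv instead of the sample.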

Your Turn

Can you create a data quality report for the datasets/tamiami.csv dataset?

Graph a Dataset: Line Plot

To create a simple line plot, input the code from Listing 5-9.

import pandas as pd
names = ['Bob','Jessica','Mary','John','Mel']
grades = [76,83,77,78,95]
GradeList = list(zip(names,grades))
df = pd.DataFrame(data = GradeList,
        columns=['Names', 'Grades'])
%matplotlib inline
df.plot()
Listing 5-9

Line Plotting Your Dataset

When you run it, you should see a graph that looks like Figure 5-1.
../images/463663_1_En_5_Chapter/463663_1_En_5_Fig1_HTML.jpg
Figure 5-1

Simple Line Plot

Customizing the graph is easy, but you need to import the matplotlib library first.

Add the code in Listing 5-10 to what you did already.

import matplotlib.pyplot as plt
df.plot()
displayText = "my annotation"
xloc = 1
yloc = df['Grades'].max()
xtext = 8
ytext = 0
plt.annotate(displayText,
            xy=(xloc, yloc),
            xytext=(xtext,ytext),
            xycoords=('axes fraction', 'data'),
            textcoords='offset points')
Listing 5-10

Code to Plot a Customized Graph

The annotate command is well documented at http://matplotlib.org/api/pyplot_api.html , but let's take apart what we typed:
  • displayText: the text we want to show for this annotation

  • xloc, yloc: the coordinates of the data point we want to annotate

  • xtext, ytext: coordinates of where we want the text to appear using the coordinate system specified in textcoords

  • xycoords: sets the coordinate system to use to find the data point; it can be set separately for x and y. Here, x is a fraction of the axes width (1 is the right edge) and y is a data value

  • textcoords: sets the coordinate system to use to place the text

Finally, we can add an arrow linking the data point annotated to the text annotation (Listing 5-11).

df.plot()
displayText = "my annotation"
xloc = 1
yloc = df['Grades'].max()
xtext = 8
ytext = -150     
plt.annotate(displayText,
            xy=(xloc, yloc),
            arrowprops=dict(facecolor='black',
                        shrink=0.05),   
            xytext=(xtext,ytext),
            xycoords=('axes fraction', 'data'),
            textcoords='offset points')
Listing 5-11

Code to Plot a Customized Graph

All we did is adjust the offset of the text so that there was enough room between the data and the annotation to actually see the arrow. We did this by changing the ytext value from 0 to -150. Then, we added the setting for the arrow.

More information about creating arrows can be found on the documentation page for annotate at http://matplotlib.org/users/annotations_intro.html .
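The listings above place the annotation at the right edge of the axes. As a variation, here is a minimal sketch that uses data coordinates for both x and y, so the arrow points at the exact data point with the highest grade. The annotation text, the offsets, and the variable names top_x and top_y are assumptions chosen for illustration, and the Agg backend line is only there so the script can run outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
grades = [76, 83, 77, 78, 95]
df = pd.DataFrame({'Names': names, 'Grades': grades})

ax = df.plot(y='Grades')

# Position (on the default 0..4 index) and value of the highest grade.
top_x = int(df['Grades'].idxmax())
top_y = int(df['Grades'].max())

annotation = ax.annotate(
    "highest grade",
    xy=(top_x, top_y),           # the point itself, in data coordinates
    xycoords='data',
    xytext=(-120, -60),          # text sits 120 points left of and 60 points below it
    textcoords='offset points',
    arrowprops=dict(facecolor='black', shrink=0.05),
)
```

Because the text position is given as an offset in points, the label stays at the same distance from the data point even if the axes limits change.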

Your Turn

Take the same dataset we used in this example and add an annotation to Bob's 76 that says “Wow!”

Graph a Dataset: Bar Plot

To create a bar plot, input the code in Listing 5-12.

import matplotlib.pyplot as plt
import pandas as pd
names = ['Bob','Jessica','Mary','John','Mel']
status = ['Senior','Freshman','Sophomore','Senior',
        'Junior']
grades = [76,95,77,78,99]
GradeList = list(zip(names,grades))
df = pd.DataFrame(data = GradeList,
        columns=['Names', 'Grades'])
%matplotlib inline
df.plot(kind='bar')
Listing 5-12

Bar Plotting Your Dataset

Once you run it, you will get a simple bar plot, but the labels on the x-axis are the numbers 0–4.
../images/463663_1_En_5_Chapter/463663_1_En_5_Fig2_HTML.jpg
Figure 5-2

Simple Bar Plot

But if we convert the Names column into the index, we can improve the graph. So, first, we need to add the code in Listing 5-13.

df2 = df.set_index(df['Names'])
df2.plot(kind="bar")
Listing 5-13

Adding Code to Plot Your Dataset

We will then get a graph that looks like Figure 5-3.
../images/463663_1_En_5_Chapter/463663_1_En_5_Fig3_HTML.jpg
Figure 5-3

Bar Plot with Axis Titles
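A second DataFrame is not strictly required to get named bars. As a sketch of an alternative, the x and y arguments of the plot method can name the label column and the value column directly. The variable tick_labels is an assumption added here only to inspect the result, and the Agg backend line lets the script run outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
grades = [76, 95, 77, 78, 99]
df = pd.DataFrame({'Names': names, 'Grades': grades})

# x picks the column used for the bar labels, y the column that is plotted.
ax = df.plot(kind='bar', x='Names', y='Grades')

# Read the labels back from the axes to confirm what was drawn.
tick_labels = [tick.get_text() for tick in ax.get_xticklabels()]
print(tick_labels)
```

Both approaches should give the same picture; setting the index is handy when several later plots share the same labels.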

Your Turn

Can you change the code to create a bar plot where the status is the label?

Graph a Dataset: Box Plot

To create a box plot, input the code in Listing 5-14.

import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
names = ['Bob','Jessica','Mary','John','Mel']
grades = [76,95,77,78,99]
gender = ['Male','Female','Female','Male','Female']
status = ['Senior','Senior','Junior','Junior','Senior']
GradeList = list(zip(names,grades,gender))
df = pd.DataFrame(data = GradeList, columns=['Names', 'Grades', 'Gender'])
df.boxplot(column='Grades')
Listing 5-14

Box Plotting Your Dataset

Once you run it, you will get a simple box plot.
../images/463663_1_En_5_Chapter/463663_1_En_5_Fig4_HTML.jpg
Figure 5-4

Simple Box Plot

Now, we can use a single command to create categorized graphs (in this case, categorized by gender). See Listing 5-15.

df.boxplot(by='Gender', column="Grades")
Listing 5-15

Adding Code to Categorize Your Box Plot

And we will then get a graph that looks like Figure 5-5.
../images/463663_1_En_5_Chapter/463663_1_En_5_Fig5_HTML.jpg
Figure 5-5

Categorized Box Plot

And, finally, to adjust the y-axis so that it runs from 0 to 100, we can run the code in Listing 5-17.

axis1 = df.boxplot(by='Gender', column="Grades")
axis1.set_ylim(0,100)
Listing 5-17

Adding Code to Adjust the Y-axis

It will produce a graph like the one in Figure 5-6.
../images/463663_1_En_5_Chapter/463663_1_En_5_Fig6_HTML.jpg
Figure 5-6

Box Plot Grouped by Gender
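It can help to see the numbers behind the boxes. In a box plot, the box spans the first to third quartile and the line inside it marks the median. The sketch below, which is an illustration and not part of the original listings, computes those values per gender with groupby and describe; the variable name summary is an assumption.

```python
import pandas as pd

names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
grades = [76, 95, 77, 78, 99]
gender = ['Male', 'Female', 'Female', 'Male', 'Female']
df = pd.DataFrame({'Names': names, 'Grades': grades, 'Gender': gender})

# One row per gender; the 25%, 50%, and 75% columns are the quartiles
# that the box and its median line are drawn from.
summary = df.groupby('Gender')['Grades'].describe()
print(summary[['min', '25%', '50%', '75%', 'max']])
```

With only two or three students per group the quartiles are interpolated, so treat the boxes for such small groups as rough guides.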

Your Turn

Using the dataset we just created:
  • Can you create a box plot of the grades categorized by student status?

  • Can you create that box plot with a y-axis that runs from 50 to 110?

Graph a Dataset: Histogram

Because of the nature of histograms, we really need more data than is found in the example dataset we have been working with. Enter the code from Listing 5-18 to import the larger dataset.

import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
Listing 5-18

Importing Dataset from CSV File

To create a simple histogram, we can simply add the code in Listing 5-19.

df.hist()
Listing 5-19

Creating a Histogram

../images/463663_1_En_5_Chapter/463663_1_En_5_Fig7_HTML.jpg
Figure 5-7

Simple Histogram

Because we did not say which column to count, pandas gives you histograms for all the columns with numeric values.

In order to see a histogram for just hours, we can specify it as in Listing 5-20.

df.hist(column="hours")
Listing 5-20

Creating Histogram for Single Column

../images/463663_1_En_5_Chapter/463663_1_En_5_Fig8_HTML.jpg
Figure 5-8

Single Column Histogram

And to see histograms of hours separated by gender, we can use Listing 5-21.

df.hist(column="hours", by="gender")
Listing 5-21

Categorized Histogram

../images/463663_1_En_5_Chapter/463663_1_En_5_Fig9_HTML.jpg
Figure 5-9

Categorized Histogram
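A histogram is a count of how many values fall into each of a set of equal-width bins. The sketch below makes that explicit by computing the counts with numpy.histogram and comparing them with the bar heights that hist draws. The data here is randomly generated and only stands in for the hours column of the CSV file; the seed, the distribution parameters, and the variable names are all assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (not the book's data).
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "hours": rng.normal(loc=10, scale=3, size=500),
    "gender": rng.choice(["female", "male"], size=500),
})

# Count the values in 10 equal-width bins spanning the data range.
counts, edges = np.histogram(df["hours"], bins=10)

# Draw the same histogram and read the bar heights back from the axes.
axes = df.hist(column="hours", bins=10)
first_axes = np.asarray(axes).ravel()[0]
bar_heights = [patch.get_height() for patch in first_axes.patches]

print(counts)
print(bar_heights)
```

The bins argument controls how coarse the picture is; fewer bins smooth the shape, more bins show detail but also more noise.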

Your Turn

Can you create an age histogram categorized by gender?

Graph a Dataset: Pie Chart

To create a pie chart, input the code from Listing 5-22.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
names = ['Bob','Jessica','Mary','John','Mel']
absences = [3,0,1,0,8]
detentions = [2,1,0,0,1]
warnings = [2,1,5,1,2]
GradeList = list(zip(names,absences,detentions,warnings))
columns=['Names', 'Absences', 'Detentions','Warnings']
df = pd.DataFrame(data = GradeList, columns=columns)
df
Listing 5-22

Pie Charting Your Dataset

This code creates a dataset of student rule violations. Next, in a new cell, we will create a column to show the total violations or demerits per student (Listing 5-23).

df['TotalDemerits'] = (df['Absences'] +
        df['Detentions'] + df['Warnings'])
df
Listing 5-23

Creating New Column

Finally, to actually create a pie chart of the number of demerits, we can just run the code from Listing 5-24.

plt.pie(df['TotalDemerits'])
Listing 5-24

Creating Pie Chart of Demerits

Once you run it, you will get a simple pie chart (Figure 5-10).
../images/463663_1_En_5_Chapter/463663_1_En_5_Fig10_HTML.jpg
Figure 5-10

Simple Pie Chart

But since it is a bit plain (and a bit elongated), let's try the code from Listing 5-25 in a new cell.

plt.pie(df['TotalDemerits'],
       labels=df['Names'],
       explode=(0,0,0,0,0.15),
       startangle=90,
       autopct='%1.1f%%',)
plt.axis('equal')
plt.show()
Listing 5-25

Creating a Customized Pie Chart

  • Line 2: This adds the students' names as labels to the pie pieces.

  • Line 3: This is what explodes out the pie piece for the fifth student. You can increase or decrease the amount to your liking.

  • Line 4: This is what rotates the pie chart to different points.

  • Line 5: This is what formats the numeric labels on the pie pieces.

  • Line 6: This is what forces the pie to be circular.

And you will see a pie chart that looks like Figure 5-11.
../images/463663_1_En_5_Chapter/463663_1_En_5_Fig11_HTML.jpg
Figure 5-11

Customized Pie Chart
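The percentages printed on the slices come from each student's share of the total, formatted with the autopct pattern. As a sketch of that arithmetic, the code below computes the shares by hand and compares them with the text matplotlib places on the chart. The variable names totals, shares, and expected_labels are assumptions for illustration, and the Agg backend line lets the script run outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'Names': ['Bob', 'Jessica', 'Mary', 'John', 'Mel'],
    'Absences': [3, 0, 1, 0, 8],
    'Detentions': [2, 1, 0, 0, 1],
    'Warnings': [2, 1, 5, 1, 2],
})
totals = df['Absences'] + df['Detentions'] + df['Warnings']

# Each slice is the student's share of all demerits, as a percentage.
shares = 100 * totals / totals.sum()
expected_labels = ['%1.1f%%' % share for share in shares]

# pie returns the wedges, the name labels, and the percentage labels.
wedges, texts, autotexts = plt.pie(totals, labels=df['Names'],
                                   autopct='%1.1f%%')
drawn_labels = [text.get_text() for text in autotexts]

print(expected_labels)
print(drawn_labels)
```

In the pattern '%1.1f%%', the 1.1f part asks for one digit after the decimal point and the doubled percent sign prints a literal percent character.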

Your Turn

What if, instead of highlighting the worst student, we put a spotlight on the best one? Let's rotate the chart and change the settings so we are highlighting John instead of Mel.

Graph a Dataset: Scatter Plot

The code in Listing 5-26 will allow us to generate a simple scatter plot.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
dataframe = pd.DataFrame({'Col':
        np.random.normal(size=200)})
plt.scatter(dataframe.index, dataframe['Col'])
Listing 5-26

Creating a Scatter Plot

  • Line 4: specifies that figures should be shown inline

  • Line 6: generates a random dataset of 200 values

  • Line 7: creates a scatter plot using the index of the dataframe as the x and the values of column Col as the y

You should get a graph that looks something like Figure 5-12.
../images/463663_1_En_5_Chapter/463663_1_En_5_Fig12_HTML.jpg
Figure 5-12

Simple Scatterplot

Looking at our plot, there doesn't seem to be any pattern to the data. It's random!
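One way to back up that visual impression with a number is the correlation coefficient between x and y, which is close to 0 when there is no linear pattern and close to 1 or -1 when the points follow a line. The sketch below is an illustration only: the seed, the second dataset with a built-in trend, and the variable names are assumptions and do not come from the chapter.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

# Same kind of data as the scatter plot above: 200 random values.
dataframe = pd.DataFrame({'Col': rng.normal(size=200)})
random_r = np.corrcoef(dataframe.index.to_numpy(),
                       dataframe['Col'].to_numpy())[0, 1]

# For contrast, values that rise steadily with the index plus some noise.
trend = pd.DataFrame({'Col': 2 * np.arange(200)
                             + rng.normal(scale=10, size=200)})
trend_r = np.corrcoef(trend.index.to_numpy(),
                      trend['Col'].to_numpy())[0, 1]

print("random data:", round(random_r, 3))
print("trend data: ", round(trend_r, 3))
```

A coefficient near zero only rules out a straight-line relationship, so it complements the scatter plot rather than replacing it.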

Your Turn

Create a scatter plot of the hours and grade data in datasets/gradedata.csv. Do you see a pattern in the data?
