Data Quality Report
Once you have looked at enough datasets, you will develop a set of questions you want answered about the data to ascertain how good the dataset is. The following scripts combine to form a data quality report that I use to evaluate the datasets that I work with.
Load Dataset from CSV
Finding Data Types of Each Column
Counting Number of Missing Observations by Column
Counting Number of Present Observations by Column
Counting Number of Unique Observations by Column
Finding the Minimum Value for Each Column
Finding the Maximum Value for Each Column
Joining All the Computed Lists into 1 Report
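The steps listed above can be sketched in a single function. This is only one way to assemble such a report and is not necessarily how the listings do it. The DataFrame below is hypothetical stand-in data (the chapter loads a CSV file instead), and data_quality_report is a name chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in data; in practice this would come from pd.read_csv(...).
df = pd.DataFrame({
    "name": ["Bob", "Jessica", "Mary", "John", "Mel"],
    "grade": [76.0, 95.0, None, 78.0, 99.0],
    "status": ["Senior", "Sophomore", "Senior", None, "Freshman"],
})


def data_quality_report(frame):
    """Return one row per column of `frame` with basic quality measures."""
    columns = frame.columns
    return pd.DataFrame({
        "data_type": frame.dtypes.astype(str),   # type of each column
        "missing": frame.isnull().sum(),         # count of missing observations
        "present": frame.count(),                # count of non-missing observations
        "unique": frame.nunique(),               # distinct non-missing values
        # Minimum and maximum are taken column by column, ignoring missing values.
        "minimum": pd.Series({c: frame[c].dropna().min() for c in columns}),
        "maximum": pd.Series({c: frame[c].dropna().max() for c in columns}),
    })


report = data_quality_report(df)
print(report)
```

Every measure is indexed by column name, so pandas lines the pieces up into one table without any explicit join.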
Your Turn
Can you create a data quality report for the datasets/tamiami.csv dataset?
Graph a Dataset: Line Plot
To create a simple line plot, input the code from Listing 5-9.
Line Plotting Your Dataset
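A minimal sketch of such a line plot, assuming a small hypothetical dataset of student names and grades. Only Bob's 76 is mentioned in this section; the other values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: only Bob's 76 appears in the text, the rest is illustrative.
names = ["Bob", "Jessica", "Mary", "John", "Mel"]
grades = [76, 95, 77, 78, 99]
df = pd.DataFrame({"Names": names, "Grades": grades})

# Series.plot() draws a line plot by default, with the index (0 to 4) on the x-axis.
ax = df["Grades"].plot()
ax.set_ylabel("Grades")
```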
Customizing the graph is easy, but you need to add the matplotlib library first.
Add the code in Listing 5-10 to what you did already.
Code to Plot a Customized Graph
displayText: the text we want to show for this annotation
xloc, yloc: the coordinates of the data point we want to annotate
xtext, ytext: coordinates of where we want the text to appear using the coordinate system specified in textcoords
xycoords: sets the coordinate system to use to find the data point; it can be set separately for x and y
textcoords: sets the coordinate system to use to place the text
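To see how these parameters fit together, here is a hedged sketch of a call to matplotlib's annotate that labels the highest grade. The data is hypothetical, the variable names simply mirror the parameter descriptions above, and both coordinates of the data point use the plain "data" coordinate system for simplicity.

```python
import pandas as pd

# Hypothetical data: only Bob's 76 appears in the text, the rest is illustrative.
names = ["Bob", "Jessica", "Mary", "John", "Mel"]
grades = [76, 95, 77, 78, 99]
df = pd.DataFrame({"Names": names, "Grades": grades})

ax = df["Grades"].plot()

displayText = "my annotation"
xloc = int(df["Grades"].idxmax())  # x position (row index) of the highest grade
yloc = int(df["Grades"].max())     # y position: the highest grade itself
xtext, ytext = 8, 0                # text offset from the data point

note = ax.annotate(
    displayText,
    xy=(xloc, yloc),              # the data point being annotated
    xytext=(xtext, ytext),        # where the text goes, in textcoords units
    xycoords="data",              # locate the point in data coordinates
    textcoords="offset points",   # place the text as an offset, measured in points
)
```

With textcoords set to "offset points", the text sits 8 points to the right of the data point regardless of the axis scale.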
Finally, we can add an arrow linking the data point annotated to the text annotation (Listing 5-11).
Code to Plot a Customized Graph
All we did was adjust the offset of the text so that there is enough room between the data and the annotation to actually see the arrow. We did this by changing the ytext value from 0 to -150. Then, we added the setting for the arrow.
More information about creating arrows can be found on the documentation page for annotate at http://matplotlib.org/users/annotations_intro.html.
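A sketch of that change, under the same assumptions as before (hypothetical data, illustrative variable names). The arrowprops dictionary with an arrowstyle entry is one way to request an arrow; the original listing may use different arrow settings.

```python
import pandas as pd

# Hypothetical data: only Bob's 76 appears in the text, the rest is illustrative.
names = ["Bob", "Jessica", "Mary", "John", "Mel"]
grades = [76, 95, 77, 78, 99]
df = pd.DataFrame({"Names": names, "Grades": grades})

ax = df["Grades"].plot()

xloc = int(df["Grades"].idxmax())
yloc = int(df["Grades"].max())
xtext, ytext = 8, -150  # ytext moved from 0 to -150 to leave room for the arrow

note = ax.annotate(
    "my annotation",
    xy=(xloc, yloc),
    xytext=(xtext, ytext),
    xycoords="data",
    textcoords="offset points",
    arrowprops=dict(arrowstyle="->"),  # draw an arrow from the text to the point
)
```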
Your Turn
Take the same dataset we used in this example and add an annotation to Bob's 76 that says “Wow!”
Graph a Dataset: Bar Plot
To create a bar plot, input the code in Listing 5-12.
Bar Plotting Your Dataset
If we make the Names column the index, the bars are labeled with the student names instead of the row numbers, which improves the graph. So, first, we need to add the code in Listing 5-13.
Adding Code to Plot Your Dataset
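A minimal sketch of this idea, again with hypothetical grades. It assumes the columns are called Names and Grades and uses set_index to turn the names into the index before plotting.

```python
import pandas as pd

# Hypothetical data: only Bob's 76 appears in the text, the rest is illustrative.
names = ["Bob", "Jessica", "Mary", "John", "Mel"]
grades = [76, 95, 77, 78, 99]
df = pd.DataFrame({"Names": names, "Grades": grades})

# With Names as the index, each bar is labeled with a student's name.
df2 = df.set_index("Names")
ax = df2["Grades"].plot(kind="bar")
```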
Your Turn
Can you change the code to create a bar plot where the status is the label?
Graph a Dataset: Box Plot
To create a box plot, input the code in Listing 5-14.
Box Plotting Your Dataset
Now, we can use a single command to create categorized graphs (in this case, categorized by gender). See Listing 5-15.
Adding Code to Categorize Your Box Plot
Listing 5-16. Categorized Box Plots
And, finally, to adjust the y-axis so that it runs from 0 to 100, we can run the code in Listing 5-17.
Adding Code to Adjust the Y-axis
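The two steps above, categorizing by gender and fixing the y-axis range, could look roughly like this. The grades and genders are made up, the column names Grades and Gender are assumptions, and the axes object is created explicitly so that the y-axis limits can be set on it afterward.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "Grades": [76, 95, 77, 78, 99, 88, 64, 91],
    "Gender": ["male", "female", "female", "male",
               "female", "male", "male", "female"],
})

fig, ax = plt.subplots()

# One box per category of the `by` column.
df.boxplot(column="Grades", by="Gender", ax=ax)

# Make the y-axis run from 0 to 100.
ax.set_ylim(0, 100)
```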
Your Turn
Can you create a box plot of the grades categorized by student status?
Can you create that box plot with a y-axis that runs from 50 to 110?
Graph a Dataset: Histogram
Because of the nature of histograms, we really need more data than is found in the example dataset we have been working with. Enter the code from Listing 5-18 to import the larger dataset.
Importing Dataset from CSV File
To create a simple histogram, we can add the code in Listing 5-19.
Creating a Histogram
Because we did not tell pandas which column to count the values of, it gives you histograms for all the columns with numeric values.
To see a histogram for just hours, we can specify it as in Listing 5-20.
Creating Histogram for Single Column
And to see histograms of hours separated by gender, we can use Listing 5-21.
Categorized Histogram
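The three variations above can be sketched as follows. Since the larger CSV file is not available here, the sketch generates a synthetic dataset with hypothetical hours, grade, and gender columns; the values carry no meaning beyond illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the larger dataset (hypothetical values).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "hours": rng.integers(0, 21, size=n),
    "grade": rng.normal(80, 8, size=n).round(1),
    "gender": rng.choice(["female", "male"], size=n),
})

# No column specified: one histogram per numeric column (hours and grade).
all_axes = df.hist()

# A histogram for just the hours column.
hours_axes = df.hist(column="hours")

# Histograms of hours, one per gender.
by_gender_axes = df.hist(column="hours", by="gender")
```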
Your Turn
Can you create an age histogram categorized by gender?
Graph a Dataset: Pie Chart
To create a pie chart, input the code from Listing 5-22.
Pie Charting Your Dataset
This code creates a dataset of student rule violations. Next, in a new cell, we will create a column to show the total violations or demerits per student (Listing 5-23).
Creating New Column
Finally, to actually create a pie chart of the number of demerits, we can just run the code from Listing 5-24.
Creating Pie Chart of Demerits
But since it is a bit plain (and a bit elongated), let's try the code from Listing 5-25 in a new cell.
Creating a Customized Pie Chart
Line 2: This adds the students' names as labels to the pie pieces.
Line 3: This explodes the pie piece for the fifth student. You can increase or decrease the amount to your liking.
Line 4: This rotates the pie chart to a different starting angle.
Line 5: This formats the numeric labels on the pie pieces.
Line 7: This forces the pie to be circular.
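A hedged sketch of a customized pie chart with these settings is shown below. The line numbers in the notes above refer to the original listing, not to this sketch, so each setting is identified by a comment instead. The violation counts and column names are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rule-violation counts per student.
df = pd.DataFrame({
    "Names": ["Bob", "Jessica", "Mary", "John", "Mel"],
    "Absences": [3, 0, 1, 0, 8],
    "Detentions": [2, 1, 0, 0, 1],
    "Warnings": [2, 1, 5, 1, 2],
})

# New column: total demerits per student.
df["TotalDemerits"] = df["Absences"] + df["Detentions"] + df["Warnings"]

fig, ax = plt.subplots()
ax.pie(
    df["TotalDemerits"],
    labels=list(df["Names"]),     # students' names as labels on the pieces
    explode=(0, 0, 0, 0, 0.15),   # pull out the piece for the fifth student
    startangle=90,                # rotate the chart
    autopct="%1.1f%%",            # format of the numeric labels
)
ax.axis("equal")                  # force the pie to be circular
```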
Your Turn
What if, instead of highlighting the worst student, we put a spotlight on the best one? Let's rotate the chart and change the settings so we are highlighting John instead of Mel.
Graph a Dataset: Scatter Plot
The code in Listing 5-26 will allow us to generate a simple scatter plot.
Creating a Scatter Plot
Line 4: specifies that figures should be shown inline
Line 6: generates a random dataset of 200 values
Line 7: creates a scatter plot using the index of the dataframe as the x and the values of column Col as the y
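As a rough sketch of the same idea, the code below builds a random dataset of 200 values in a column named Col and plots it against the dataframe's index. The seeded generator and the normal distribution are assumptions made here so the result is reproducible; the line numbers in the notes above refer to the original listing, not to this sketch.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random dataset of 200 values (seed and distribution are assumptions).
rng = np.random.default_rng(0)
dataframe = pd.DataFrame({"Col": rng.normal(size=200)})

# Scatter plot: dataframe index on the x-axis, values of Col on the y-axis.
fig, ax = plt.subplots()
ax.scatter(dataframe.index, dataframe["Col"])
```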
Looking at our plot, there doesn't seem to be any pattern to the data. It's random!
Your Turn
Create a scatter plot of the hours and grade data in datasets/gradedata.csv. Do you see a pattern in the data?