The second step of data analysis is cleaning the data. Getting data ready for analytical tools can be a difficult task. Python and its libraries try to make it as easy as possible. In this chapter, we will learn how to:
clean the data;
create new variables; and
organize the data.
Cleaning Data
To clean our data, we will need to:
remove outliers;
remove inappropriate values;
remove duplicates;
remove punctuation;
remove whitespace;
standardize dates; and
standardize text.
Calculating and Removing Outliers
Assume you are collecting data on the people you went to high school with. What if you went to high school with Bill Gates? Now, even though the person with the second-highest net worth is only worth $1.5 million, the average of your entire class is pushed up by the billionaire at the top. Finding the outliers allows you to remove the values that are so high or so low that they skew the overall view of the data.
1. Standard Deviations: If the data is normally distributed, then 95 percent of the data lies within 1.96 standard deviations of the mean. So we can drop the values that fall above or below that range.
2. Interquartile Range (IQR): The IQR is the difference between the 25th percentile (Q1) and the 75th percentile (Q3). Any values that are either lower than Q1 - 1.5 x IQR or greater than Q3 + 1.5 x IQR are treated as outliers and removed.
Let's see what these look like (Listings 3-1 and 3-2).
Method 1: Standard Deviation
Line 6: Here we calculate the upper range equal to 1.96 times the standard deviation plus the mean.
Line 7: Here we calculate the lower range equal to 1.96 times the standard deviation subtracted from the mean.
Line 9: Here we drop the rows where the grade is higher than the toprange.
Line 11: Here we drop the rows where the grade is lower than the botrange.
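The code for Listing 3-1 isn't reproduced here, so the following is a minimal sketch of the standard-deviation method. It builds a small illustrative dataframe in place of the book's dataset (the grade values, including the outlier 150, are made up), and the line numbers cited above refer to the book's original listing rather than to this sketch.

```python
import pandas as pd

# illustrative stand-in for the book's dataset; 150 is the outlier
df = pd.DataFrame({'grade': [52, 64, 55, 68, 77, 81, 48, 150]})

meangrade = df['grade'].mean()
stdgrade = df['grade'].std()
toprange = meangrade + stdgrade * 1.96   # upper cutoff
botrange = meangrade - stdgrade * 1.96   # lower cutoff

copydf = df.copy()
# drop the rows where the grade is above toprange
copydf = copydf.drop(copydf[copydf['grade'] > toprange].index)
# drop the rows where the grade is below botrange
copydf = copydf.drop(copydf[copydf['grade'] < botrange].index)
```

After this runs, the row with the grade of 150 has been removed.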
Method 2: Interquartile Range
Line 9: Here we calculate the upper boundary = the third quartile + 1.5 * the IQR.
Line 10: Here we calculate the lower boundary = the first quartile - 1.5 * the IQR.
Line 13: Here we drop the rows where the grade is higher than the toprange.
Line 14: Here we drop the rows where the grade is lower than the botrange.
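A comparable sketch of the IQR method, again using a small hand-built dataframe in place of the book's dataset:

```python
import pandas as pd

# same made-up data; 150 is the outlier
df = pd.DataFrame({'grade': [52, 64, 55, 68, 77, 81, 48, 150]})

q1 = df['grade'].quantile(0.25)   # first quartile
q3 = df['grade'].quantile(0.75)   # third quartile
iqr = q3 - q1
toprange = q3 + iqr * 1.5         # upper boundary
botrange = q1 - iqr * 1.5         # lower boundary

copydf = df.copy()
# drop the rows outside the boundaries
copydf = copydf.drop(copydf[copydf['grade'] > toprange].index)
copydf = copydf.drop(copydf[copydf['grade'] < botrange].index)
```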
Your Turn
Load the dataset datasets/outlierdata.csv. Can you remove the outliers? Try it with both methods.
Missing Data in Pandas Dataframes
One of the most annoying things about working with large datasets is hunting down missing data. Missing values can make it impossible or unpredictable to compute most aggregate statistics or to generate pivot tables. Looking for missing data points in a 50-row dataset is fairly easy; finding one in a 500,000-row dataset can be much tougher.
Python's pandas library has functions to help you find, delete, or change missing data (Listing 3-3).
Creating Dataframe with Missing Data
The preceding code creates a small dataset that includes rows with missing data. We can use the resulting dataframe to practice dealing with missing data.
To drop all the rows with missing (NaN) data, use the code shown in Listing 3-4.
Drop Rows with Missing Data
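Listings 3-3 and 3-4 aren't shown here; a sketch of the idea, building a tiny dataframe with a missing grade and then dropping incomplete rows, could look like this (the names and grades are invented):

```python
import pandas as pd
import numpy as np

# small stand-in dataset with one missing grade
df = pd.DataFrame({'fname': ['Adam', 'Beth', 'Cara'],
                   'lname': ['Smith', 'Jones', 'Lee'],
                   'grade': [86, np.nan, 77]})

# keep only the rows that contain no NaN values at all
df_no_missing = df.dropna()
```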
To add a column filled with empty values, use the code in Listing 3-5.
Add a Column with Empty Values
To drop any columns that contain nothing but empty values, see Listing 3-6.
Drop Completely Empty Columns
To replace all empty values with zero, see Listing 3-7.
Replace Empty Cells with 0
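One way Listings 3-5 through 3-7 might fit together, sketched on the same kind of invented data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'fname': ['Adam', 'Beth', 'Cara'],
                   'grade': [86, np.nan, 77]})

df['newcol'] = np.nan               # add a column of empty (NaN) values
df = df.dropna(axis=1, how='all')   # drop columns that are entirely empty
df = df.fillna(0)                   # replace the remaining NaNs with zero
```

The all-NaN newcol column is dropped, while the grade column (only partly empty) survives and has its missing value set to 0.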
To fill in missing grades with the mean value of grade, see Listing 3-8.
Replace Empty Cells with Average of Column
Note that inplace=True means the changes are saved to the dataframe right away.
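A sketch of the mean-fill approach in Listing 3-8, done in place on an invented three-row frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'grade': [80.0, np.nan, 100.0]})

# fill missing grades with the column mean, modifying df in place
df.fillna({'grade': df['grade'].mean()}, inplace=True)
```

The mean of 80 and 100 is 90, so the missing grade becomes 90.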
To fill in missing grades with each gender's mean value of grade, see Listing 3-9.
It's Complicated
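Listing 3-9's group-wise fill can be sketched with groupby and transform; the book may organize the code differently, and the data here is invented:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'gender': ['female', 'male', 'female', 'male'],
                   'grade': [90.0, np.nan, 70.0, 60.0]})

# fill each missing grade with the mean grade for that student's gender
df['grade'] = df.groupby('gender')['grade'].transform(
    lambda s: s.fillna(s.mean()))
```

The missing male grade is filled with 60, the mean of the male grades, rather than the overall mean.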
We can also select some rows but ignore the ones with missing data points. To select the rows of df where age is not NaN and gender is not NaN, see Listing 3-10.
Selecting Rows with No Missing Age or Gender
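A minimal sketch of the row selection in Listing 3-10, assuming age and gender columns with some gaps:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [17, np.nan, 16],
                   'gender': ['female', 'male', np.nan]})

# keep only the rows where both age and gender are present
complete = df[df['age'].notnull() & df['gender'].notnull()]
```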
Your Turn
Load the dataset datasets/missinggrade.csv. Your mission, if you choose to accept it, is to delete rows with missing grades and to replace the missing values in hours of exercise by the mean value for that gender.
Filtering Inappropriate Values
Sometimes, if you are working with data you didn't collect yourself, you need to worry about whether the data is accurate. Heck, sometimes you need to worry about that even if you did collect it yourself! It can be difficult to check the veracity of each and every data point, but it is quite easy to check if the data is appropriate.
Python's pandas library has the ability to filter out the bad values (see Listing 3-11).
Creating Dataset
To eliminate all the rows where the grades are too high, see Listing 3-12.
Filtering Out Impossible Grades
To change the out-of-bound values to the maximum or minimum allowed value, we can use the code seen in Listing 3-13.
Changing Impossible Grades
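Listings 3-12 and 3-13 might look roughly like this sketch, which assumes grades must fall between 0 and 100 (the names and grades are invented):

```python
import pandas as pd

df = pd.DataFrame({'names': ['Al', 'Bea', 'Cy'],
                   'grades': [76, 105, -3]})

# filter out impossible grades (Listing 3-12 style)
possible = df[(df['grades'] >= 0) & (df['grades'] <= 100)]

# or clamp out-of-bound values to the allowed range (Listing 3-13 style)
df.loc[df['grades'] > 100, 'grades'] = 100
df.loc[df['grades'] < 0, 'grades'] = 0
```

Filtering discards the bad rows entirely; clamping keeps them with the nearest legal value.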
Your Turn
Using the dataset from this section, can you replace all the subzero grades with a grade of zero?
Finding Duplicate Rows
Another thing you need to worry about if you are using someone else’s data is whether any data is duplicated. (Did the same data get reported twice, or recorded twice, or just copied and pasted?) Heck, sometimes you need to worry about that even if you did collect it yourself! It can be difficult to check the veracity of each and every data point, but it is quite easy to check if the data is duplicated.
Python's pandas library has a function for finding not only duplicated rows, but also the unique rows (Listing 3-14).
Creating Dataset with Duplicates
To indicate the duplicate rows, we can simply run the code seen in Listing 3-15.
Displaying Only Duplicates in the Dataframe
To show the dataset without duplicates, we can run the code seen in Listing 3-16.
Displaying Dataset without Duplicates
You might be asking, “What if the entire row isn't duplicated, but I still know it's a duplicate?” This can happen if someone takes your survey twice or retakes an exam, so the name is the same but the observation is different. In this case, where we know that a duplicate name means a duplicate entry, we can use the code seen in Listing 3-17.
Drop Rows with Duplicate Names, Keeping the Last Observation
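The duplicate-handling calls in Listings 3-14 through 3-17 can be sketched like so, on an invented dataframe where Jan appears twice:

```python
import pandas as pd

df = pd.DataFrame({'names': ['Jan', 'Jan', 'Pat'],
                   'grades': [95, 95, 81]})

dupes = df[df.duplicated()]        # only the fully duplicated rows
deduped = df.drop_duplicates()     # the dataset without exact duplicates

# drop rows that share a name, keeping the last observation for each
by_name = df.drop_duplicates(['names'], keep='last')
```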
Your Turn
Load the dataset datasets/dupedata.csv. We figure people with the same address are duplicates. Can you drop the duplicated rows while keeping the first?
Removing Punctuation from Column Contents
Whether in a phone number or an address, you will often find unwanted punctuation in your data. Let's load some data to see how to address that (Listing 3-18).
Loading Dataframe with Data from CSV File
To remove the unwanted punctuation, we create a function that returns only the characters that aren't punctuation, and then we apply that function to our dataframe (Listing 3-19).
Stripping Punctuation from the Address Column
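A sketch of Listing 3-19's approach, using Python's built-in punctuation set and invented addresses in place of the book's CSV:

```python
import pandas as pd
from string import punctuation

df = pd.DataFrame({'address': ['123 Main St.!', '9 Elm Ave., Apt. #2']})

def strip_punctuation(text):
    # keep every character that is not punctuation
    return ''.join(c for c in text if c not in punctuation)

df['address'] = df['address'].apply(strip_punctuation)
```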
Removing Whitespace from Column Contents
Just as with punctuation, unwanted whitespace often finds its way into your data. Let's load the data again to see how to remove it (Listing 3-20).
Loading Dataframe with Data from CSV File
To remove the whitespace, we create a function that strips it from a string, and then we apply that function to our dataframe (Listing 3-21).
Stripping Whitespace from the Address Column
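Listing 3-21 might reduce to something like this, again on invented addresses:

```python
import pandas as pd

df = pd.DataFrame({'address': ['  123 Main St ', '9 Elm Ave\t']})

def strip_whitespace(text):
    # remove leading and trailing whitespace from a string
    return text.strip()

df['address'] = df['address'].apply(strip_whitespace)
```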
Standardizing Dates
One of the problems with consolidating data from different sources is that different people and different systems can record dates differently. Maybe they use 01/03/1980, maybe 01/03/80, or even 1980/01/03. Even though these all refer to January 3, 1980, analysis tools may not recognize them all as dates if the formats are mixed within the same column (Listing 3-22).
Creating Dataframe with Different Date Formats
Listing 3-23 shows a function that standardizes dates to a single format.
Function to Standardize Dates
Now that we have this function, we can apply it to the birthdates column on our dataframe (Listing 3-24).
Applying Date Standardization to Birthdate Column
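One way to sketch Listings 3-23 and 3-24 together: try each known format in turn and re-emit the date in a single canonical form. The format list and the MM/DD/YYYY target are assumptions here, not necessarily the book's exact choices:

```python
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'birthdate': ['01/03/1980', '01/03/80', '1980/01/03']})

def standardize_date(d):
    # try the two-digit-year format first so '80' is read as 1980,
    # then the four-digit formats; re-emit everything as MM/DD/YYYY
    for fmt in ('%m/%d/%y', '%m/%d/%Y', '%Y/%m/%d'):
        try:
            return datetime.strptime(d, fmt).strftime('%m/%d/%Y')
        except ValueError:
            continue
    return d   # leave unrecognized values alone

df['birthdate'] = df['birthdate'].apply(standardize_date)
```

All three spellings of the same date end up as 01/03/1980.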
Standardizing Text like SSNs, Phone Numbers, and Zip Codes
One of the problems with consolidating data from different sources is that different people and different systems can record certain data like Social Security numbers, phone numbers, and zip codes differently. Maybe they use hyphens in those numbers, and maybe they don't. This section quickly covers how to standardize how these types of data are stored (see Listing 3-25).
Creating Dataframe with SSNs
The code in Listing 3-26 creates a function that standardizes the SSNs and applies it to our ssn column.
Remove Hyphens from SSNs and Add Leading Zeros if Necessary
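Listing 3-26's cleanup might be sketched as: strip the hyphens, then left-pad with zeros to nine digits (the SSNs here are invented):

```python
import pandas as pd

df = pd.DataFrame({'ssn': ['867-53-0909', '302904583', '2904857']})

def clean_ssn(ssn):
    # remove hyphens, then pad on the left with zeros to nine digits
    return ssn.replace('-', '').rjust(9, '0')

df['ssn'] = df['ssn'].apply(clean_ssn)
```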
Creating New Variables
Once the data is free of errors, you need to set up the variables that will directly answer your questions. It's a rare dataset in which every question you need answered is directly addressed by a variable. So, you may need to do a lot of recoding and computing of variables to get exactly the dataset that you need. Common tasks include:
Creating bins (like converting numeric grades to letter grades or ranges of dates into Q1, Q2, etc.)
Creating a column that ranks the values in another column
Creating a column to indicate that another value has reached a threshold (passing or failing, Dean's list, etc.)
Converting string categories to numbers (for regression or correlation)
Binning Data
Sometimes, you will have numeric data that you need to group into bins. (Think: converting numeric grades to letter grades.) In this lesson, we will learn about binning (Listing 3-27).
Loading the Dataset from CSV
Now that the data is loaded, we need to define the bins and group names (Listing 3-28).
Define Bins as 0 to 60, 60 to 70, 70 to 80, 80 to 90, 90 to 100
Notice that there is one more bin value than there are group_names. This is because there needs to be a top and bottom limit for each bin.
Cut Grades
Listing 3-29 categorizes the column grade based on the bins list and labels the values using the group_names list.
And if we want to count the number of observations for each category, we can do that too (Listing 3-30).
Count Number of Observations
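Listings 3-28 through 3-30 fit together roughly like this sketch, with invented grades:

```python
import pandas as pd

df = pd.DataFrame({'grade': [55, 62, 73, 84, 97]})

# five bins need six edges: 0-60, 60-70, 70-80, 80-90, 90-100
bins = [0, 60, 70, 80, 90, 100]
group_names = ['F', 'D', 'C', 'B', 'A']

# categorize each grade into its bin and label it
df['lettergrade'] = pd.cut(df['grade'], bins, labels=group_names)

# count the observations in each category
counts = df['lettergrade'].value_counts()
```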
Your Turn
Recreate the dataframe from this section and create a column classifying the row as pass or fail. This is for a master's program that requires a grade of 80 or above for a student to pass.
Applying Functions to Groups, Bins, and Columns
The number one reason I use Python to analyze data is to handle datasets larger than a million rows. The number two reason is the ease of applying functions to my data.
To see this, first we need to load up some data (Listing 3-31).
Loading a Dataframe from a CSV File
Then, we use binning to divide the data into letter grades (Listing 3-32).
Using Bins
To find the average hours of study by letter grade, we apply our functions to the binned column (Listing 3-33).
Applying Function to Newly Created Bin
Applying a function to a column looks like Listing 3-34.
Applying a Function to a Column
Line 1: Let's get an integer value for each grade in the dataframe.
Applying a function to a group can be seen in Listing 3-35.
Applying a Function to a Group
Line 1: Create a grouping object; that is, an object that represents that particular grouping. In this case, we group grades by gender.
Line 2: Display the mean grade for each gender.
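Listings 3-34 and 3-35 can be sketched like this, with invented data; the grouping is grades by gender, matching the explanation above:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['female', 'male', 'female', 'male'],
                   'grade': [88.5, 71.2, 93.9, 64.0]})

# apply a function to a column: an integer value for each grade
df['grade'] = df['grade'].apply(int)

# apply a function to a group: create the grouping object,
# then display the mean grade for each gender
grouped = df['grade'].groupby(df['gender'])
means = grouped.mean()
```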
Your Turn
Import the datasets/gradedata.csv file and create a new binned column of the 'status' as either passing (> 70) or failing (<=70). Then, compute the mean hours of exercise of the female students with a 'status' of passing.
Ranking Rows of Data
It is relatively easy to find the row with the maximum value or the minimum value, but sometimes you want to find the rows with the 50 highest or the 100 lowest values for a particular column. This is when you need ranking (Listing 3-36).
Load Data from CSV
If we want to find the rows with the lowest grades, we will need to rank all rows in ascending order by grade. Listing 3-37 shows the code to create a new column that is the rank of the value of grade in ascending order.
Create Column with Ranking by Grade
So, if we just wanted to see the students with the 20 lowest grades, we would use the code in Listing 3-38.
Bottom 20 Students
And, to see them in order, we need to use the code in Listing 3-39.
Bottom 20 Students in Order
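Listings 3-37 through 3-39 might be sketched as follows, shrunk to four invented students and a bottom-2 cutoff in place of the bottom 20:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Al', 'Bea', 'Cy', 'Di'],
                   'grade': [91, 64, 78, 55]})

# rank each grade in ascending order (1 = lowest grade)
df['graderanked'] = df['grade'].rank(ascending=True)

# the students with the lowest grades (use 20 on the full dataset)
bottom = df[df['graderanked'] <= 2]

# the same students, displayed in rank order
bottom_sorted = bottom.sort_values('graderanked')
```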
Your Turn
Can you find the 50 students with the most hours of study per week?
Create a Column Based on a Conditional
Sometimes, you need to classify a row of data by the values in one or more columns, such as identifying those students who are passing or failing by whether their grade is above or below 70. In this section, we will learn how to do this (Listing 3-40).
Load Data from CSV
Now, let us say we want a column indicating whether students are failing or not (Listing 3-41).
Create Yes/No isFailing Column
Line 1: We import the numpy library.
Line 2: Create a new column called isFailing where the value is yes if df.grade is less than 70 and no if not.
If instead we needed a column indicating who the male students were with failing scores, we could use the code in Listing 3-42.
Create Yes/No isFailingMale Column
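A sketch covering both Listings 3-41 and 3-42, using numpy's where on invented rows:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'gender': ['male', 'female', 'male'],
                   'grade': [65, 55, 90]})

# yes when the grade is below 70, no otherwise
df['isFailing'] = np.where(df['grade'] < 70, 'yes', 'no')

# yes only for male students with failing grades
df['isFailingMale'] = np.where((df['grade'] < 70) &
                               (df['gender'] == 'male'), 'yes', 'no')
```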
Your Turn
Can you create a column for timemgmt that shows busy if a student exercises more than three hours per week AND studies more than seventeen hours per week?
Making New Columns Using Functions
Much of what I used to use Excel to do (and what I now use Python for) is to create new columns based on an existing one. So, using the following data (Listing 3-43), let's see how we would do this.
Load Data from CSV
To create a single column to contain the full name of each student, we first create a function to create a single string from two strings (Listing 3-44).
Create Function to Generate Full Name
Now, if you test that function, you will see that it works perfectly well concatenating Adam and Smith into Adam Smith. However, we can also use it with column selectors to create a new column using our fname and lname columns (Listing 3-45).
Create Column to Hold the Full Name
This code creates a column called fullname that concatenates the first and last name.
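Listings 3-44 and 3-45 can be sketched together; because pandas broadcasts string concatenation, the same function works on two single strings and on two whole columns:

```python
import pandas as pd

df = pd.DataFrame({'fname': ['Adam', 'Bella'],
                   'lname': ['Smith', 'Chen']})

def singlename(fn, ln):
    # join two strings (or two string columns) with a space between
    return fn + ' ' + ln

df['fullname'] = singlename(df['fname'], df['lname'])
```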
Your Turn
Can you create a column called total time that adds together the hours of study per week and the hours of exercise per week?
Converting String Categories to Numeric Variables
Why do I need to convert string categories to numeric variables? Many analytical tools won't work on text, but if you convert those values to numbers it makes things much simpler (Listing 3-46).
Load Data from CSV
Method 1: Convert single column to hold numeric variables (Listing 3-47).
Function to Convert Gender to Number
Now, apply that function to your column (Listing 3-48).
Apply score_to_numeric Function to Gender
Method 2: Create individual Boolean columns (Listing 3-49).
Create Boolean Columns Based on Gender Column
Join columns to original dataset (Listing 3-50).
Add New Columns to Original Dataframe
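Listings 3-47 through 3-50 might combine like this sketch; the 1-for-female coding is an assumption, and the book may map the values the other way:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['female', 'male', 'female'],
                   'grade': [90, 70, 85]})

# Method 1: one numeric column
def score_to_numeric(x):
    return 1 if x == 'female' else 0

df['gender_val'] = df['gender'].apply(score_to_numeric)

# Method 2: one Boolean column per category
df_gender = pd.get_dummies(df['gender'])

# join the new columns back onto the original dataframe
df_new = df.join(df_gender)
```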
Your Turn
Using datasets/gradesdatawithyear.csv, can you create a numeric column to replace statuses of freshman through senior with the numerals 1 through 4?
Organizing the Data
Both original and newly created variables need to be formatted correctly for two reasons.
First, so our analysis tools work with them correctly. Failing to format a missing value code or a dummy variable correctly will have major consequences for your data analysis.
Second, it's much faster to run the analysis and interpret results if you don't have to keep looking up which variable Q156 is. To that end, we will cover:
Removing columns that aren't needed
Changing column names
Changing column names to lower case
Formatting date variables as dates, and so forth.
Removing and Adding Columns
Sometimes we need to adjust the data. Either something is left out that should have been included or something was left in that should have been removed. So, let's start with the dataset in Listing 3-51.
Creating Starting Dataset
We can drop a column by simply adding the code in Listing 3-52.
Dropping a Column
The axis=1 argument tells drop that we want to drop a column (1) and not a row (0).
We can add a column filled with zeros by setting the new column equal to 0 (Listing 3-53).
Creating a New Column Filled with Zeros
If, however, you want the new column to hold null values instead, you can do that too (Listing 3-54).
Creating a New Column Filled with Null Values
Now, adding a column with values is not that hard. We create a series and set the column equal to the series (Listing 3-55).
Creating a New Column Filled with Values
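Listings 3-52 through 3-55 might look like this sketch; the column names (bs, PhD, HighSchool, SleepHrs) are invented stand-ins:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'names': ['Al', 'Bea'],
                   'grades': [76, 95],
                   'bs': [1, 1]})

df = df.drop('bs', axis=1)    # axis=1 drops a column, axis=0 a row
df['PhD'] = 0                 # new column filled with zeros
df['HighSchool'] = np.nan     # new column filled with null values

# new column filled with specific values, via a series
df['SleepHrs'] = pd.Series([8, 6], index=df.index)
```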
Your Turn
Can you remove the bs, ms, and phd degree columns?
Can you add a Hogwarts Magic Degree column? Everyone but Jessica has one; does that make it harder? No? Then I have to be sure to stump you next time.
Selecting Columns
You will need to make subselections of your data occasionally, especially if your dataset has tons of columns. Here, we learn how to create a dataframe that includes only some of our columns (Listing 3-56).
Load Data from CSV
Now, to select a column of data, we specify the column name (Listing 3-57).
Selecting a Column into a List
But if you run that code, you only get the data in the column (notice the header is missing). That is because this doesn't return a dataframe; it returns a pandas Series. To return a dataframe when selecting a column, we need to specify it (Listing 3-58).
Selecting a Column into a Dataframe
To return multiple columns, we use code like that in Listing 3-59.
Selecting Multiple Columns into a Dataframe
And, of course, if we want to create a dataframe with that subset of columns, we can copy it to another variable (Listing 3-60).
Creating New Dataframe from Your Selection
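Listings 3-57 through 3-60 reduce to bracket notation; note the single- versus double-bracket difference discussed above (the data is invented):

```python
import pandas as pd

df = pd.DataFrame({'fname': ['Adam', 'Bella'],
                   'lname': ['Smith', 'Chen'],
                   'grade': [86, 91]})

names = df['fname']               # single brackets: a Series
names_df = df[['fname']]          # double brackets: a dataframe
subset = df[['fname', 'lname']]   # several columns at once

# copy the selection into its own new dataframe
mailing = df[['fname', 'lname']].copy()
```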
Your Turn
We need to create a mailing list. Can you create a new dataframe by selecting the first name, last name, and address fields?
Change Column Name
Sometimes you need to change the names of your columns. With pandas, it's easy to do. First, you load your data (Listing 3-61).
Load Dataset from CSV
But, when we look at the header, we aren't crazy about the column names, or perhaps the dataset doesn't have any.
It is simple to change the column headers (Listing 3-62).
Change All Headers
Or, if you just wanted to change one or two values, you can load the list of headers (Listing 3-63).
Load List of Headers into a Temp Variable
Once the headers are loaded, you can change a few (Listing 3-64).
Changing Headers
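Listings 3-62 through 3-64 might be sketched as follows, on a frame that starts with meaningless numeric headers:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])   # headers default to 0 and 1

# change all the headers at once
df.columns = ['first', 'second']

# or load the headers into a temp variable, change a few, write them back
headers = list(df.columns.values)
headers[0] = 'premier'
df.columns = headers
```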
Your Turn
Can you change the age column name to years?
Setting Column Names to Lower Case
It may not be the biggest problem in the world, but sometimes I need to convert all the column names to lowercase (or uppercase, for that matter). This lesson will cover how to do that (Listing 3-65).
Load Data from CSV
Once you have the data, there are two quick ways to cast all the column headers to lowercase (Listing 3-66).
Casting All Headers to Lowercase
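The two quick ways mentioned for Listing 3-66 are likely variations on these one-liners:

```python
import pandas as pd

df = pd.DataFrame({'Fname': ['Adam'], 'LName': ['Smith']})

# lowercase every header with a comprehension...
df.columns = [col.lower() for col in df.columns]
# ...or, equivalently: df.columns = map(str.lower, df.columns)
```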
Your Turn
Can you figure out how to make all the column headers all uppercase?
Finding Matching Rows
Of course, you don't always want to compute using the entire dataset. Sometimes you want to work with just a subset of your data. In this lesson, we find out how to do that (Listing 3-67).
Creating Dataset
To find all the rows that contain the word Mel, use the code shown in Listing 3-68 in a new cell.
Filtering Rows
After executing that line of Python, you will see a series of Boolean values: True for the lines that match our query and False for the ones that don't.
We can make our answer shorter by adding .any(). This will display a single True if any line matches and False if none of them do (Listing 3-69).
Check if Any Rows Match
Alternatively, you can add .all(). This will display a single True if all of the lines match and False if at least one of them does not (Listing 3-70).
Check if All Rows Match
We can also use this mask along with the .loc indexer to show just the rows that match certain criteria (Listing 3-71).
Show the Rows that Match
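Listings 3-68 through 3-71 chain together like this sketch (the names are invented):

```python
import pandas as pd

df = pd.DataFrame({'names': ['Mel Smith', 'Melissa', 'Bob']})

mask = df['names'].str.contains('Mel')   # True/False for each row
any_match = mask.any()                   # True if any row matches
all_match = mask.all()                   # True only if every row matches

# show just the rows that match
matches = df.loc[mask]
```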
Your Turn
Can you find all the people who have at least one MS degree in the following data (Listing 3-72)?
Starting Dataset
Filter Rows Based on Conditions
We can also filter rows based on the values they contain. First, load the data (Listing 3-73).
Load Data from CSV
We can show one column of data (Listing 3-74).
One Column
Or we can show two columns of data (Listing 3-75).
Two Columns
Or we can show the first two rows of data (Listing 3-76).
First Two Rows
To show all the rows where the grade is greater than 80, use the code in Listing 3-77.
All Rows with Grade > 80
Using multiple conditions is a little trickier. So, if we wanted to get a list of all the students who scored higher than 99.9 and were male, we would need to use the code shown in Listing 3-78.
All Rows Where Men Scored > 99.9
If instead we wanted all the students who scored higher than 99 OR were female, we would need to use the code in Listing 3-79.
All Rows Where Women or Scored > 99
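The conditions in Listings 3-77 through 3-79 can be sketched on a three-row invented frame; note that each condition goes in its own parentheses, with & for AND and | for OR:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['male', 'female', 'male'],
                   'grade': [99.95, 99.2, 88.0]})

# one condition: all rows with grade > 80
high = df[df['grade'] > 80]

# AND: men who scored higher than 99.9
male_top = df[(df['grade'] > 99.9) & (df['gender'] == 'male')]

# OR: women, or anyone who scored higher than 99
women_or_top = df[(df['grade'] > 99) | (df['gender'] == 'female')]
```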
Your Turn
Can you show all the rows where the student was male, exercised less than two hours per week, and studied more than fifteen hours per week?
Selecting Rows Based on Conditions
First, load the data (Listing 3-80).
Load Data from CSV
Method 1: Using Variables to Hold Attributes
Line 1: We create a variable that is True where gender is female.
Line 2: We create a variable that is True where grade is greater than or equal to 90.
Line 3: This is where we select all cases where both gender is female and grade is greater than or equal to 90.
Method 2: Using Variable Attributes Directly
In Listing 3-82, we select all cases where the first name is not missing and gender is male.
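Both methods might be sketched as follows (the column values are invented, and the mask variable names are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'fname': ['Ann', np.nan, 'Carl'],
                   'gender': ['female', 'male', 'male'],
                   'grade': [95, 88, 91]})

# Method 1: variables that hold Boolean masks
female = df['gender'] == 'female'
a_student = df['grade'] >= 90
top_women = df[female & a_student]

# Method 2: use the attributes directly
named_men = df[df['fname'].notnull() & (df['gender'] == 'male')]
```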
Your Turn
Can you find all the rows where the student had four or more hours of exercise per week, seventeen or more hours of study, and still had a grade that was lower than 80?
Random Sampling Dataframe
This one is simple. Sometimes a dataset is too large to work with in its entirety, and we need to take a representative subset, so let's start with some loaded data (Listing 3-83).
Load Dataset from CSV
To select just 100 rows randomly from that dataset, we can simply run the code shown in Listing 3-84.
Random Sample of 100 Rows from Dataframe
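Listing 3-84 is probably a one-liner with sample; here is a sketch on a generated 1,000-row frame (the random_state argument is an optional addition that makes the draw repeatable):

```python
import pandas as pd

df = pd.DataFrame({'grade': range(1000)})

# draw 100 rows at random, without replacement
sample = df.sample(100, random_state=42)
```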
Your Turn
Can you create a random sample of 500 rows from that dataset?