
3. Preparing Data Is Half the Battle

A. J. Henley, Washington, D.C., District of Columbia, USA
Dave Wolf, Sterling Business Advantage, LLC, Adamstown, Maryland, USA

The second step of data analysis is cleaning the data. Getting data ready for analytical tools can be a difficult task. Python and its libraries try to make it as easy as possible.

With just a few lines of code, you will be able to get your data ready for analysis. You will be able to
  • clean the data;

  • create new variables; and

  • organize the data.

Cleaning Data

To be useful for most analytical tasks, data must be clean. This means it should be consistent, relevant, and standardized. In this chapter, you will learn how to
  • remove outliers;

  • remove inappropriate values;

  • remove duplicates;

  • remove punctuation;

  • remove whitespace;

  • standardize dates; and

  • standardize text.

Calculating and Removing Outliers

Assume you are collecting data on the people you went to high school with. What if you went to high school with Bill Gates? Now, even though the person with the second-highest net worth is only worth $1.5 million, the average of your entire class is pushed up by the billionaire at the top. Finding the outliers allows you to remove the values that are so high or so low that they skew the overall view of the data.
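To see how strongly one extreme value can pull on an average, here is a minimal sketch with made-up net-worth figures (the numbers are ours, not from any dataset in this book):

import pandas as pd
# hypothetical net worths, in millions; the last one is the billionaire
networth = pd.Series([1.5, 0.9, 1.1, 0.7, 90000])
print(networth.mean())    # about 18,000 -- dominated by the outlier
print(networth.median())  # 1.1 -- barely affected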

We cover two main ways of detecting outliers:
  1. Standard Deviations: If the data is normally distributed, then 95 percent of it lies within 1.96 standard deviations of the mean. So we can drop the values either above or below that range.

  2. Interquartile Range (IQR): The IQR is the difference between the 75 percent quantile (Q3) and the 25 percent quantile (Q1). Any values lower than Q1 - 1.5 x IQR or greater than Q3 + 1.5 x IQR are treated as outliers and removed.

Let's see what these look like (Listings 3-1 and 3-2).

import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
meangrade = df['grade'].mean()
stdgrade = df['grade'].std()
toprange = meangrade + stdgrade * 1.96
botrange = meangrade - stdgrade * 1.96
copydf = df.copy()  # work on a copy so df stays intact
copydf = copydf.drop(copydf[copydf['grade']
        > toprange].index)
copydf = copydf.drop(copydf[copydf['grade']
        < botrange].index)
copydf
Listing 3-1

Method 1: Standard Deviation

  • Line 6: Here we calculate the upper range equal to 1.96 times the standard deviation plus the mean.

  • Line 7: Here we calculate the lower range equal to 1.96 times the standard deviation subtracted from the mean.

  • Line 9: Here we drop the rows where the grade is higher than the toprange.

  • Line 11: Here we drop the rows where the grade is lower than the botrange.

import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
q1 = df['grade'].quantile(.25)
q3 = df['grade'].quantile(.75)
iqr = q3-q1
toprange = q3 + iqr * 1.5
botrange = q1 - iqr * 1.5
copydf = df.copy()  # work on a copy so df stays intact
copydf = copydf.drop(copydf[copydf['grade']
        > toprange].index)
copydf = copydf.drop(copydf[copydf['grade']
        < botrange].index)
copydf
Listing 3-2

Method 2: Interquartile Range

  • Line 7: Here we calculate the upper boundary: the third quartile plus 1.5 times the IQR.

  • Line 8: Here we calculate the lower boundary: the first quartile minus 1.5 times the IQR.

  • Line 10: Here we drop the rows where the grade is higher than the toprange.

  • Line 12: Here we drop the rows where the grade is lower than the botrange.

Your Turn

Load the dataset datasets/outlierdata.csv. Can you remove the outliers? Try it with both methods.

Missing Data in Pandas Dataframes

One of the most annoying things about working with large datasets is finding missing data. It can make computing most aggregate statistics or generating pivot tables impossible or unpredictable. Looking for missing data points in a 50-row dataset is fairly easy, but finding one in a 500,000-row dataset can be much tougher.

Python's pandas library has functions to help you find, delete, or change missing data (Listing 3-3).

import pandas as pd
df = pd.read_csv("datasets/gradedatamissing.csv")
df.head()
Listing 3-3

Creating Dataframe with Missing Data

The preceding code loads a legitimate dataset that includes rows with missing data. We can use the resulting dataframe to practice dealing with missing data.
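Before dropping or filling anything, it helps to know how much data is actually missing. A quick check (standard pandas, not one of the numbered listings) counts the NaN values in each column:

# count the missing (NaN) values in each column
df.isnull().sum()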

To drop all the rows with missing (NaN) data, use the code shown in Listing 3-4.

df_no_missing = df.dropna()
df_no_missing
Listing 3-4

Drop Rows with Missing Data

To add a column filled with empty values, use the code in Listing 3-5.

import numpy as np
df['newcol'] = np.nan
df.head()
Listing 3-5

Add a Column with Empty Values

To drop any columns that contain nothing but empty values, see Listing 3-6.

df.dropna(axis=1, how="all")
Listing 3-6

Drop Completely Empty Columns

To replace all empty values with zero, see Listing 3-7.

df.fillna(0)
Listing 3-7

Replace Empty Cells with 0

To fill in missing grades with the mean value of grade, see Listing 3-8.

df["grade"].fillna(df["grade"].mean(), inplace=True)
Listing 3-8

Replace Empty Cells with Average of Column

Note, inplace=True means that the changes are saved to the dataframe right away.
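For example, the two statements below should produce the same result; the first writes the result back explicitly, while the second modifies the dataframe in place. (Recent versions of pandas prefer the explicit assignment form, so treat this as a sketch.)

# explicit assignment...
df["grade"] = df["grade"].fillna(df["grade"].mean())
# ...or in-place modification
df["grade"].fillna(df["grade"].mean(), inplace=True)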

To fill in missing grades with each gender's mean value of grade, see Listing 3-9.

df["grade"].fillna(df.groupby("gender")
     ["grade"].transform("mean"), inplace=True)
Listing 3-9

It's Complicated

We can also select some rows but ignore the ones with missing data points. To select the rows of df where age is not NaN and gender is not NaN, see Listing 3-10.

df[df['age'].notnull() & df['gender'].notnull()]
Listing 3-10

Selecting Rows with No Missing Age or Gender

Your Turn

Load the dataset datasets/missinggrade.csv. Your mission, if you choose to accept it, is to delete rows with missing grades and to replace the missing values in hours of exercise by the mean value for that gender.

Filtering Inappropriate Values

Sometimes, if you are working with data you didn't collect yourself, you need to worry about whether the data is accurate. Heck, sometimes you need to worry about that even if you did collect it yourself! It can be difficult to check the veracity of each and every data point, but it is quite easy to check if the data is appropriate.

Python's pandas library has the ability to filter out the bad values (see Listing 3-11).

import pandas as pd
names = ['Bob','Jessica','Mary','John','Mel']
grades = [76,-2,77,78,101]
GradeList = zip(names,grades)
df = pd.DataFrame(data = GradeList,
    columns=['Names', 'Grades'])
df
Listing 3-11

Creating Dataset

To eliminate all the rows where the grades are too high, see Listing 3-12.

df.loc[df['Grades'] <= 100]
Listing 3-12

Filtering Out Impossible Grades

To change the out-of-bound values to the maximum or minimum allowed value, we can use the code seen in Listing 3-13.

df.loc[df['Grades'] > 100, 'Grades'] = 100
Listing 3-13

Changing Impossible Grades

Your Turn

Using the dataset from this section, can you replace all the subzero grades with a grade of zero?

Finding Duplicate Rows

Another thing you need to worry about if you are using someone else’s data is whether any data is duplicated. (Did the same data get reported twice, or recorded twice, or just copied and pasted?) Heck, sometimes you need to worry about that even if you did collect it yourself! It can be difficult to check the veracity of each and every data point, but it is quite easy to check if the data is duplicated.

Python's pandas library has a function for finding not only duplicated rows, but also the unique rows (Listing 3-14).

import pandas as pd
names = ['Jan','John','Bob','Jan','Mary','Jon','Mel','Mel']
grades = [95,78,76,95,77,78,99,100]
GradeList = zip(names,grades)
df = pd.DataFrame(data = GradeList,
        columns=['Names', 'Grades'])
df
Listing 3-14

Creating Dataset with Duplicates

To indicate the duplicate rows, we can simply run the code seen in Listing 3-15.

df.duplicated()
Listing 3-15

Displaying Only Duplicates in the Dataframe

To show the dataset without duplicates, we can run the code seen in Listing 3-16.

df.drop_duplicates()
Listing 3-16

Displaying Dataset without Duplicates

You might be asking, “What if the entire row isn't duplicated, but I still know it's a duplicate?” This can happen if someone takes your survey twice or retakes an exam, so the name is the same but the observation is different. In this case, where we know that a duplicate name means a duplicate entry, we can use the code seen in Listing 3-17.

df.drop_duplicates(['Names'], keep="last")
Listing 3-17

Drop Rows with Duplicate Names, Keeping the Last Observation

Your Turn

Load the dataset datasets/dupedata.csv. We figure people with the same address are duplicates. Can you drop the duplicated rows while keeping the first?

Removing Punctuation from Column Contents

Whether in a phone number or an address, you will often find unwanted punctuation in your data. Let's load some data to see how to address that (Listing 3-18).

import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
Listing 3-18

Loading Dataframe with Data from CSV File

To remove the unwanted punctuation, we create a function that returns all characters that aren't punctuation, and then we apply that function to our dataframe (Listing 3-19).

import string
exclude = set(string.punctuation)
def remove_punctuation(x):
    try:
        x = ''.join(ch for ch in x if ch not in exclude)
    except:
        # non-string values (like NaN) pass through unchanged
        pass
    return x
df.address = df.address.apply(remove_punctuation)
df
Listing 3-19

Stripping Punctuation from the Address Column
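If your pandas version is recent enough, the same cleanup can be done with the vectorized string methods instead of apply. This one-liner is our alternative sketch, not the book's listing; the regular expression removes anything that is not a word character or whitespace (so, unlike the version above, underscores survive):

# vectorized alternative to the apply-based cleanup
df.address = df.address.str.replace(r'[^\w\s]', '', regex=True)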

Removing Whitespace from Column Contents

Whitespace can be just as troublesome as punctuation, so let's reload the data and strip it out too (Listing 3-20).

import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
Listing 3-20

Loading Dataframe with Data from CSV File

To remove the whitespace, we create a function that rejoins each string without its whitespace characters, and then we apply that function to our dataframe (Listing 3-21).

def remove_whitespace(x):
    try:
        # split() breaks on any whitespace; joining on '' removes it all
        x = ''.join(x.split())
    except:
        pass
    return x
df.address = df.address.apply(remove_whitespace)
df
Listing 3-21

Stripping Whitespace from the Address Column

Standardizing Dates

One of the problems with consolidating data from different sources is that different people and different systems can record dates differently. Maybe they use 01/03/1980, or 01/03/80, or even 1980/01/03. Even though these all refer to January 3, 1980, analysis tools may not recognize them all as dates if the formats are mixed within the same column (Listing 3-22).

import pandas as pd
names = ['Bob','Jessica','Mary','John','Mel']
grades = [76,95,77,78,99]
bsdegrees = [1,1,0,0,1]
msdegrees = [2,1,0,0,0]
phddegrees = [0,1,0,0,0]
bdates = ['1/1/1945','10/21/76','3/3/90',
        '04/30/1901','1963-09-01']
GradeList = zip(names,grades,bsdegrees,msdegrees,
        phddegrees,bdates)
columns=['Names','Grades','BS','MS','PhD',"bdates"]
df = pd.DataFrame(data = GradeList, columns=columns)
df
Listing 3-22

Creating Dataframe with Different Date Formats

Listing 3-23 shows a function that standardizes dates to single format.

from datetime import datetime
def standardize_date(thedate):
    formatted_date = ""
    thedate = str(thedate)
    # flag missing values
    if not thedate or thedate.lower() == "missing" or thedate == "nan":
        formatted_date = "MISSING"
    # flag masked values, like dates containing 'x' placeholders
    if thedate.lower().find('x') != -1:
        formatted_date = "Incomplete"
    # treat a leading "00" as shorthand for "19"
    if thedate[0:2] == "00":
        formatted_date = thedate.replace("00", "19", 1)
    try:
        formatted_date = str(datetime.strptime(
            thedate, '%m/%d/%y').strftime('%m/%d/%y'))
    except:
        pass
    try:
        formatted_date = str(datetime.strptime(
            thedate, '%m/%d/%Y').strftime('%m/%d/%y'))
    except:
        pass
    try:
        if int(thedate[0:4]) < 1900:
            formatted_date = "Incomplete"
        else:
            formatted_date = str(datetime.strptime(
                thedate, '%Y-%m-%d').strftime('%m/%d/%y'))
    except:
        pass
    return formatted_date
Listing 3-23

Function to Standardize Dates

Now that we have this function, we can apply it to the birthdates column on our dataframe (Listing 3-24).

df.bdates = df.bdates.apply(standardize_date)
df
Listing 3-24

Applying Date Standardization to Birthdate Column
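Depending on your pandas version, you can also let pandas do the parsing itself. Applied to the raw bdates column (before running standardize_date), pd.to_datetime tries to infer each value's format, and errors='coerce' turns anything unparseable into NaT instead of raising an error. A sketch, with bdates_parsed as our own column name:

# let pandas infer the formats; pandas 2.x may need format='mixed'
df['bdates_parsed'] = pd.to_datetime(df['bdates'], errors='coerce')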

Standardizing Text like SSNs, Phone Numbers, and Zip Codes

One of the problems with consolidating data from different sources is that different people and different systems can record certain data like Social Security numbers, phone numbers, and zip codes differently. Maybe they use hyphens in those numbers, and maybe they don't. This section quickly covers how to standardize how these types of data are stored (see Listing 3-25).

import pandas as pd
names = ['Bob','Jessica','Mary','John','Mel']
grades = [76,95,77,78,99]
bsdegrees = [1,1,0,0,1]
msdegrees = [2,1,0,0,0]
phddegrees = [0,1,0,0,0]
ssns = ['867-53-0909','333-22-4444','123-12-1234',
        '777-93-9311','123-12-1423']
GradeList = zip(names,grades,bsdegrees,msdegrees,
        phddegrees,ssns)
columns=['Names','Grades','BS','MS','PhD',"ssn"]
df = pd.DataFrame(data = GradeList, columns=columns)
df
Listing 3-25

Creating Dataframe with SSNs

The code in Listing 3-26 creates a function that standardizes the SSNs and applies it to our ssn column.

def right(s, amount):
    # return the rightmost `amount` characters of s
    return s[-amount:]
def standardize_ssn(ssn):
    try:
        ssn = ssn.replace("-","")
        ssn = "".join(ssn.split())
        if len(ssn)<9 and ssn != 'Missing':
            ssn="000000000" + ssn
            ssn=right(ssn,9)
    except:
        pass
    return ssn
df.ssn = df.ssn.apply(standardize_ssn)
df
Listing 3-26

Remove Hyphens from SSNs and Add Leading Zeros if Necessary

Creating New Variables

Once the data is free of errors, you need to set up the variables that will directly answer your questions. It's a rare dataset in which every question you need answered is directly addressed by a variable. So, you may need to do a lot of recoding and computing of variables to get exactly the dataset that you need.

Examples include the following:
  • Creating bins (like converting numeric grades to letter grades or ranges of dates into Q1, Q2, etc.)

  • Creating a column that ranks the values in another column

  • Creating a column to indicate that another value has reached a threshold (passing or failing, Dean's list, etc.)

  • Converting string categories to numbers (for regression or correlation)

Binning Data

Sometimes, you will have discrete data that you need to group into bins. (Think: converting numeric grades to letter grades.) In this lesson, we will learn about binning (Listing 3-27).

import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
Listing 3-27

Loading the Dataset from CSV

Now that the data is loaded, we need to define the bins and group names (Listing 3-28).

# Create the bin dividers
bins = [0, 60, 70, 80, 90, 100]
# Create names for the five groups
group_names = ['F', 'D', 'C', 'B', 'A']
Listing 3-28

Define Bins as 0 to 60, 60 to 70, 70 to 80, 80 to 90, 90 to 100

Notice that there is one more bin value than there are group_names. This is because there needs to be a top and bottom limit for each bin.

df['lettergrade'] = pd.cut(df['grade'], bins,
        labels=group_names)
df
Listing 3-29

Cut Grades

Listing 3-29 categorizes the column grade based on the bins list and labels the values using the group_names list.
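Note that pd.cut's intervals include their upper edge by default, so the bins here are (0, 60], (60, 70], and so on: a grade of exactly 80 becomes a C, not a B. A quick check with hypothetical grades:

# right-inclusive bins: 60 -> F, 80 -> C, 90 -> B, 100 -> A
pd.cut([60, 80, 90, 100], bins, labels=group_names)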

And if we want to count the number of observations for each category, we can do that too (Listing 3-30).

pd.value_counts(df['lettergrade'])
Listing 3-30

Count Number of Observations

Your Turn

Recreate the dataframe from this section and create a column classifying the row as pass or fail. This is for a master's program that requires a grade of 80 or above for a student to pass.

Applying Functions to Groups, Bins, and Columns

The number one reason I use Python to analyze data is to handle datasets larger than a million rows. The number two reason is the ease of applying functions to my data.

To see this, first we need to load up some data (Listing 3-31).

import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
Listing 3-31

Loading a Dataframe from a CSV File

Then, we use binning to divide the data into letter grades (Listing 3-32).

# Create the bin dividers
bins = [0, 60, 70, 80, 90, 100]
# Create names for the five groups
group_names = ['F', 'D', 'C', 'B', 'A']
df['letterGrades'] = pd.cut(df['grade'],
        bins, labels=group_names)
df.head()
Listing 3-32

Using Bins

To find the average hours of study by letter grade, we apply our functions to the binned column (Listing 3-33).

df.groupby('letterGrades')['hours'].mean()
Listing 3-33

Applying Function to Newly Created Bin

Applying a function to a column looks like Listing 3-34.

# Applying the integer function to the grade column
df['grade'] = df['grade'].apply(lambda x: int(x))
df.head()
Listing 3-34

Applying a Function to a Column

  • Line 2: Here we convert each grade in the dataframe to an integer value.

Applying a function to a group can be seen in Listing 3-35.

gender_preScore = df['grade'].groupby(df['gender'])
gender_preScore.mean()
Listing 3-35

Applying a Function to a Group

  • Line 1: Create a grouping object. In other words, create an object that represents that particular grouping. In this case, we group grades by gender.

  • Line 2: Display the mean grade for each gender.

Your Turn

Import the datasets/gradedata.csv file and create a new binned column of the 'status' as either passing (> 70) or failing (<=70). Then, compute the mean hours of exercise of the female students with a 'status' of passing.

Ranking Rows of Data

It is relatively easy to find the row with the maximum value or the minimum value, but sometimes you want to find the rows with the 50 highest or the 100 lowest values for a particular column. This is when you need ranking (Listing 3-36).

import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
Listing 3-36

Load Data from CSV

If we want to find the rows with the lowest grades, we will need to rank all rows in ascending order by grade. Listing 3-37 shows the code to create a new column that is the rank of the value of grade in ascending order.

df['graderanked'] = df['grade'].rank(ascending=1)
df.tail()
Listing 3-37

Create Column with Ranking by Grade

So, if we just wanted to see the students with the 20 lowest grades, we would use the code in Listing 3-38.

df[df['graderanked'] < 21]
Listing 3-38

Bottom 20 Students

And, to see them in order, we need to use the code in Listing 3-39.

df[df['graderanked'] < 6].sort_values('graderanked')
Listing 3-39

Bottom 5 Students in Order

Your Turn

Can you find the 50 students with the most hours of study per week?

Create a Column Based on a Conditional

Sometimes, you need to classify a row of data by the values in one or more columns, such as identifying those students who are passing or failing by whether their grade is above or below 70. In this section, we will learn how to do this (Listing 3-40).

import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
Listing 3-40

Load Data from CSV

Now, let us say we want a column indicating whether students are failing or not (Listing 3-41).

import numpy as np
df['isFailing'] = np.where(df['grade']<70,
'yes', 'no')
df.tail(10)
Listing 3-41

Create Yes/No isFailing Column

  • Line 1: We import the numpy library.

  • Line 2: Create a new column called isFailing, where the value is yes if df.grade is less than 70 and no if not.

If instead we needed a column indicating who the male students were with failing scores, we could use the code in Listing 3-42.

df['isFailingMale'] = np.where((df['grade']<70)
        & (df['gender'] == 'male'),'yes', 'no')
df.tail(10)
Listing 3-42

Create Yes/No isFailingMale Column

Your Turn

Can you create a column for timemgmt that shows busy if a student exercises more than three hours per week AND studies more than seventeen hours per week?

Making New Columns Using Functions

Much of what I used to use Excel to do (and what I now use Python for) is to create new columns based on an existing one. So, using the following data (Listing 3-43), let's see how we would do this.

import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
Listing 3-43

Load Data from CSV

To create a single column to contain the full name of each student, we first create a function to create a single string from two strings (Listing 3-44).

def singlename(fn, ln):
    return fn + " " + ln
Listing 3-44

Create Function to Generate Full Name

Now, if you test that function, you will see that it works perfectly well concatenating Adam and Smith into Adam Smith. However, we can also use it with column selectors to create a new column using our fname and lname columns (Listing 3-45).

df['fullname'] = singlename(df['fname'],df['lname'])
Listing 3-45

Create Column to Hold the Full Name

This code creates a column called fullname that concatenates the first and last name.
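Because pandas overloads + for Series of strings, you can also skip the helper function entirely; this one-line equivalent is a common idiom:

# vectorized string concatenation, equivalent to the function above
df['fullname'] = df['fname'] + " " + df['lname']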

Your Turn

Can you create a column called total time that adds together the hours of study per week and the hours of exercise per week?

Converting String Categories to Numeric Variables

Why do I need to convert string categories to numeric variables? Many analytical tools won't work on text, but if you convert those values to numbers it makes things much simpler (Listing 3-46).

import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
Listing 3-46

Load Data from CSV

Method 1: Convert single column to hold numeric variables (Listing 3-47).

def score_to_numeric(x):
    if x=='female':
        return 1
    if x=='male':
        return 0
Listing 3-47

Function to Convert Gender to Number

Now, run that method on your column (Listing 3-48).

df['gender_val'] = df['gender'].apply(score_to_numeric)
df.tail()
Listing 3-48

Apply score_to_numeric Function to Gender
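A common shorthand for this kind of recoding is the Series map method with a dictionary. This sketch should be equivalent to Listing 3-48:

# dictionary-based recoding; values missing from the dict become NaN
df['gender_val'] = df['gender'].map({'female': 1, 'male': 0})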

Method 2: Create individual Boolean columns (Listing 3-49).

df_gender = pd.get_dummies(df['gender'])
df_gender.tail()
Listing 3-49

Create Boolean Columns Based on Gender Column

Join columns to original dataset (Listing 3-50).

# Join the dummy variables to the main dataframe
df_new = pd.concat([df, df_gender], axis=1)
df_new.tail()
# or
# Alternative for joining the new columns
df_new = df.join(df_gender)
df_new.tail()
Listing 3-50

Add New Columns to Original Dataframe

Your Turn

Using datasets/gradesdatawithyear.csv, can you create a numeric column to replace statuses of freshman through senior with the numerals 1 through 4?

Organizing the Data

Both original and newly created variables need to be formatted correctly for two reasons.

First, so our analysis tools work with them correctly. Failing to format a missing value code or a dummy variable correctly will have major consequences for your data analysis.

Second, it's much faster to run the analysis and interpret results if you don't have to keep looking up which variable Q156 is.

Examples include the following:
  • Removing columns that aren't needed

  • Changing column names

  • Changing column names to lower case

  • Formatting date variables as dates, and so forth.

Removing and Adding Columns

Sometimes we need to adjust the data. Either something is left out that should have been included or something was left in that should have been removed. So, let's start with the dataset in Listing 3-51.

import pandas as pd
names = ['Bob','Jessica','Mary','John','Mel']
grades = [76,95,77,78,99]
bsdegrees = [1,1,0,0,1]
msdegrees = [2,1,0,0,0]
phddegrees = [0,1,0,0,0]
GradeList = zip(names,grades,bsdegrees,msdegrees,
        phddegrees)
columns=['Names','Grades','BS','MS','PhD']
df = pd.DataFrame(data = GradeList, columns=columns)
df
Listing 3-51

Creating Starting Dataset

We can drop a column by simply adding the code in Listing 3-52.

df.drop('PhD', axis=1)
Listing 3-52

Dropping a Column

The axis=1 tells drop that we want to drop a column (1) and not a row (0).
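Keep in mind that drop returns a new dataframe and leaves df itself untouched. To keep the change, assign the result to a variable (df_dropped is our own name here) or pass inplace=True:

# keep the result of the drop
df_dropped = df.drop('PhD', axis=1)
# or modify df directly: df.drop('PhD', axis=1, inplace=True)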

We can add a column filled with zeros by setting the new column equal to 0 (Listing 3-53).

df['HighSchool']=0
Listing 3-53

Creating a New Column Filled with Zeros

If, however, you want to set the new column to null values, you can do that too (Listing 3-54).

import numpy as np
df['PreSchool'] = np.nan
Listing 3-54

Creating a New Column Filled with Null Values

Now, adding a column with values is not that hard. We create a series and set the column equal to the series (Listing 3-55).

d = [0, 1, 0, 1, 0]
s = pd.Series(d, index= df.index)
df['DriversLicense'] = s
df
Listing 3-55

Creating a New Column Filled with Values

Your Turn

Can you remove the bs, ms, and phd degree columns?

Can you add a Hogwarts Magic Degree column? Everyone but Jessica has one; does that make it harder? No? Then I have to be sure to stump you next time.

Selecting Columns

You will need to make subselections of your data occasionally, especially if your dataset has tons of columns. Here, we learn how to create a dataframe that includes only some of our columns (Listing 3-56).

import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
Listing 3-56

Load Data from CSV

Now, to select a column of data, we specify the column name (Listing 3-57).

df['fname']
Listing 3-57

Selecting a Column as a Series

But if you run that code you only get the data in the column (notice the header is missing). That is because this doesn't return a dataframe; it returns a pandas Series. To return a dataframe when selecting a column, we need to use double brackets (Listing 3-58).

df[['fname']]
Listing 3-58

Selecting a Column into a Dataframe
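If you want to confirm the difference for yourself, check the types (plain Python, not one of the numbered listings):

type(df['fname'])    # pandas.core.series.Series
type(df[['fname']])  # pandas.core.frame.DataFrame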

To return multiple columns, we use code like that in Listing 3-59.

df[['fname','age','grade']]
Listing 3-59

Selecting Multiple Columns into a Dataframe

And, of course, if we want to create a dataframe with that subset of columns, we can copy it to another variable (Listing 3-60).

df2 = df[['fname','age','grade']]
df2.head()
Listing 3-60

Creating New Dataframe from Your Selection

Your Turn

We need to create a mailing list. Can you create a new dataframe by selecting the first name, last name, and address fields?

Change Column Name

Sometimes you need to change the names of your columns. With pandas, it's easy to do. First, you load your data (Listing 3-61).

import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
Listing 3-61

Load Dataset from CSV

But when we look at the header, we aren't crazy about the column names, or maybe the file doesn't have any headers at all.

It is simple to change the column headers (Listing 3-62).

df.columns = ['FirstName', 'LastName', 'Gender',
        'Age', 'HoursExercisePerWeek',
        'HoursStudyPerWeek', 'LetterGrade',
        'StreetAddress']
df.head()
Listing 3-62

Change All Headers

Or, if you just wanted to change one or two values, you can load the list of headers (Listing 3-63).

headers = list(df.columns.values)
Listing 3-63

Load List of Headers into a Temp Variable

Once the headers are loaded, you can change a few (Listing 3-64).

headers[0] = 'FName'
headers[1] = 'LName'
df.columns = headers
df.head()
Listing 3-64

Changing Headers

Your Turn

Can you change the age column name to years?

Setting Column Names to Lower Case

It may not be the biggest problem in the world, but sometimes I need to convert all the column names to lowercase (or uppercase, for that matter). This lesson will cover how to do that (Listing 3-65).

import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
Listing 3-65

Load Data from CSV

Once you have the data, there are two quick ways to cast all the column headers to lowercase (Listing 3-66).

# method 1
df.columns = map(str.lower, df.columns)
# method 2
df.columns = [x.lower() for x in df.columns]
Listing 3-66

Casting All Headers to Lowercase

Your Turn

Can you figure out how to make all the column headers all uppercase?

Finding Matching Rows

Of course, you don't always want to compute using the entire dataset. Sometimes you want to work with just a subset of your data. In this lesson, we find out how to do that (Listing 3-67).

import pandas as pd
names = ['Bob','Jessica','Mary','John','Mel']
grades = [76,95,77,78,99]
GradeList = zip(names,grades)
df = pd.DataFrame(data = GradeList,
        columns=['Names', 'Grades'])
df
Listing 3-67

Creating Dataset

To find all the rows that contain the word Mel, use the code shown in Listing 3-68 in a new cell.

df['Names'].str.contains('Mel')
Listing 3-68

Filtering Rows

After executing that line of Python, you will see a list of Boolean values: True for the lines that match our query and False for the ones that don't.

We can make our answer shorter by adding .any(). This will display a single True if any line matches and False if none of them do (Listing 3-69).

# check if any row matches
df['Names'].str.contains('Mel').any()
Listing 3-69

Check if Any Rows Match

Alternatively, you can add .all(). This will display a single True if all of the lines match and False if at least one of them does not (Listing 3-70).

# check if all rows match
df['Names'].str.contains('Mel').all()
Listing 3-70

Check if All Rows Match

We can also use this along with the .loc indexer to show just the rows that match certain criteria (Listing 3-71).

# Find the rows that match a criteria like this
df.loc[df['Names'].str.contains('Mel')==True]
# or even like this...
df.loc[df['Grades']==0]
Listing 3-71

Show the Rows that Match

Your Turn

Can you find all the people who have at least one MS degree in the following data (Listing 3-72)?

import pandas as pd
names = ['Bob','Jessi','Mary','John','Mel','Sam',
        'Cathy','Hank','Lloyd']
grades = [76,95,77,78,99,84,79,100,73]
bsdegrees = [1,1,0,0,1,1,1,0,1]
msdegrees = [2,1,0,0,0,1,1,0,0]
phddegrees = [0,1,0,0,0,2,1,0,0]
GradeList = zip(names,grades,bsdegrees,msdegrees,
        phddegrees)
df = pd.DataFrame(data = GradeList, columns=['Name','Grade','BS','MS','PhD'])
df
Listing 3-72

Starting Dataset

Filter Rows Based on Conditions

First, load the data we will be filtering (Listing 3-73).

import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
Listing 3-73

Load Data from CSV

We can show one column of data (Listing 3-74).

df['grade'].head()
Listing 3-74

One Column

Or we can show two columns of data (Listing 3-75).

df[['age','grade']].head()
Listing 3-75

Two Columns

Or we can show the first two rows of data (Listing 3-76).

df[:2]
Listing 3-76

First Two Rows

To show all the rows where the grade is greater than 80, use the code in Listing 3-77.

df[df['grade'] > 80]
Listing 3-77

All Rows with Grade > 80

Using multiple conditions is a little trickier. So, if we wanted to get a list of all the students who scored higher than 99.9 and were male, we would need to use the code shown in Listing 3-78.

df.loc[(df['grade'] > 99.9) &
    (df['gender'] == 'male')]
Listing 3-78

All Rows Where Men Scored > 99.9

If instead we wanted all the students who scored higher than 99 OR were female, we would need to use the code in Listing 3-79.

df.loc[(df['grade'] > 99) | (df['gender'] == 'female')]
Listing 3-79

All Rows Where Women or Scored > 99

Your Turn

Can you show all the rows where the student was male, exercised less than two hours per week, and studied more than fifteen hours per week?

Selecting Rows Based on Conditions

Again, start by loading the data (Listing 3-80).

import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
Listing 3-80

Load Data from CSV

female = df['gender'] == "female"
a_student = df['grade'] >= 90
df[female & a_student].head()
Listing 3-81

Method 1: Using Variables to Hold Attributes

  • Line 1: We create a variable that is True for each row where gender is female.

  • Line 2: We create a variable that is True for each row where grade is greater than or equal to 90.

  • Line 3: This is where we select all cases where both conditions are True: gender is female and grade is greater than or equal to 90.

df[df['fname'].notnull() & (df['gender'] == "male")]
Listing 3-82

Method 2: Using Variable Attributes Directly

In Listing 3-82, we select all cases where the first name is not missing and gender is male.

Your Turn

Can you find all the rows where the student had four or more hours of exercise per week, seventeen or more hours of study, and still had a grade that was lower than 80?

Random Sampling Dataframe

This one is simple. Sometimes a dataset is too large to work with in full, and we need to take a subset, so let's start with some loaded data (Listing 3-83).

import pandas as pd
import numpy as np
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.tail()
Listing 3-83

Load Dataset from CSV

To select just 100 rows randomly from that dataset, we can simply run the code shown in Listing 3-84.

df.take(np.random.permutation(len(df))[:100])
Listing 3-84

Random Sample of 100 Rows from Dataframe
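Recent versions of pandas (0.16.1 and later) also have a built-in sampler, which is usually the simpler choice; a sketch:

# built-in alternative; random_state makes the sample reproducible
df.sample(n=100, random_state=42)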

Your Turn

Can you create a random sample of 500 rows from that dataset?
