This chapter covers many fundamental operations of the DataFrame. Many of the recipes will be similar to those in Chapter 1, Pandas Foundations, which primarily covered operations on a Series.
We can select a single column by passing the column name to the index operator of a DataFrame. This was covered in the Selecting a column recipe in Chapter 1, Pandas Foundations. It is often necessary to focus on a subset of the current working dataset, which is accomplished by selecting multiple columns.
In this recipe, all the actor and director columns will be selected from the movie dataset.
>>> import pandas as pd
>>> import numpy as np
>>> movies = pd.read_csv("data/movie.csv")
>>> movie_actor_director = movies[
... [
... "actor_1_name",
... "actor_2_name",
... "actor_3_name",
... "director_name",
... ]
... ]
>>> movie_actor_director.head()
actor_1_name actor_2_name actor_3_name director_name
0 CCH Pounder Joel Dav... Wes Studi James Ca...
1 Johnny Depp Orlando ... Jack Dav... Gore Ver...
2 Christop... Rory Kin... Stephani... Sam Mendes
3 Tom Hardy Christia... Joseph G... Christop...
4 Doug Walker Rob Walker NaN Doug Walker
Use the built-in type function to inspect the difference between passing a one-item list and a string:
>>> type(movies[["director_name"]])
<class 'pandas.core.frame.DataFrame'>
>>> type(movies["director_name"])
<class 'pandas.core.series.Series'>
We can also use the .loc attribute to pull out a column by name. Because this index operation requires that we pass in a row selector first, we use a colon (:) to indicate a slice that selects all of the rows. This can also return either a DataFrame or a Series:
>>> type(movies.loc[:, ["director_name"]])
<class 'pandas.core.frame.DataFrame'>
>>> type(movies.loc[:, "director_name"])
<class 'pandas.core.series.Series'>
The DataFrame index operator is very flexible and capable of accepting a number of different objects. If a string is passed, it will return a single-dimensional Series. If a list is passed to the indexing operator, it returns a DataFrame of all the columns in the list in the specified order.
Step 2 shows how to select a single column as a DataFrame and as a Series. Usually, a single column is selected with a string, resulting in a Series. When a DataFrame is desired, put the column name in a single-element list.
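To make the distinction concrete, here is a minimal sketch on a small, hypothetical DataFrame (not the movie data): a string returns a Series, a one-item list returns a DataFrame, and a longer list returns the columns in the order given.

```python
import pandas as pd

# Hypothetical three-column DataFrame for illustration
df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

print(type(df["a"]))                    # a string gives a Series
print(type(df[["a"]]))                  # a one-item list gives a DataFrame
print(df[["c", "a"]].columns.tolist())  # list order is preserved: ['c', 'a']
```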
Step 3 shows how to use the .loc attribute to pull out a Series or a DataFrame.
Passing a long list inside the indexing operator might cause readability issues. To help with this, you may save all your column names to a list variable first. The following code achieves the same result as step 1:
>>> cols = [
... "actor_1_name",
... "actor_2_name",
... "actor_3_name",
... "director_name",
... ]
>>> movie_actor_director = movies[cols]
One of the most common exceptions raised when working with pandas is KeyError. This error is mainly due to mistyping a column or index name. The same error is raised whenever a multiple-column selection is attempted without the use of a list:
>>> movies[
... "actor_1_name",
... "actor_2_name",
... "actor_3_name",
... "director_name",
... ]
Traceback (most recent call last):
...
KeyError: ('actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name')
Although column selection is usually done with the indexing operator, some DataFrame methods facilitate selecting columns in an alternative manner. The .select_dtypes and .filter methods are two useful ways to do this.
If you want to select by type, you need to be familiar with pandas data types. The Understanding data types recipe in Chapter 1, Pandas Foundations, explains the types.
Read in the movie dataset, shorten the column names, and use the .value_counts method on the .dtypes attribute to output the number of columns with each specific data type:
>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
... return (
... str(col)
... .replace("facebook_likes", "fb")
... .replace("_for_reviews", "")
... )
>>> movies = movies.rename(columns=shorten)
>>> movies.dtypes.value_counts()
float64 13
int64 3
object 12
dtype: int64
Use the .select_dtypes method to select only the integer columns:
>>> movies.select_dtypes(include="int").head()
num_voted_users cast_total_fb movie_fb
0 886204 4834 33000
1 471220 48350 0
2 275868 11700 85000
3 1144337 106759 164000
4 8 143 0
To select all of the numeric columns, pass the string 'number' to the include parameter:
>>> movies.select_dtypes(include="number").head()
num_critics duration ... aspect_ratio movie_fb
0 723.0 178.0 ... 1.78 33000
1 302.0 169.0 ... 2.35 0
2 602.0 148.0 ... 2.35 85000
3 813.0 164.0 ... 2.35 164000
4 NaN NaN ... NaN 0
A list of types may also be passed. Here, both integer and object columns are selected:
>>> movies.select_dtypes(include=["int", "object"]).head()
color direc/_name ... conte/ating movie_fb
0 Color James Cameron ... PG-13 33000
1 Color Gore Verbinski ... PG-13 0
2 Color Sam Mendes ... PG-13 85000
3 Color Christopher Nolan ... PG-13 164000
4 NaN Doug Walker ... NaN 0
Alternatively, use the exclude parameter to select all columns that are not of the given type:
>>> movies.select_dtypes(exclude="float").head()
color director_name ... content_rating movie_fb
0 Color James Ca... ... PG-13 33000
1 Color Gore Ver... ... PG-13 0
2 Color Sam Mendes ... PG-13 85000
3 Color Christop... ... PG-13 164000
4 NaN Doug Walker ... NaN 0
An alternative way to select columns is with the .filter method. This method is flexible and searches column names (or index labels) based on which parameter is used. Here, we use the like parameter to search for all the column names that contain the exact string fb. The like parameter checks for substrings in column names:
>>> movies.filter(like="fb").head()
director_fb actor_3_fb ... actor_2_fb movie_fb
0 0.0 855.0 ... 936.0 33000
1 563.0 1000.0 ... 5000.0 0
2 0.0 161.0 ... 393.0 85000
3 22000.0 23000.0 ... 23000.0 164000
4 131.0 NaN ... 12.0 0
The .filter method has more tricks (or parameters) up its sleeve. If you use the items parameter, you can pass in a list of column names:
>>> cols = [
... "actor_1_name",
... "actor_2_name",
... "actor_3_name",
... "director_name",
... ]
>>> movies.filter(items=cols).head()
actor_1_name ... director_name
0 CCH Pounder ... James Cameron
1 Johnny Depp ... Gore Verbinski
2 Christoph Waltz ... Sam Mendes
3 Tom Hardy ... Christopher Nolan
4 Doug Walker ... Doug Walker
The .filter method also allows columns to be searched with regular expressions using the regex parameter. Here, we search for all columns that have a digit somewhere in their name:
>>> movies.filter(regex=r"\d").head()
actor_3_fb actor_2_name ... actor_3_name actor_2_fb
0 855.0 Joel Dav... ... Wes Studi 936.0
1 1000.0 Orlando ... ... Jack Dav... 5000.0
2 161.0 Rory Kin... ... Stephani... 393.0
3 23000.0 Christia... ... Joseph G... 23000.0
4 NaN Rob Walker ... NaN 12.0
Step 1 lists the frequencies of all the different data types. Alternatively, you may use the .dtypes
attribute to get the exact data type for each column. The .select_dtypes
method accepts either a list or single data type in its include
or exclude
parameters and returns a DataFrame with columns of just those given data types (or not those types if excluding columns). The list values may be either the string name of the data type or the actual Python object.
The .filter
method selects columns by only inspecting the column names and not the actual data values. It has three mutually exclusive parameters: items
, like
, and regex
, only one of which can be used at a time.
The like parameter takes a string and attempts to find all the column names that contain that string somewhere in the name. To gain more flexibility, you may use the regex parameter instead to select column names through a regular expression. This particular regular expression, r'\d', represents all digits from zero to nine and matches any string with at least a single digit in it.
The .filter method comes with another parameter, items, which takes a list of exact column names. This is nearly an exact duplication of the indexing operation, except that a KeyError will not be raised if one of the strings does not match a column name. For instance, movies.filter(items=['actor_1_name', 'asdf']) runs without error and returns a single-column DataFrame.
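This difference in error behavior can be sketched on a small stand-in DataFrame (the movie CSV is assumed to be unavailable here):

```python
import pandas as pd

# Stand-in for the movie data, with only two of its columns
df = pd.DataFrame({
    "actor_1_name": ["CCH Pounder"],
    "director_name": ["James Cameron"],
})

# .filter silently drops labels that do not exist
subset = df.filter(items=["actor_1_name", "asdf"])
print(subset.columns.tolist())  # ['actor_1_name']

# ...whereas the indexing operator raises a KeyError for missing labels
try:
    df[["actor_1_name", "asdf"]]
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```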
One confusing aspect of .select_dtypes
is its flexibility to take both strings and Python objects. The following list should clarify all the possible ways to select the many different column data types. There is no standard or preferred method of referring to data types in pandas, so it's good to be aware of both ways:
- np.number, 'number' – Selects both integers and floats regardless of size
- np.float64, np.float_, float, 'float64', 'float_', 'float' – Selects only 64-bit floats
- np.float16, np.float32, np.float128, 'float16', 'float32', 'float128' – Selects exactly 16, 32, and 128-bit floats, respectively
- np.floating, 'floating' – Selects all floats regardless of size
- np.int0, np.int64, np.int_, int, 'int0', 'int64', 'int_', 'int' – Selects only 64-bit integers
- np.int8, np.int16, np.int32, 'int8', 'int16', 'int32' – Selects exactly 8, 16, and 32-bit integers, respectively
- np.integer, 'integer' – Selects all integers regardless of size
- 'Int64' – Selects the nullable integer type; it has no NumPy equivalent
- np.object, 'object', 'O' – Selects all object data types
- np.datetime64, 'datetime64', 'datetime' – Selects datetimes; all datetimes are 64 bits
- np.timedelta64, 'timedelta64', 'timedelta' – Selects timedeltas; all timedeltas are 64 bits
- pd.Categorical, 'category' – Unique to pandas; there is no NumPy equivalent

Because all integers and floats default to 64 bits, you may select them by using the strings 'int' or 'float', as you can see from the preceding bullet list. If you want to select all integers and floats regardless of their specific size, use the string 'number'.
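Since the movie data may not be at hand, here is a small sketch on a hypothetical three-column DataFrame showing that the string and NumPy-object spellings are interchangeable:

```python
import numpy as np
import pandas as pd

# Hypothetical frame covering the common dtypes
df = pd.DataFrame({
    "i": np.array([1, 2], dtype="int64"),
    "f": np.array([1.5, 2.5], dtype="float64"),
    "o": ["x", "y"],
})

# The string 'number' and the NumPy object np.number are interchangeable
print(df.select_dtypes(include="number").columns.tolist())   # ['i', 'f']
print(df.select_dtypes(include=np.number).columns.tolist())  # ['i', 'f']

# Exact-size strings select only that dtype; lists combine types
print(df.select_dtypes(include="int64").columns.tolist())    # ['i']
print(df.select_dtypes(include=["float64", "object"]).columns.tolist())  # ['f', 'o']
```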
One of the first tasks to consider after initially importing a dataset as a DataFrame is to analyze the order of the columns. As humans we are used to reading languages from left to right, which impacts our interpretations of the data. It's far easier to find and interpret information when column order is given consideration.
There is no standardized set of rules that dictates how columns should be organized within a dataset. However, it is good practice to develop a set of guidelines that you consistently follow. This is especially true if you work with a group of analysts who share lots of datasets.
The following is a guideline for ordering columns:
- Classify each column as either categorical or continuous
- Group common columns within each classification
- Place the most important groups of columns first, with categorical columns before continuous ones
This recipe shows you how to order the columns with this guideline. There are many possible orderings that are sensible.
>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
... return col.replace("facebook_likes", "fb").replace(
... "_for_reviews", ""
... )
>>> movies = movies.rename(columns=shorten)
>>> movies.columns
Index(['color', 'director_name', 'num_critic', 'duration', 'director_fb',
'actor_3_fb', 'actor_2_name', 'actor_1_fb', 'gross', 'genres',
'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_fb',
'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
'movie_imdb_link', 'num_user', 'language', 'country', 'content_rating',
'budget', 'title_year', 'actor_2_fb', 'imdb_score', 'aspect_ratio',
'movie_fb'],
dtype='object')
>>> cat_core = [
... "movie_title",
... "title_year",
... "content_rating",
... "genres",
... ]
>>> cat_people = [
... "director_name",
... "actor_1_name",
... "actor_2_name",
... "actor_3_name",
... ]
>>> cat_other = [
... "color",
... "country",
... "language",
... "plot_keywords",
... "movie_imdb_link",
... ]
>>> cont_fb = [
... "director_fb",
... "actor_1_fb",
... "actor_2_fb",
... "actor_3_fb",
... "cast_total_fb",
... "movie_fb",
... ]
>>> cont_finance = ["budget", "gross"]
>>> cont_num_reviews = [
... "num_voted_users",
... "num_user",
... "num_critic",
... ]
>>> cont_other = [
... "imdb_score",
... "duration",
... "aspect_ratio",
... "facenumber_in_poster",
... ]
>>> new_col_order = (
... cat_core
... + cat_people
... + cat_other
... + cont_fb
... + cont_finance
... + cont_num_reviews
... + cont_other
... )
>>> set(movies.columns) == set(new_col_order)
True
>>> movies[new_col_order].head()
movie_title title_year ... aspect_ratio facenumber_in_poster
0 Avatar 2009.0 ... 1.78 0.0
1 Pirates ... 2007.0 ... 2.35 0.0
2 Spectre 2015.0 ... 2.35 1.0
3 The Dark... 2012.0 ... 2.35 0.0
4 Star War... NaN ... NaN 0.0
You can select a subset of columns from a DataFrame by passing a list of specific column names. For instance, movies[['movie_title', 'director_name']]
creates a new DataFrame with only the movie_title
and director_name
columns. Selecting columns by name is the default behavior of the index operator for a pandas DataFrame.
Step 3 neatly organizes all of the column names into separate lists based on their type (categorical or continuous) and by how similar their data is. The most important columns, such as the title of the movie, are placed first.
Step 4 concatenates all of the lists of column names and validates that this new list contains the same exact values as the original column names. Python sets are unordered and the equality statement checks whether each member of one set is a member of the other. Manually ordering columns in this recipe is susceptible to human error as it's easy to mistakenly forget a column in the new column list.
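The set comparison in step 4 can be sketched on a toy DataFrame (column names hypothetical) to show how a forgotten column is caught:

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

new_order = ["c", "a"]  # "b" was forgotten
print(set(df.columns) == set(new_order))  # False -- the check catches the mistake

new_order = ["c", "a", "b"]
print(set(df.columns) == set(new_order))  # True -- safe to reorder now
```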
Step 5 completes the reordering by passing the new column order as a list to the indexing operator. This new order is now much more sensible than the original.
There are alternative guidelines for ordering columns besides the suggestion mentioned earlier. Hadley Wickham's seminal paper on Tidy Data suggests placing the fixed variables first, followed by measured variables. As this data does not come from a controlled experiment, there is some flexibility in determining which variables are fixed and which are measured. Good candidates for measured variables are those that we would like to predict, such as gross, budget, or imdb_score. Note that such an ordering may mix categorical and continuous variables; for instance, it might make more sense to place the column for the number of Facebook likes directly after the name of that actor. You can, of course, come up with your own guidelines for column order, as the computational parts are unaffected by it.
In the Calling Series methods recipe in Chapter 1, Pandas Foundations, a variety of methods operated on a single column or Series of data. Many of these were aggregation or reducing methods that returned a single scalar value. When these same methods are called from a DataFrame, they perform that operation for each column at once and reduce the results for each column in the DataFrame. They return a Series with the column names in the index and the summary for each column as the value.
In this recipe, we explore a variety of the most common DataFrame attributes and methods with the movie dataset.
After reading in the movie dataset, look at the basic DataFrame attributes, .shape, .size, and .ndim, along with running the len function:
>>> movies = pd.read_csv("data/movie.csv")
>>> movies.shape
(4916, 28)
>>> movies.size
137648
>>> movies.ndim
2
>>> len(movies)
4916
The .count method shows the number of non-missing values for each column. It is an aggregation method, as it summarizes every column in a single value. The output is a Series that has the original column names as its index:
>>> movies.count()
color 4897
director_name 4814
num_critic_for_reviews 4867
duration 4901
director_facebook_likes 4814
...
title_year 4810
actor_2_facebook_likes 4903
imdb_score 4916
aspect_ratio 4590
movie_facebook_likes 4916
Length: 28, dtype: int64
The other aggregation methods, such as .min, .max, .mean, .median, and .std, return Series that have the column names of the numeric columns in the index and their aggregations as the values:
>>> movies.min()
num_critic_for_reviews 1.00
duration 7.00
director_facebook_likes 0.00
actor_3_facebook_likes 0.00
actor_1_facebook_likes 0.00
...
title_year 1916.00
actor_2_facebook_likes 0.00
imdb_score 1.60
aspect_ratio 1.18
movie_facebook_likes 0.00
Length: 16, dtype: float64
The .describe method is very powerful and calculates all the descriptive statistics and quartiles at once. The end result is a DataFrame with the descriptive statistic names as its index. I like to transpose the results using .T as I can usually fit more information on the screen that way:
>>> movies.describe().T
count mean ... 75% max
num_criti... 4867.0 137.988905 ... 191.00 813.0
duration 4901.0 107.090798 ... 118.00 511.0
director_... 4814.0 691.014541 ... 189.75 23000.0
actor_3_f... 4893.0 631.276313 ... 633.00 23000.0
actor_1_f... 4909.0 6494.488491 ... 11000.00 640000.0
... ... ... ... ... ...
title_year 4810.0 2002.447609 ... 2011.00 2016.0
actor_2_f... 4903.0 1621.923516 ... 912.00 137000.0
imdb_score 4916.0 6.437429 ... 7.20 9.5
aspect_ratio 4590.0 2.222349 ... 2.35 16.0
movie_fac... 4916.0 7348.294142 ... 2000.00 349000.0
It is possible to specify exact quantiles in the .describe method using the percentiles parameter:
>>> movies.describe(percentiles=[0.01, 0.3, 0.99]).T
count mean ... 99% max
num_criti... 4867.0 137.988905 ... 546.68 813.0
duration 4901.0 107.090798 ... 189.00 511.0
director_... 4814.0 691.014541 ... 16000.00 23000.0
actor_3_f... 4893.0 631.276313 ... 11000.00 23000.0
actor_1_f... 4909.0 6494.488491 ... 44920.00 640000.0
... ... ... ... ... ...
title_year 4810.0 2002.447609 ... 2016.00 2016.0
actor_2_f... 4903.0 1621.923516 ... 17000.00 137000.0
imdb_score 4916.0 6.437429 ... 8.50 9.5
aspect_ratio 4590.0 2.222349 ... 4.00 16.0
movie_fac... 4916.0 7348.294142 ... 93850.00 349000.0
Step 1 gives basic information on the size of the dataset. The .shape
attribute returns a tuple with the number of rows and columns. The .size
attribute returns the total number of elements in the DataFrame, which is just the product of the number of rows and columns. The .ndim
attribute returns the number of dimensions, which is two for all DataFrames. When a DataFrame is passed to the built-in len
function, it returns the number of rows.
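The relationships between these attributes can be verified on a tiny stand-in DataFrame:

```python
import pandas as pd

# A small stand-in frame: 3 rows, 2 columns
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

print(df.shape)                              # (3, 2)
print(df.size == df.shape[0] * df.shape[1])  # True -- size is rows * columns
print(df.ndim)                               # 2 -- always 2 for a DataFrame
print(len(df) == df.shape[0])                # True -- len gives the row count
```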
The methods in step 2 and step 3 aggregate each column down to a single number. Each column name is now the index label in a Series with its aggregated result as the corresponding value.
If you look closely, you will notice that the output from step 3 is missing all the object columns from step 2. By default, these aggregation methods ignore string columns.
Note that numeric columns have missing values but have a result returned by .describe
. By default, pandas handles missing values in numeric columns by skipping them. It is possible to change this behavior by setting the skipna
parameter to False
. This will cause pandas to return NaN
for all these aggregation methods if there exists at least a single missing value.
The .describe
method displays the summary statistics of the numeric columns. You can expand its summary to include more quantiles by passing a list of numbers between 0 and 1 to the percentiles
parameter. See the Developing a data analysis routine recipe for more on the .describe
method.
To see how the skipna parameter affects the outcome, we can set its value to False and rerun step 3 from the preceding recipe. Only numeric columns without missing values will calculate a result:
>>> movies.min(skipna=False)
num_critic_for_reviews NaN
duration NaN
director_facebook_likes NaN
actor_3_facebook_likes NaN
actor_1_facebook_likes NaN
...
title_year NaN
actor_2_facebook_likes NaN
imdb_score 1.6
aspect_ratio NaN
movie_facebook_likes 0.0
Length: 16, dtype: float64
The Chaining Series methods recipe in Chapter 1, Pandas Foundations, showcased several examples of chaining Series methods together. All the method chains in this chapter will begin from a DataFrame. One of the keys to method chaining is to know the exact object being returned during each step of the chain. In pandas, this will nearly always be a DataFrame, Series, or scalar value.
In this recipe, we count all the missing values in each column of the movie dataset.
Read in the movie dataset, shorten the column names, and apply the .isnull method. This method changes every value to a Boolean, indicating whether it is missing:
>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
... return col.replace("facebook_likes", "fb").replace(
... "_for_reviews", ""
... )
>>> movies = movies.rename(columns=shorten)
>>> movies.isnull().head()
color director_name ... aspect_ratio movie_fb
0 False False ... False False
1 False False ... False False
2 False False ... False False
3 False False ... False False
4 True False ... True False
We can chain the .sum method, which interprets True and False as 1 and 0, respectively. Because this is a reduction method, it aggregates the results into a Series:
>>> (movies.isnull().sum().head())
color 19
director_name 102
num_critic 49
duration 15
director_fb 102
dtype: int64
>>> movies.isnull().sum().sum()
2654
To determine whether any value in the DataFrame is missing, use the .any method twice in succession:
>>> movies.isnull().any().any()
True
The .isnull method returns a DataFrame of the same size as the calling DataFrame, but with all values transformed to Booleans. The counts of the following data types verify this:
>>> movies.isnull().dtypes.value_counts()
bool 28
dtype: int64
In Python, Booleans evaluate to 0 and 1, and this makes it possible to sum them by column, as done in step 2. The resulting Series itself also has a .sum
method, which gets us the grand total of missing values in the DataFrame.
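This Boolean arithmetic can be checked directly on a tiny, hypothetical Series:

```python
import pandas as pd

# True sums as 1 and False as 0, both in plain Python and in pandas
s = pd.Series([True, False, True, True])
print(s.sum())      # 3
print(True + True)  # 2
```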
In step 4, the .any
method on a DataFrame returns a Series of Booleans indicating if there exists at least one True
for each column. The .any
method is chained again on this resulting Series of Booleans to determine if any of the columns have missing values. If step 4 evaluates as True
, then there is at least one missing value in the entire DataFrame.
Most of the columns in the movie dataset with the object data type contain missing values. By default, aggregation methods (.min, .max, and .sum) do not return anything for object columns, as seen in the following code snippet, which selects three object columns and attempts to find the maximum value of each one:
>>> movies[["color", "movie_title", "content_rating"]].max()
Series([], dtype: float64)
To force pandas to return something for each column, we must fill in the missing values. Here, we choose an empty string:
>>> movies.select_dtypes(["object"]).fillna("").max()
color Color
director_name Étienne Faure
actor_2_name Zubaida Sahar
genres Western
actor_1_name Óscar Jaenada
...
plot_keywords zombie|zombie spoof
movie_imdb_link http://www.imdb....
language Zulu
country West Germany
content_rating X
Length: 12, dtype: object
For purposes of readability, method chains are often written as one method call per line surrounded by parentheses. This makes it easier to read and insert comments on what is returned at each step of the chain, or comment out lines to debug what is happening:
>>> (movies.select_dtypes(["object"]).fillna("").max())
color Color
director_name Étienne Faure
actor_2_name Zubaida Sahar
genres Western
actor_1_name Óscar Jaenada
...
plot_keywords zombie|zombie spoof
movie_imdb_link http://www.imdb....
language Zulu
country West Germany
content_rating X
Length: 12, dtype: object
A primer on operators was given in the Series operations recipe from Chapter 1, Pandas Foundations, which will be helpful here. The Python arithmetic and comparison operators work with DataFrames, as they do with Series.
When an arithmetic or comparison operator is used with a DataFrame, each value of each column gets the operation applied to it. Typically, when an operator is used with a DataFrame, the columns are either all numeric or all object (usually strings). If the DataFrame does not contain homogeneous data, then the operation is likely to fail. Let's see an example of this failure with the college dataset, which contains both numeric and object data types. Attempting to add 5
to each value of the DataFrame raises a TypeError
as integers cannot be added to strings:
>>> colleges = pd.read_csv("data/college.csv")
>>> colleges + 5
Traceback (most recent call last):
...
TypeError: can only concatenate str (not "int") to str
To successfully use an operator with a DataFrame, first select homogeneous data. For this recipe, we will select all the columns that begin with 'UGDS_'
. These columns represent the fraction of undergraduate students by race. To get started, we import the data and use the institution name as the label for our index, and then select the columns we desire with the .filter
method:
>>> colleges = pd.read_csv(
... "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = colleges.filter(like="UGDS_")
>>> college_ugds.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 0.0333 0.9353 ... 0.0059 0.0138
Universit... 0.5922 0.2600 ... 0.0179 0.0100
Amridge U... 0.2990 0.4192 ... 0.0000 0.2715
Universit... 0.6988 0.1255 ... 0.0332 0.0350
Alabama S... 0.0158 0.9208 ... 0.0243 0.0137
This recipe uses multiple operators with a DataFrame to round the undergraduate columns to the nearest hundredth. We will then see how this result is equivalent to the .round
method.
Select a single institution's data as a Series, and notice what happens to the UGDS_BLACK row of this Series when we round it to two decimal places — the value 0.1250 rounds down to 0.12:
>>> name = "Northwest-Shoals Community College"
>>> college_ugds.loc[name]
UGDS_WHITE 0.7912
UGDS_BLACK 0.1250
UGDS_HISP 0.0339
UGDS_ASIAN 0.0036
UGDS_AIAN 0.0088
UGDS_NHPI 0.0006
UGDS_2MOR 0.0012
UGDS_NRA 0.0033
UGDS_UNKN 0.0324
Name: Northwest-Shoals Community College, dtype: float64
>>> college_ugds.loc[name].round(2)
UGDS_WHITE 0.79
UGDS_BLACK 0.12
UGDS_HISP 0.03
UGDS_ASIAN 0.00
UGDS_AIAN 0.01
UGDS_NHPI 0.00
UGDS_2MOR 0.00
UGDS_NRA 0.00
UGDS_UNKN 0.03
Name: Northwest-Shoals Community College, dtype: float64
If we add .0001 before rounding, the value rounds up instead:
>>> (college_ugds.loc[name] + 0.0001).round(2)
UGDS_WHITE 0.79
UGDS_BLACK 0.13
UGDS_HISP 0.03
UGDS_ASIAN 0.00
UGDS_AIAN 0.01
UGDS_NHPI 0.00
UGDS_2MOR 0.00
UGDS_NRA 0.00
UGDS_UNKN 0.03
Name: Northwest-Shoals Community College, dtype: float64
Now, to round the entire DataFrame using operators, begin by adding .00501 to each value of college_ugds:
>>> college_ugds + 0.00501
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 0.03831 0.94031 ... 0.01091 0.01881
Universit... 0.59721 0.26501 ... 0.02291 0.01501
Amridge U... 0.30401 0.42421 ... 0.00501 0.27651
Universit... 0.70381 0.13051 ... 0.03821 0.04001
Alabama S... 0.02081 0.92581 ... 0.02931 0.01871
... ... ... ... ... ...
SAE Insti... NaN NaN ... NaN NaN
Rasmussen... NaN NaN ... NaN NaN
National ... NaN NaN ... NaN NaN
Bay Area ... NaN NaN ... NaN NaN
Excel Lea... NaN NaN ... NaN NaN
Use the floor division operator, //, to round down to the nearest whole number percentage:
>>> (college_ugds + 0.00501) // 0.01
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 3.0 94.0 ... 1.0 1.0
Universit... 59.0 26.0 ... 2.0 1.0
Amridge U... 30.0 42.0 ... 0.0 27.0
Universit... 70.0 13.0 ... 3.0 4.0
Alabama S... 2.0 92.0 ... 2.0 1.0
... ... ... ... ... ...
SAE Insti... NaN NaN ... NaN NaN
Rasmussen... NaN NaN ... NaN NaN
National ... NaN NaN ... NaN NaN
Bay Area ... NaN NaN ... NaN NaN
Excel Lea... NaN NaN ... NaN NaN
Use the division operator to return the decimal to the correct position by dividing by 100:
>>> college_ugds_op_round = (
... (college_ugds + 0.00501) // 0.01 / 100
... )
>>> college_ugds_op_round.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 0.03 0.94 ... 0.01 0.01
Universit... 0.59 0.26 ... 0.02 0.01
Amridge U... 0.30 0.42 ... 0.00 0.27
Universit... 0.70 0.13 ... 0.03 0.04
Alabama S... 0.02 0.92 ... 0.02 0.01
Now use the .round method to replicate the previous result. Due to banker's rounding, a small fraction is added before rounding:
>>> college_ugds_round = (college_ugds + 0.00001).round(2)
>>> college_ugds_round
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 0.03 0.94 ... 0.01 0.01
Universit... 0.59 0.26 ... 0.02 0.01
Amridge U... 0.30 0.42 ... 0.00 0.27
Universit... 0.70 0.13 ... 0.03 0.04
Alabama S... 0.02 0.92 ... 0.02 0.01
... ... ... ... ... ....
SAE Insti... NaN NaN ... NaN NaN
Rasmussen... NaN NaN ... NaN NaN
National ... NaN NaN ... NaN NaN
Bay Area ... NaN NaN ... NaN NaN
Excel Lea... NaN NaN ... NaN NaN
Use the .equals method to test that the two DataFrames are equal:
>>> college_ugds_op_round.equals(college_ugds_round)
True
Steps 1 and 2 use the plus operator, which attempts to add a scalar value to each value of each column of the DataFrame. As the columns are all numeric, this operation works as expected. There are some missing values in each of the columns but they stay missing after the operation.
Mathematically, adding .005
should be enough so that the floor division in the next step correctly rounds to the nearest whole percentage. The trouble appears because of the inexactness of floating-point numbers:
>>> 0.045 + 0.005
0.049999999999999996
There is an extra .00001
added to each number to ensure that the floating-point representation has the first four digits the same as the actual value. This works because the maximum precision of all the points in the dataset is four decimal places.
Step 3 applies the floor division operator, //
, to all the values in the DataFrame. As we are dividing by a fraction, in essence, it is multiplying each value by 100
and truncating any decimals. Parentheses are needed around the first part of the expression, as floor division has higher precedence than addition. Step 4 uses the division operator to return the decimal to the correct position.
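The precedence rule mentioned above is easy to verify in plain Python:

```python
# Floor division binds more tightly than addition,
# so the two expressions parse differently
a = 1 + 4 // 2    # parsed as 1 + (4 // 2) == 3
b = (1 + 4) // 2  # parentheses force the addition first == 2
print(a, b)  # 3 2
```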
In step 5, we reproduce the previous steps with the .round method. Before we can do this, we must again add an extra .00001 to each DataFrame value, for a different reason than in step 2. NumPy and Python 3 round numbers that are exactly halfway between either side to the even number. This banker's rounding (or ties to even, http://bit.ly/2x3V5TU) technique is not usually what is formally taught in schools. It does not consistently bias numbers to the higher side (http://bit.ly/2zhsPy8).
It is necessary here to round up so that both DataFrame values are equal. The .equals
method determines if all the elements and indexes between two DataFrames are exactly the same and returns a Boolean.
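The banker's rounding behavior described above can be verified directly in plain Python:

```python
# Python 3 rounds ties to the nearest even number (banker's rounding)
results = [round(0.5), round(1.5), round(2.5), round(3.5)]
print(results)  # [0, 2, 2, 4]

# The same applies at decimal places when the value is exactly
# representable in binary (0.125 == 1/8): the tie goes to the even digit
print(round(0.125, 2))  # 0.12
```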
Just as with Series, DataFrames have method equivalents of the operators. You may replace the operators with their method equivalents:
>>> college2 = (
... college_ugds.add(0.00501).floordiv(0.01).div(100)
... )
>>> college2.equals(college_ugds_op_round)
True
pandas uses the NumPy NaN (np.nan) object to represent a missing value. This is an unusual object with interesting mathematical properties; for instance, it is not equal to itself. By contrast, Python's None object does evaluate as True when compared to itself:
>>> np.nan == np.nan
False
>>> None == None
True
All other comparisons against np.nan
also return False
, except not equal to (!=
):
>>> np.nan > 5
False
>>> 5 > np.nan
False
>>> np.nan != 5
True
Series and DataFrames use the equals operator, ==
, to make element-by-element comparisons. The result is an object with the same dimensions. This recipe shows you how to use the equals operator, which is very different from the .equals
method.
As in the previous recipe, the columns representing the fraction of each race of undergraduate students from the college dataset will be used:
>>> college = pd.read_csv(
... "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")
To begin, compare each element of college_ugds to a scalar value:
>>> college_ugds == 0.0019
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... False False ... False False
Universit... False False ... False False
Amridge U... False False ... False False
Universit... False False ... False False
Alabama S... False False ... False False
... ... ... ... ... ...
SAE Insti... False False ... False False
Rasmussen... False False ... False False
National ... False False ... False False
Bay Area ... False False ... False False
Excel Lea... False False ... False False
Next, compare college_ugds to itself, as follows:
>>> college_self_compare = college_ugds == college_ugds
>>> college_self_compare.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... True True ... True True
Universit... True True ... True True
Amridge U... True True ... True True
Universit... True True ... True True
Alabama S... True True ... True True
Using the .all method to determine whether each column contains only True values yields an unexpected result:
>>> college_self_compare.all()
UGDS_WHITE False
UGDS_BLACK False
UGDS_HISP False
UGDS_ASIAN False
UGDS_AIAN False
UGDS_NHPI False
UGDS_2MOR False
UGDS_NRA False
UGDS_UNKN False
dtype: bool
This happens because missing values do not compare equally with one another. Counting the number of values equal to np.nan shows that the comparison never returns True:
>>> (college_ugds == np.nan).sum()
UGDS_WHITE 0
UGDS_BLACK 0
UGDS_HISP 0
UGDS_ASIAN 0
UGDS_AIAN 0
UGDS_NHPI 0
UGDS_2MOR 0
UGDS_NRA 0
UGDS_UNKN 0
dtype: int64
Instead of using == to find missing numbers, use the .isna method:
>>> college_ugds.isna().sum()
UGDS_WHITE 661
UGDS_BLACK 661
UGDS_HISP 661
UGDS_ASIAN 661
UGDS_AIAN 661
UGDS_NHPI 661
UGDS_2MOR 661
UGDS_NRA 661
UGDS_UNKN 661
dtype: int64
To compare two entire DataFrames, do not use the equality operator (==) but the .equals method. This method treats NaNs that are in the same location as equal (note that the .eq method is the element-wise equivalent of ==):
>>> college_ugds.equals(college_ugds)
True
Step 1 compares a DataFrame to a scalar value while step 2 compares a DataFrame with another DataFrame. Both operations appear to be quite simple and intuitive at first glance. The second operation is checking whether the DataFrames have identically labeled indexes and thus the same number of elements. The operation will fail if this isn't the case.
Step 3 verifies that none of the columns in the DataFrames are equivalent to each other. Step 4 further shows the non-equivalence of np.nan and itself. Step 5 verifies that there are indeed missing values in the DataFrame. Finally, step 6 shows the correct way to compare DataFrames with the .equals method, which always returns a Boolean scalar value.
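The NaN behavior behind these steps can be seen on a small toy frame (df1 and df2 are illustrative names, not part of the recipe's dataset):

```python
import numpy as np
import pandas as pd

# NaN never compares equal to itself.
nan_equal = np.nan == np.nan  # False

df1 = pd.DataFrame({"a": [0.1, np.nan], "b": [0.5, 0.4]})
df2 = df1.copy()

# Element-wise == reports False wherever a NaN sits...
elementwise = df1 == df2
col_all_true = elementwise.all()  # column "a" fails the check

# ...but .equals treats NaNs in the same location as equal.
frames_equal = df1.equals(df2)  # True
```

This is why the self-comparison in the recipe fails the .all() check while .equals still returns True.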
All the comparison operators have method counterparts that allow for more functionality. Somewhat confusingly, the .eq DataFrame method does element-by-element comparison, just like the equality (==) operator. The .eq method is not at all the same as the .equals method. The following code duplicates step 1:
>>> college_ugds.eq(0.0019) # same as college_ugds == .0019
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... False False ... False False
Universit... False False ... False False
Amridge U... False False ... False False
Universit... False False ... False False
Alabama S... False False ... False False
... ... ... ... ... ...
SAE Insti... False False ... False False
Rasmussen... False False ... False False
National ... False False ... False False
Bay Area ... False False ... False False
Excel Lea... False False ... False False
Inside the pandas.testing sub-package, a function exists that developers should use when creating unit tests. The assert_frame_equal function raises an AssertionError if two DataFrames are not equal, and returns None if they are equal:
>>> from pandas.testing import assert_frame_equal
>>> assert_frame_equal(college_ugds, college_ugds) is None
True
Unit tests are a very important part of software development and ensure that the code is running correctly. pandas contains many thousands of unit tests that help ensure that it is running properly. To read more on how pandas runs its unit tests, see the Contributing to pandas section in the documentation (http://bit.ly/2vmCSU6).
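A minimal sketch of how assert_frame_equal behaves inside a test, using throwaway frames (df1 and df2 are hypothetical):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

df1 = pd.DataFrame({"x": [1.0, 2.0]})
df2 = pd.DataFrame({"x": [1.0, 2.5]})

# Equal frames: the function returns None and raises nothing.
result = assert_frame_equal(df1, df1)

# Unequal frames: an AssertionError is raised, which is what
# makes the function useful inside unit test suites.
try:
    assert_frame_equal(df1, df2)
    raised = False
except AssertionError:
    raised = True
```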
Many DataFrame methods have an axis parameter. This parameter controls the direction in which the operation takes place. The axis parameter can be 'index' (or 0) or 'columns' (or 1). I prefer the string versions, as they are more explicit and tend to make the code easier to read.
Nearly all DataFrame methods default the axis parameter to 0, which applies to operations along the index. This recipe shows you how to invoke the same method along both axes.
>>> college = pd.read_csv(
... "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")
>>> college_ugds.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 0.0333 0.9353 ... 0.0059 0.0138
Universit... 0.5922 0.2600 ... 0.0179 0.0100
Amridge U... 0.2990 0.4192 ... 0.0000 0.2715
Universit... 0.6988 0.1255 ... 0.0332 0.0350
Alabama S... 0.0158 0.9208 ... 0.0243 0.0137
The .count method returns the number of non-missing values. By default, its axis parameter is set to 0:
>>> college_ugds.count()
UGDS_WHITE 6874
UGDS_BLACK 6874
UGDS_HISP 6874
UGDS_ASIAN 6874
UGDS_AIAN 6874
UGDS_NHPI 6874
UGDS_2MOR 6874
UGDS_NRA 6874
UGDS_UNKN 6874
dtype: int64
The axis parameter is almost always set to 0. So, step 2 is equivalent to both college_ugds.count(axis=0) and college_ugds.count(axis='index').
Changing the axis parameter to 'columns' changes the direction of the operation so that we get back a count of non-missing items in each row:
>>> college_ugds.count(axis="columns").head()
INSTNM
Alabama A & M University 9
University of Alabama at Birmingham 9
Amridge University 9
University of Alabama in Huntsville 9
Alabama State University 9
dtype: int64
Each row of race fractions should add up to approximately 1; the .sum method may be used to verify this:
>>> college_ugds.sum(axis="columns").head()
INSTNM
Alabama A & M University 1.0000
University of Alabama at Birmingham 0.9999
Amridge University 1.0000
University of Alabama in Huntsville 1.0000
Alabama State University 1.0000
dtype: float64
To find the median of each column, the .median method can be used:
>>> college_ugds.median(axis="index")
UGDS_WHITE 0.55570
UGDS_BLACK 0.10005
UGDS_HISP 0.07140
UGDS_ASIAN 0.01290
UGDS_AIAN 0.00260
UGDS_NHPI 0.00000
UGDS_2MOR 0.01750
UGDS_NRA 0.00000
UGDS_UNKN 0.01430
dtype: float64
The direction of operation on the axis is one of the more confusing aspects of pandas. Many pandas users have difficulty remembering the meaning of the axis parameter. I remember them by reminding myself that a Series only has one axis, the index (or 0). A DataFrame also has an index (axis 0) and columns (axis 1).
The .cumsum
method with axis=1
accumulates the race percentages across each row. It gives a slightly different view of the data. For example, it is very easy to see the exact percentage of white and black students for each school:
>>> college_ugds_cumsum = college_ugds.cumsum(axis=1)
>>> college_ugds_cumsum.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 0.0333 0.9686 ... 0.9862 1.0000
Universit... 0.5922 0.8522 ... 0.9899 0.9999
Amridge U... 0.2990 0.7182 ... 0.7285 1.0000
Universit... 0.6988 0.8243 ... 0.9650 1.0000
Alabama S... 0.0158 0.9366 ... 0.9863 1.0000
Many articles are written every year on the different aspects and impacts of diversity on college campuses. Various organizations have developed metrics attempting to measure diversity. US News is a leader in providing rankings for many different categories of colleges, with diversity being one of them. Their top 10 diverse colleges with Diversity Index are given as follows:
>>> pd.read_csv(
... "data/college_diversity.csv", index_col="School"
... )
Diversity Index
School
Rutgers University--Newark Newark, NJ 0.76
Andrews University Berrien Springs, MI 0.74
Stanford University Stanford, CA 0.74
University of Houston Houston, TX 0.74
University of Nevada--Las Vegas Las Vegas, NV 0.74
University of San Francisco San Francisco, CA 0.74
San Francisco State University San Francisco, CA 0.73
University of Illinois--Chicago Chicago, IL 0.73
New Jersey Institute of Technology Newark, NJ 0.72
Texas Woman's University Denton, TX 0.72
Our college dataset classifies race into nine different categories. When trying to quantify something without an obvious definition, such as diversity, it helps to start with something simple. In this recipe, our diversity metric will equal the count of the number of races having greater than 15% of the student population.
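The metric can be sketched first on a toy table (the column names mimic the dataset, but the fractions and school names are made up):

```python
import pandas as pd

# Made-up race fractions for three hypothetical schools.
ugds = pd.DataFrame(
    {
        "UGDS_WHITE": [0.60, 0.20, 0.10],
        "UGDS_BLACK": [0.30, 0.40, 0.10],
        "UGDS_HISP": [0.10, 0.40, 0.80],
    },
    index=["School A", "School B", "School C"],
)

# Diversity metric: number of races above 15% for each school.
diversity = ugds.ge(0.15).sum(axis="columns")
# School A -> 2, School B -> 3, School C -> 1
```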
>>> college = pd.read_csv(
... "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")
>>> (
... college_ugds.isnull()
... .sum(axis="columns")
... .sort_values(ascending=False)
... .head()
... )
INSTNM
Excel Learning Center-San Antonio South 9
Philadelphia College of Osteopathic Medicine 9
Assemblies of God Theological Seminary 9
Episcopal Divinity School 9
Phillips Graduate Institute 9
dtype: int64
Use the .dropna method to drop all rows that have all nine race percentages missing. We can then count the remaining missing values:
>>> college_ugds = college_ugds.dropna(how="all")
>>> college_ugds.isnull().sum()
UGDS_WHITE 0
UGDS_BLACK 0
UGDS_HISP 0
UGDS_ASIAN 0
UGDS_AIAN 0
UGDS_NHPI 0
UGDS_2MOR 0
UGDS_NRA 0
UGDS_UNKN 0
dtype: int64
Use the greater than or equal to method, .ge, to return a DataFrame with a Boolean value for each cell:
>>> college_ugds.ge(0.15)
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... False True ... False False
Universit... True True ... False False
Amridge U... True True ... False True
Universit... True False ... False False
Alabama S... False True ... False False
... ... ... ... ... ...
Hollywood... True True ... False False
Hollywood... False True ... False False
Coachella... True False ... False False
Dewey Uni... False False ... False False
Coastal P... True True ... False False
Use the .sum method to count the True values for each college. Notice that a Series is returned:
>>> diversity_metric = college_ugds.ge(0.15).sum(
... axis="columns"
... )
>>> diversity_metric.head()
INSTNM
Alabama A & M University 1
University of Alabama at Birmingham 2
Amridge University 3
University of Alabama in Huntsville 1
Alabama State University 1
dtype: int64
To see the distribution of this metric, use the .value_counts method on this Series:
>>> diversity_metric.value_counts()
1 3042
2 2884
3 876
4 63
0 7
5 2
dtype: int64
Two schools have five races with more than 15% of the population. Sort the diversity_metric Series to find out which ones they are:
>>> diversity_metric.sort_values(ascending=False).head()
INSTNM
Regency Beauty Institute-Austin 5
Central Texas Beauty College-Temple 5
Sullivan and Cogliano Training Center 4
Ambria College of Nursing 4
Berkeley College-New York 4
dtype: int64
Use .loc to select rows based on the index label:
>>> college_ugds.loc[
... [
... "Regency Beauty Institute-Austin",
... "Central Texas Beauty College-Temple",
... ]
... ]
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Regency B... 0.1867 0.2133 ... 0.0 0.2667
Central T... 0.1616 0.2323 ... 0.0 0.1515
Let's see how the top schools from the US News article fare with our metric:
>>> us_news_top = [
... "Rutgers University-Newark",
... "Andrews University",
... "Stanford University",
... "University of Houston",
... "University of Nevada-Las Vegas",
... ]
>>> diversity_metric.loc[us_news_top]
INSTNM
Rutgers University-Newark 4
Andrews University 3
Stanford University 3
University of Houston 3
University of Nevada-Las Vegas 3
dtype: int64
Step 2 counts and then displays the schools with the highest number of missing values. As there are nine columns in the DataFrame, the maximum number of missing values per school is nine. Many schools are missing values for each column. Step 3 removes rows that have all their values missing. The .dropna method in step 3 has the how parameter, which defaults to the string 'any', but may also be changed to 'all'. When set to 'any', it drops rows that contain one or more missing values. When set to 'all', it only drops rows where all values are missing.
In this case, we conservatively drop rows that are missing all values. This is because it's possible that some missing values represent 0 percent. This did not happen to be the case here, as there were no missing values left after the .dropna method was applied. If there were still missing values, we could have run the .fillna(0) method to fill all the remaining values with 0.
Step 5 begins our diversity metric calculation using the greater than or equal to method, .ge. This results in a DataFrame of all Booleans, which is summed horizontally by setting axis='columns'.
The .value_counts method is used in step 6 to produce a distribution of our diversity metric. It is quite rare for schools to have three races with 15% or more of the undergraduate student population. Step 7 and step 8 find the two schools that are the most diverse based on our metric. Although they are diverse, it appears that many of the races are not fully accounted for and are defaulted into the unknown and two-or-more categories.
Step 9 selects the top five schools from the US News article. It then selects their diversity metric from our newly created Series. It turns out that these schools also score highly with our simple ranking system.
Alternatively, we can find the schools that are least diverse by ordering them by their maximum race percentage:
>>> (
... college_ugds.max(axis=1)
... .sort_values(ascending=False)
... .head(10)
... )
INSTNM
Dewey University-Manati 1.0
Yeshiva and Kollel Harbotzas Torah 1.0
Mr Leon's School of Hair Design-Lewiston 1.0
Dewey University-Bayamon 1.0
Shepherds Theological Seminary 1.0
Yeshiva Gedolah Kesser Torah 1.0
Monteclaro Escuela de Hoteleria y Artes Culinarias 1.0
Yeshiva Shaar Hatorah 1.0
Bais Medrash Elyon 1.0
Yeshiva of Nitra Rabbinical College 1.0
dtype: float64
We can also determine if any school has all nine race categories exceeding 1%:
>>> (college_ugds > 0.01).all(axis=1).any()
True
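The same .all/.any chain can be traced on a two-school toy frame to make the reduction order clear:

```python
import pandas as pd

# Toy fractions; "s1" exceeds the 1% threshold in every column,
# "s2" does not.
df = pd.DataFrame(
    {"a": [0.50, 0.005], "b": [0.30, 0.90]},
    index=["s1", "s2"],
)

# First reduce across the columns: does every column exceed 1% per row?
row_all = (df > 0.01).all(axis=1)     # s1: True, s2: False

# Then reduce across the rows: does any school pass?
any_school_passes = row_all.any()     # True
```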