It is important to consider the steps that you, as an analyst, take when you first encounter a dataset after importing it into your workspace as a DataFrame. Is there a set of tasks that you usually undertake to examine the data? Are you aware of all the possible data types? This chapter begins by covering the tasks you might want to undertake when first encountering a new dataset. The chapter proceeds by answering common questions about things that are not that simple to do in pandas.
Although there is no standard approach when beginning a data analysis, it is typically a good idea to develop a routine for yourself when first examining a dataset. Similar to everyday routines that we have for waking up, showering, going to work, eating, and so on, a data analysis routine helps you to quickly get acquainted with a new dataset. This routine can manifest itself as a dynamic checklist of tasks that evolves as your familiarity with pandas and data analysis expands.
Exploratory Data Analysis (EDA) is a term used to describe the process of analyzing datasets. Typically, it does not involve model creation, but rather summarizing the characteristics of the data and visualizing them. The practice is not new; John Tukey promoted it in his 1977 book Exploratory Data Analysis.
Many of these same processes are still applicable and useful to understand a dataset. Indeed, they can also help with creating machine learning models later.
This recipe covers a small but fundamental part of EDA: the collection of metadata and descriptive statistics in a routine and systematic way. It outlines a standard set of tasks that can be undertaken when first importing any dataset as a pandas DataFrame. This recipe may help form the basis of the routine that you can implement when first examining a dataset.
Metadata describes the dataset or, more aptly, data about the data. Examples of metadata include the number of columns/rows, column names, data types of each column, the source of the dataset, the date of collection, the acceptable values for different columns, and so on. Univariate descriptive statistics are summary statistics about variables (columns) of the dataset, independent of all other variables.
First, some metadata on the college dataset will be collected, followed by basic summary statistics of each column:
After reading in the data, view a random row of the DataFrame with the .sample method:
>>> import pandas as pd
>>> import numpy as np
>>> college = pd.read_csv("data/college.csv")
>>> college.sample(random_state=42)
INSTNM CITY ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
3649 Career P... San Antonio ... 20700 14977
Get the dimensions of the DataFrame with the .shape attribute:
>>> college.shape
(7535, 27)
Get more metadata at once with the .info method:
>>> college.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7535 entries, 0 to 7534
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 INSTNM 7535 non-null object
1 CITY 7535 non-null object
2 STABBR 7535 non-null object
3 HBCU 7164 non-null float64
4 MENONLY 7164 non-null float64
5 WOMENONLY 7164 non-null float64
6 RELAFFIL 7535 non-null int64
7 SATVRMID 1185 non-null float64
8 SATMTMID 1196 non-null float64
9 DISTANCEONLY 7164 non-null float64
10 UGDS 6874 non-null float64
11 UGDS_WHITE 6874 non-null float64
12 UGDS_BLACK 6874 non-null float64
13 UGDS_HISP 6874 non-null float64
14 UGDS_ASIAN 6874 non-null float64
15 UGDS_AIAN 6874 non-null float64
16 UGDS_NHPI 6874 non-null float64
17 UGDS_2MOR 6874 non-null float64
18 UGDS_NRA 6874 non-null float64
19 UGDS_UNKN 6874 non-null float64
20 PPTUG_EF 6853 non-null float64
21 CURROPER 7535 non-null int64
22 PCTPELL 6849 non-null float64
23 PCTFLOAN 6849 non-null float64
24 UG25ABV 6718 non-null float64
25 MD_EARN_WNE_P10 6413 non-null object
26 GRAD_DEBT_MDN_SUPP 7503 non-null object
dtypes: float64(20), int64(2), object(5)
memory usage: 1.6+ MB
Get summary statistics of the numeric columns, and transpose the result for easier reading:
>>> college.describe(include=[np.number]).T
count mean ... 75% max
HBCU 7164.0 0.014238 ... 0.000000 1.0
MENONLY 7164.0 0.009213 ... 0.000000 1.0
WOMENONLY 7164.0 0.005304 ... 0.000000 1.0
RELAFFIL 7535.0 0.190975 ... 0.000000 1.0
SATVRMID 1185.0 522.819409 ... 555.000000 765.0
... ... ... ... ... ...
PPTUG_EF 6853.0 0.226639 ... 0.376900 1.0
CURROPER 7535.0 0.923291 ... 1.000000 1.0
PCTPELL 6849.0 0.530643 ... 0.712900 1.0
PCTFLOAN 6849.0 0.522211 ... 0.745000 1.0
UG25ABV 6718.0 0.410021 ... 0.572275 1.0
Get summary statistics of the object (string) columns:
>>> college.describe(include=[object]).T
count unique top freq
INSTNM 7535 7535 Academy ... 1
CITY 7535 2514 New York 87
STABBR 7535 59 CA 773
MD_EARN_W... 6413 598 PrivacyS... 822
GRAD_DEBT... 7503 2038 PrivacyS... 1510
After importing your dataset, a common task is to print out a sample of rows of the DataFrame for manual inspection with the .sample
method. The .shape
attribute returns some metadata: a tuple containing the number of rows and columns.
A method to get more metadata at once is the .info
method. It provides each column name, the number of non-missing values, the data type of each column, and the approximate memory usage of the DataFrame. Usually, a column in pandas has a single type (however, it is possible to have a column that has mixed types, and it will be reported as object
). DataFrames, as a whole, might be composed of columns with different data types.
Step 4 and step 5 produce descriptive statistics on different types of columns. By default, .describe
outputs a summary for all the numeric columns and silently drops any non-numeric columns. You can pass in other options to the include
parameter to include counts and frequencies for a column with non-numeric data types. Technically, the data types are part of a hierarchy where np.number
resides above integers and floats.
We can classify data as being either continuous or categorical. Continuous data is always numeric and can usually take on an infinite number of possibilities, such as height, weight, and salary. Categorical data represent discrete values that take on a finite number of possibilities, such as ethnicity, employment status, and car color. Categorical data can be represented numerically or with characters.
Categorical columns are usually going to be of either the object or pd.Categorical data type. Step 5 ensures that these types are represented. In both step 4 and step 5, the output DataFrame is transposed with the .T
property. This may ease readability for DataFrames with many columns as it typically allows more data to fit on the screen without scrolling.
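The interplay between the include parameter and the type hierarchy can be sketched on a toy DataFrame (the column names here are invented for illustration):

```python
import pandas as pd

# A toy DataFrame with integer, float, and string columns
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [1.5, 2.5, 3.5], "c": ["x", "y", "x"]}
)

# "number" sits above both int64 and float64 in the type hierarchy,
# so a single alias summarizes every numeric column at once
num_cols = df.describe(include=["number"]).columns.tolist()
obj_cols = df.describe(include=["object"]).columns.tolist()

print(num_cols)  # ['a', 'b']
print(obj_cols)  # ['c']
```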
It is possible to specify the exact quantiles returned from the .describe
method when used with numeric columns:
>>> college.describe(
...     include=[np.number],
... percentiles=[
... 0.01,
... 0.05,
... 0.10,
... 0.25,
... 0.5,
... 0.75,
... 0.9,
... 0.95,
... 0.99,
... ],
... ).T
count mean ... 99% max
HBCU 7164.0 0.014238 ... 1.000000 1.0
MENONLY 7164.0 0.009213 ... 0.000000 1.0
WOMENONLY 7164.0 0.005304 ... 0.000000 1.0
RELAFFIL 7535.0 0.190975 ... 1.000000 1.0
SATVRMID 1185.0 522.819409 ... 730.000000 765.0
... ... ... ... ... ...
PPTUG_EF 6853.0 0.226639 ... 0.946724 1.0
CURROPER 7535.0 0.923291 ... 1.000000 1.0
PCTPELL 6849.0 0.530643 ... 0.993908 1.0
PCTFLOAN 6849.0 0.522211 ... 0.986368 1.0
UG25ABV 6718.0 0.410021 ... 0.917383 1.0
A crucial part of data analysis involves creating and maintaining a data dictionary. A data dictionary is a table of metadata and notes on each column of data. One of the primary purposes of a data dictionary is to explain the meaning of the column names. The college dataset uses a lot of abbreviations that are likely to be unfamiliar to an analyst who is inspecting it for the first time.
A data dictionary for the college dataset is provided in the following college_data_dictionary.csv
file:
>>> pd.read_csv("data/college_data_dictionary.csv")
column_name description
0 INSTNM Institut...
1 CITY City Loc...
2 STABBR State Ab...
3 HBCU Historic...
4 MENONLY 0/1 Men ...
.. ... ...
22 PCTPELL Percent ...
23 PCTFLOAN Percent ...
24 UG25ABV Percent ...
25 MD_EARN_... Median E...
26 GRAD_DEB... Median d...
As you can see, it is immensely helpful in deciphering the abbreviated column names. DataFrames are not the best place to store data dictionaries. A platform such as Excel or Google Sheets with easy ability to edit values and append columns is a better choice. Alternatively, they can be described in a Markdown cell in Jupyter. A data dictionary is one of the first things that you can share as an analyst with collaborators.
It will often be the case that the dataset you are working with originated from a database whose administrators you will have to contact for more information. Databases have representations of their data, called schemas. If possible, investigate your dataset with a Subject Matter Expert (SME), someone with expert knowledge of the data.
pandas has precise technical definitions for many data types. However, when you load data from type-less formats such as CSV, pandas has to infer the type.
This recipe changes the data type of one of the object columns from the college dataset to the special pandas categorical data type to drastically reduce its memory usage.
After reading in the college dataset, select a handful of columns with different data types:
>>> college = pd.read_csv("data/college.csv")
>>> different_cols = [
... "RELAFFIL",
... "SATMTMID",
... "CURROPER",
... "INSTNM",
... "STABBR",
... ]
>>> col2 = college.loc[:, different_cols]
>>> col2.head()
RELAFFIL SATMTMID ... INSTNM STABBR
0 0 420.0 ... Alabama ... AL
1 0 565.0 ... Universi... AL
2 1 NaN ... Amridge ... AL
3 0 590.0 ... Universi... AL
4 0 430.0 ... Alabama ... AL
>>> col2.dtypes
RELAFFIL int64
SATMTMID float64
CURROPER int64
INSTNM object
STABBR object
dtype: object
Find the memory usage of each column with the .memory_usage method:
>>> original_mem = col2.memory_usage(deep=True)
>>> original_mem
Index 128
RELAFFIL 60280
SATMTMID 60280
CURROPER 60280
INSTNM 660240
STABBR 444565
dtype: int64
There is no need for 64 bits for the RELAFFIL column, as it contains only 0s and 1s. Let's convert this column to an 8-bit (1-byte) integer with the .astype method:
>>> col2["RELAFFIL"] = col2["RELAFFIL"].astype(np.int8)
Use the .dtypes attribute to confirm the data type change:
>>> col2.dtypes
RELAFFIL int8
SATMTMID float64
CURROPER int64
INSTNM object
STABBR object
dtype: object
Check the memory usage of each column again:
>>> col2.memory_usage(deep=True)
Index 128
RELAFFIL 7535
SATMTMID 60280
CURROPER 60280
INSTNM 660240
STABBR 444565
dtype: int64
Count the unique values in the object columns to find candidates for the categorical type:
>>> col2.select_dtypes(include=["object"]).nunique()
INSTNM 7535
STABBR 59
dtype: int64
The STABBR column is a good candidate to convert to the categorical type, as less than one percent of its values are unique:
>>> col2["STABBR"] = col2["STABBR"].astype("category")
>>> col2.dtypes
RELAFFIL int8
SATMTMID float64
CURROPER int64
INSTNM object
STABBR category
dtype: object
Compute the memory usage once more:
>>> new_mem = col2.memory_usage(deep=True)
>>> new_mem
Index 128
RELAFFIL 7535
SATMTMID 60280
CURROPER 60280
INSTNM 660699
STABBR 13576
dtype: int64
The memory for the RELAFFIL column is, as expected, an eighth of its original size, while the STABBR column has shrunk to just three percent of its original size:
>>> new_mem / original_mem
Index 1.000000
RELAFFIL 0.125000
SATMTMID 1.000000
CURROPER 1.000000
INSTNM 1.000695
STABBR 0.030538
dtype: float64
pandas defaults integer
and float
data types to 64 bits regardless of the maximum necessary size for the particular DataFrame. Integers, floats, and even Booleans may be coerced to a different data type with the .astype
method and passing it the exact type, either as a string or specific object, as done in step 4.
The RELAFFIL
column is a good choice to cast to a smaller integer type as the data dictionary explains that its values must be 0 or 1. The memory for RELAFFIL
is now an eighth of CURROPER
, which remains as its former type.
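The size arithmetic can be sketched with a synthetic flag column: an int64 value occupies 8 bytes, while an int8 value occupies 1.

```python
import numpy as np
import pandas as pd

flags = pd.Series([0, 1] * 500, dtype=np.int64)  # 1,000 values, 8 bytes each
small = flags.astype(np.int8)                    # 1 byte per value instead of 8

print(flags.memory_usage(index=False))  # 8000
print(small.memory_usage(index=False))  # 1000
```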
Columns that have an object
data type, such as INSTNM
, are not like the other pandas data types. For all the other pandas data types, each value in that column is the same data type. For instance, when a column has the int64
type, every column value is also int64
. This is not true for columns that have the object
data type. Each column value can be of any type. They can have a mix of strings, numerics, datetimes, or even other Python objects such as lists or tuples. For this reason, the object
data type is sometimes referred to as a catch-all for a column of data that doesn't match any of the other data types. The vast majority of the time, though, object
data type columns will all be strings.
Therefore, the memory of each value in an object
data type column is inconsistent. There is no predefined amount of memory for each value like the other data types. For pandas to extract the exact amount of memory of an object
data type column, the deep
parameter must be set to True
in the .memory_usage
method.
Object columns are targets for the largest memory savings. pandas has an additional categorical data type that is not available in NumPy. When converting to category, pandas internally creates a mapping from integers to each unique string value. Thus, each string only needs to be kept a single time in memory. As you can see, this change of data type reduced memory usage by 97%.
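The mapping pandas builds internally can be sketched with a small Series of repeated state abbreviations:

```python
import pandas as pd

states = pd.Series(["AL", "CA", "AL", "TX", "CA"], dtype="category")

# Each unique string is stored once in .cat.categories...
print(list(states.cat.categories))  # ['AL', 'CA', 'TX']
# ...while each row stores only a small integer code into that list
print(list(states.cat.codes))       # [0, 1, 0, 2, 1]
```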
You might also have noticed that the index uses an extremely low amount of memory. If no index is specified during DataFrame creation, as is the case in this recipe, pandas defaults the index to a RangeIndex
. The RangeIndex
is very similar to the built-in range
function. It produces values on demand and only stores the minimum amount of information needed to create an index.
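The on-demand behavior can be sketched by comparing a RangeIndex against a materialized integer index of the same length:

```python
import numpy as np
import pandas as pd

lazy = pd.RangeIndex(1_000_000)          # stores only start, stop, and step
dense = pd.Index(np.arange(1_000_000))   # stores every value in memory

# The RangeIndex takes a tiny, fixed amount of memory regardless of length
print(lazy.memory_usage() < dense.memory_usage())  # True
```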
To get a better idea of how object data type columns differ from integers and floats, a single value from each one of these columns can be modified and the resulting memory usage displayed. The CURROPER
and INSTNM
columns are of int64
and object
types, respectively:
>>> college.loc[0, "CURROPER"] = 10000000
>>> college.loc[0, "INSTNM"] = (
... college.loc[0, "INSTNM"] + "a"
... )
>>> college[["CURROPER", "INSTNM"]].memory_usage(deep=True)
Index 80
CURROPER 60280
INSTNM 660804
dtype: int64
Memory usage for CURROPER
remained the same since a 64-bit integer is more than enough space for the larger number. On the other hand, the memory usage for INSTNM
increased by 105 bytes by just adding a single letter to one value.
Python 3 uses Unicode, a standardized character representation intended to encode all the world's writing systems. How much memory Unicode strings take on your machine depends on how Python was built. On this machine, it uses up to 4 bytes per character. pandas has some overhead (100 bytes) when making the first modification to a character value. Afterward, increments of 5 bytes per character are sustained.
Not all columns can be coerced to the desired type. Take a look at the MENONLY
column, which, from the data dictionary, appears to contain only 0s or 1s. The actual data type of this column upon import unexpectedly turns out to be float64
. The reason for this is that there happen to be missing values, denoted by np.nan
. There is no integer representation for missing values for the int64
type (note that the Int64
type found in pandas 0.24+ does support missing values, but it is not used by default). Any numeric column with even a single missing value will be turned into a float column. Furthermore, any column of an integer data type will automatically be coerced to a float if one of the values becomes missing:
>>> college["MENONLY"].dtype
dtype('float64')
>>> college["MENONLY"].astype(np.int8)
Traceback (most recent call last):
...
ValueError: Cannot convert non-finite values (NA or inf) to integer
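The nullable Int64 extension type mentioned above can hold missing values; a sketch on a small Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0, np.nan])

# s.astype(np.int64) would raise ValueError because of the NaN,
# but the nullable extension type stores the missing value as <NA>
nullable = s.astype("Int64")

print(nullable.dtype)               # Int64
print(int(nullable.isna().sum()))   # 1
```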
Additionally, it is possible to substitute string names in place of Python objects when referring to data types. For instance, when using the include
parameter in the .describe
DataFrame method, it is possible to pass a list of either the NumPy or pandas objects or their equivalent string representation. For instance, each of the following produces the same result:
>>> college.describe(include=['int64', 'float64']).T
>>> college.describe(include=[np.int64, np.float64]).T
>>> college.describe(include=['int', 'float']).T
>>> college.describe(include=['number']).T
The type strings can also be used in combination with the .astype
method:
>>> college.assign(
... MENONLY=college["MENONLY"].astype("float16"),
... RELAFFIL=college["RELAFFIL"].astype("int8"),
... )
INSTNM CITY ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
0 Alabama ... Normal ... 30300 33888
1 Universi... Birmingham ... 39700 21941.5
2 Amridge ... Montgomery ... 40100 23370
3 Universi... Huntsville ... 45500 24097
4 Alabama ... Montgomery ... 26600 33118.5
... ... ... ... ... ...
7530 SAE Inst... Emeryville ... NaN 9500
7531 Rasmusse... Overland... ... NaN 21163
7532 National... Highland... ... NaN 6333
7533 Bay Area... San Jose ... NaN PrivacyS...
7534 Excel Le... San Antonio ... NaN 12125
Lastly, it is possible to see the enormous memory difference between the minimal RangeIndex
and Int64Index
, which stores every row index in memory:
>>> college.index = pd.Int64Index(college.index)
>>> college.index.memory_usage() # previously was just 80
60280
This recipe can be used to create catchy news headlines such as Out of the Top 100 Universities, These 5 have the Lowest Tuition, or From the Top 50 Cities to Live, these 10 are the Most Affordable.
During analysis, it is possible that you will first need to find a grouping of data that contains the top n values in a single column and, from this subset, find the bottom m values based on a different column.
In this recipe, we find the five lowest budget movies from the top 100 scoring movies by taking advantage of the convenience methods: .nlargest
and .nsmallest
.
Read in the movie dataset, and select the columns movie_title, imdb_score, and budget:
>>> movie = pd.read_csv("data/movie.csv")
>>> movie2 = movie[["movie_title", "imdb_score", "budget"]]
>>> movie2.head()
movie_title imdb_score budget
0 Avatar 7.9 237000000.0
1 Pirates ... 7.1 300000000.0
2 Spectre 6.8 245000000.0
3 The Dark... 8.5 250000000.0
4 Star War... 7.1 NaN
Use the .nlargest method to select the top 100 movies by imdb_score:
>>> movie2.nlargest(100, "imdb_score").head()
movie_title imdb_score budget
2725 Towering Inferno 9.5 NaN
1920 The Shawshank Redemption 9.3 25000000.0
3402 The Godfather 9.2 6000000.0
2779 Dekalog 9.1 NaN
4312 Kickboxer: Vengeance 9.1 17000000.0
Chain the .nsmallest method to return the five lowest-budget films among those with a top-100 score:
>>> (
... movie2.nlargest(100, "imdb_score").nsmallest(
... 5, "budget"
... )
... )
movie_title imdb_score budget
4804 Butterfly Girl 8.7 180000.0
4801 Children of Heaven 8.5 180000.0
4706 12 Angry Men 8.9 350000.0
4550 A Separation 8.4 500000.0
4636 The Other Dream Team 8.4 500000.0
The first parameter of the .nlargest
method, n
, must be an integer and selects the number of rows to be returned. The second parameter, columns
, takes a column name as a string. Step 2 returns the 100 highest-scoring movies. We could have saved this intermediate result as its own variable but instead, we chain the .nsmallest
method to it in step 3, which returns exactly five rows, sorted by budget.
It is possible to pass a list of column names to the columns
parameter of the .nlargest
and .nsmallest
methods. This would only be useful to break ties in the event that there were duplicate values sharing the nth ranked spot in the first column in the list.
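Tie-breaking with a column list can be sketched on a toy DataFrame (values invented):

```python
import pandas as pd

df = pd.DataFrame(
    {"score": [9, 9, 8], "budget": [2, 5, 7]},
    index=["a", "b", "c"],
)

# With a single column, the tie at score=9 keeps the first row encountered
print(df.nlargest(1, "score").index.tolist())              # ['a']
# With a list, budget breaks the tie in favor of the larger value
print(df.nlargest(1, ["score", "budget"]).index.tolist())  # ['b']
```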
One of the most basic and common operations to perform during data analysis is to select rows containing the largest value of some column within a group. For instance, this would be like finding the highest-rated film of each year or the highest-grossing film by content rating. To accomplish this task, we need to sort the groups as well as the column used to rank each member of the group, and then extract the highest member of each group.
In this recipe, we will find the highest-rated film of each year.
Read in the movie dataset, and select the columns movie_title, title_year, and imdb_score:
>>> movie = pd.read_csv("data/movie.csv")
>>> movie[["movie_title", "title_year", "imdb_score"]]
movie_title ...
0 Avatar ...
1 Pirates of the Caribbean: At World's End ...
2 Spectre ...
3 The Dark Knight Rises ...
4 Star Wars: Episode VII - The Force Awakens ...
... ... ...
4911 Signed Sealed Delivered ...
4912 The Following ...
4913 A Plague So Pleasant ...
4914 Shanghai Calling ...
4915 My Date with Drew ...
Use the .sort_values method to sort the DataFrame by title_year. The default behavior sorts from the smallest to the largest. Use the ascending=False parameter to invert this behavior:
>>> (
... movie[
... ["movie_title", "title_year", "imdb_score"]
... ].sort_values("title_year", ascending=True)
... )
movie_title ...
4695 Intolerance: Love's Struggle Throughout the Ages ...
4833 Over the Hill to the Poorhouse ...
4767 The Big Parade ...
2694 Metropolis ...
4697 The Broadway Melody ...
... ... ...
4683 Heroes ...
4688 Home Movies ...
4704 Revolution ...
4752 Happy Valley ...
4912 The Following ...
Sort multiple columns at once by passing a list of column names; here both title_year and imdb_score are sorted in descending order:
>>> (
... movie[
... ["movie_title", "title_year", "imdb_score"]
... ].sort_values(
... ["title_year", "imdb_score"], ascending=False
... )
... )
movie_title title_year imdb_score
4312 Kickboxer: Vengeance 2016.0 9.1
4277 A Beginner's Guide to Snuff 2016.0 8.7
3798 Airlift 2016.0 8.5
27 Captain America: Civil War 2016.0 8.2
98 Godzilla Resurgence 2016.0 8.2
... ... ... ...
1391 Rush Hour NaN 5.8
4031 Creature NaN 5.0
2165 Meet the Browns NaN 3.5
3246 The Bold and the Beautiful NaN 3.5
2119 The Bachelor NaN 2.9
Use the .drop_duplicates method to keep only the first row of every year:
>>> (
... movie[["movie_title", "title_year", "imdb_score"]]
... .sort_values(
... ["title_year", "imdb_score"], ascending=False
... )
... .drop_duplicates(subset="title_year")
... )
movie_title title_year imdb_score
4312 Kickboxe... 2016.0 9.1
3745 Running ... 2015.0 8.6
4369 Queen of... 2014.0 8.7
3935 Batman: ... 2013.0 8.4
3 The Dark... 2012.0 8.5
... ... ... ...
2694 Metropolis 1927.0 8.3
4767 The Big ... 1925.0 8.3
4833 Over the... 1920.0 4.8
4695 Intolera... 1916.0 8.0
2725 Towering... NaN 9.5
This example shows how I use chaining to build up and test a sequence of pandas operations. In step 1, we slim the dataset down to concentrate on only the columns of importance. This recipe would work the same with the entire DataFrame. Step 2 shows how to sort a DataFrame by a single column, which is not exactly what we wanted. Step 3 sorts multiple columns at the same time. It works by first sorting all of title_year
and then, within each value of title_year
, sorts by imdb_score
.
The default behavior of the .drop_duplicates
method is to keep the first occurrence of each unique row, which would not drop any rows as each row is unique. However, the subset
parameter alters it to only consider the column (or list of columns) given to it. In this example, only one row for each year will be returned. As we sorted by year and score in the last step, the highest-scoring movie for each year is what we get.
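The sort-then-deduplicate pattern can be sketched on a toy table of films (titles and values invented):

```python
import pandas as pd

movies = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "year": [2000, 2000, 2001, 2001],
        "score": [7.0, 8.5, 6.0, 9.0],
    }
)

# Sort so the best-scoring film of each year comes first, then keep
# only the first row per year
best_per_year = (
    movies.sort_values(["year", "score"], ascending=False)
    .drop_duplicates(subset="year")
)
print(best_per_year["title"].tolist())  # ['D', 'B']
```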
As in most things pandas, there is more than one way to do this. If you find yourself comfortable with grouping operations, you can use the .groupby
method to do this as well:
>>> (
... movie[["movie_title", "title_year", "imdb_score"]]
... .groupby("title_year", as_index=False)
... .apply(
...     lambda df: df.sort_values(
... "imdb_score", ascending=False
... ).head(1)
... )
... .droplevel(0)
... .sort_values("title_year", ascending=False)
... )
movie_title title_year imdb_score
90 4312 Kickboxe... 2016.0 9.1
89 3745 Running ... 2015.0 8.6
88 4369 Queen of... 2014.0 8.7
87 3935 Batman: ... 2013.0 8.4
86 3 The Dark... 2012.0 8.5
... ... ... ...
4 4555 Pandora'... 1929.0 8.0
3 2694 Metropolis 1927.0 8.3
2 4767 The Big ... 1925.0 8.3
1 4833 Over the... 1920.0 4.8
0 4695 Intolera... 1916.0 8.0
It is possible to sort one column in ascending order while simultaneously sorting another column in descending order. To accomplish this, pass in a list of Booleans to the ascending
parameter that corresponds to how you would like each column sorted. The following sorts title_year
and content_rating
in descending order and budget
in ascending order. It then finds the lowest budget film for each year and content rating group:
>>> (
... movie[
... [
... "movie_title",
... "title_year",
... "content_rating",
... "budget",
... ]
... ]
... .sort_values(
... ["title_year", "content_rating", "budget"],
... ascending=[False, False, True],
... )
... .drop_duplicates(
... subset=["title_year", "content_rating"]
... )
... )
movie_title title_year content_rating budget
4026 Compadres 2016.0 R 3000000.0
4658 Fight to... 2016.0 PG-13 150000.0
4661 Rodeo Girl 2016.0 PG 500000.0
3252 The Wailing 2016.0 Not Rated NaN
4659 Alleluia... 2016.0 NaN 500000.0
... ... ... ... ...
2558 Lilyhammer NaN TV-MA 34000000.0
807 Sabrina,... NaN TV-G 3000000.0
848 Stargate... NaN TV-14 1400000.0
2436 Carlos NaN Not Rated NaN
2119 The Bach... NaN NaN 3000000.0
By default, .drop_duplicates
keeps the very first appearance of a value, but this behavior may be modified by passing keep='last'
to select the last row of each group or keep=False
to drop all duplicates entirely.
The previous two recipes work similarly by sorting values in slightly different manners. Finding the top n values of a column of data is equivalent to sorting the entire column in descending order and taking the first n values. pandas has many operations that are capable of doing this in a variety of ways.
In this recipe, we will replicate the Selecting the smallest of the largest recipe with the .sort_values
method and explore the differences between the two.
Recreate the result of the previous recipe with .nlargest and .nsmallest:
>>> movie = pd.read_csv("data/movie.csv")
>>> (
... movie[["movie_title", "imdb_score", "budget"]]
... .nlargest(100, "imdb_score")
... .nsmallest(5, "budget")
... )
movie_title imdb_score budget
4804 Butterfly Girl 8.7 180000.0
4801 Children of Heaven 8.5 180000.0
4706 12 Angry Men 8.9 350000.0
4550 A Separation 8.4 500000.0
4636 The Other Dream Team 8.4 500000.0
Use .sort_values to replicate the first part of the expression, and grab the first 100 rows with the .head method:
>>> (
... movie[["movie_title", "imdb_score", "budget"]]
... .sort_values("imdb_score", ascending=False)
... .head(100)
... )
movie_title imdb_score budget
2725 Towering... 9.5 NaN
1920 The Shaw... 9.3 25000000.0
3402 The Godf... 9.2 6000000.0
2779 Dekalog 9.1 NaN
4312 Kickboxe... 9.1 17000000.0
... ... ... ...
3799 Anne of ... 8.4 NaN
3777 Requiem ... 8.4 4500000.0
3935 Batman: ... 8.4 3500000.0
4636 The Othe... 8.4 500000.0
2455 Aliens 8.4 18500000.0
Use .sort_values with .head again to grab the five lowest-budget rows:
>>> (
... movie[["movie_title", "imdb_score", "budget"]]
... .sort_values("imdb_score", ascending=False)
... .head(100)
... .sort_values("budget")
... .head(5)
... )
movie_title imdb_score budget
4815 A Charlie Brown Christmas 8.4 150000.0
4801 Children of Heaven 8.5 180000.0
4804 Butterfly Girl 8.7 180000.0
4706 12 Angry Men 8.9 350000.0
4636 The Other Dream Team 8.4 500000.0
The .sort_values
method can nearly replicate .nlargest
by chaining the .head
method after the operation, as seen in step 2. Step 3 replicates .nsmallest
by chaining another .sort_values
method and completes the query by taking just the first five rows with the .head
method.
Take a look at the output from the first DataFrame from step 1 and compare it with the output from step 3. Are they the same? No! What happened? To understand why the two results are not equivalent, let's look at the tail of the intermediate steps of each recipe:
>>> (
... movie[["movie_title", "imdb_score", "budget"]]
... .nlargest(100, "imdb_score")
... .tail()
... )
movie_title imdb_score budget
4023 Oldboy 8.4 3000000.0
4163 To Kill a Mockingbird 8.4 2000000.0
4395 Reservoir Dogs 8.4 1200000.0
4550 A Separation 8.4 500000.0
4636 The Other Dream Team 8.4 500000.0
>>> (
... movie[["movie_title", "imdb_score", "budget"]]
... .sort_values("imdb_score", ascending=False)
... .head(100)
... .tail()
... )
movie_title imdb_score budget
3799 Anne of ... 8.4 NaN
3777 Requiem ... 8.4 4500000.0
3935 Batman: ... 8.4 3500000.0
4636 The Othe... 8.4 500000.0
2455 Aliens 8.4 18500000.0
The issue arises because more than 100 movies exist with a rating of at least 8.4. Each of the methods, .nlargest
and .sort_values
, breaks ties differently, which results in a slightly different 100-row DataFrame. If you pass in kind='mergesort'
to the .sort_values
method, you will get the same result as .nlargest
.
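Stability only matters for tied values: a stable sort keeps tied rows in their original order, which is what lines the two results up. A sketch on a small Series with ties:

```python
import pandas as pd

s = pd.Series([2, 1, 2, 1], index=["a", "b", "c", "d"])

# mergesort is stable: among equal values, the original row order
# ("b" before "d", "a" before "c") is preserved
stable = s.sort_values(kind="mergesort")
print(stable.index.tolist())  # ['b', 'd', 'a', 'c']
```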
There are many strategies to trade stocks. One basic type of trade that many investors employ is the stop order. A stop order is an order placed by an investor to buy or sell a stock that executes whenever the market price reaches a certain point. Stop orders are useful to both prevent huge losses and protect gains.
For this recipe, we will only be examining stop orders used to sell currently owned stocks. In a typical stop order, the price does not change throughout the lifetime of the order. For instance, if you purchased a stock for $100 per share, you might want to set a stop order at $90 per share to limit your downside to 10%.
A more advanced strategy would be to continually modify the sale price of the stop order to track the value of the stock if it increases in value. This is called a trailing stop order. Concretely, if the same $100 stock increases to $120, then a trailing stop order 10% below the current market value would move the sale price to $108.
The trailing stop order never moves down and is always tied to the maximum value since the time of purchase. If the stock fell from $120 to $110, the stop order would still remain at $108. It would only increase if the price moved above $120.
This recipe requires the use of the third-party package pandas-datareader
, which fetches stock market prices online. It does not come pre-installed with pandas. To install this package, use the command line and run conda install pandas-datareader
or pip install pandas-datareader
. You may need to install the requests_cache
library as well.
This recipe determines the trailing stop order price given an initial purchase price for any stock.
>>> import datetime
>>> import pandas_datareader.data as web
>>> import requests_cache
>>> session = requests_cache.CachedSession(
... cache_name="cache",
... backend="sqlite",
... expire_after=datetime.timedelta(days=90),
... )
>>> tsla = web.DataReader(
... "tsla",
... data_source="yahoo",
... start="2017-1-1",
... session=session,
... )
>>> tsla.head(8)
High Low ... Volume Adj Close
Date ...
2017-01-03 220.330002 210.960007 ... 5923300 216.990005
2017-01-04 228.000000 214.309998 ... 11213500 226.990005
2017-01-05 227.479996 221.949997 ... 5911700 226.750000
2017-01-06 230.309998 225.449997 ... 5527900 229.009995
2017-01-09 231.919998 228.000000 ... 3979500 231.279999
2017-01-10 232.000000 226.889999 ... 3660000 229.869995
2017-01-11 229.979996 226.679993 ... 3650800 229.729996
2017-01-12 230.699997 225.580002 ... 3790200 229.589996
Select only the closing price:
>>> tsla_close = tsla["Close"]
Use the .cummax method to track the highest closing price up to the current date:
>>> tsla_cummax = tsla_close.cummax()
>>> tsla_cummax.head()
Date
2017-01-03 216.990005
2017-01-04 226.990005
2017-01-05 226.990005
2017-01-06 229.009995
2017-01-09 231.279999
Name: Close, dtype: float64
Chain a multiplication by 0.9 to create a stop price 10% below the running maximum:
>>> (tsla["Close"].cummax().mul(0.9).head())
Date
2017-01-03 195.291005
2017-01-04 204.291005
2017-01-05 204.291005
2017-01-06 206.108995
2017-01-09 208.151999
Name: Close, dtype: float64
The .cummax
method works by retaining the maximum value encountered up to and including the current value. Multiplying this series by 0.9, or whatever cushion you would like to use, creates the trailing stop order. In this particular example, TSLA increased in value, and thus, its trailing stop has also increased.
This recipe gives just a taste of how pandas may be used to trade securities; it stops short of calculating the return if and when the stop order triggers.
A very similar strategy may be used during a weight-loss program: set a warning any time you stray too far above your minimum weight. pandas provides the .cummin method to track the minimum value. If you track your daily weight in a Series, the following code gives a trailing threshold 5% above your lowest recorded weight to date:
weight.cummin() * 1.05