Every dimension of data in a Series or DataFrame is labeled in the Index object. It is this Index that separates pandas data structures from NumPy's n-dimensional array. Indexes provide meaningful labels for each row and column of data, and pandas users can select data through the use of these labels. Additionally, pandas allows its users to select data according to the position of the rows and columns. This dual selection capability, one using names and the other using the position, makes for powerful yet confusing syntax to select subsets of data.
Selecting data by label or position is not unique to pandas. Python dictionaries and lists are built-in data structures that select their data in exactly one of these ways. Both dictionaries and lists have precise instructions and limited use cases for what you can index with. A dictionary's key (its label) must be an immutable object, such as a string, integer, or tuple. Lists must either use integers (the position) or slice objects for selection. Dictionaries can only select one object at a time by passing the key to the indexing operator. In this way, pandas is combining the ability to select data using integers, as with lists, and labels, as with dictionaries.
Series and DataFrames are complex data containers that have multiple attributes that use an index operation to select data in different ways. In addition to the index operator itself, the .iloc
and .loc
attributes are available and use the index operator in their own unique ways.
Series and DataFrames allow selection by position (like Python lists) and by label (like Python dictionaries). When we index off of the .iloc
attribute, pandas selects only by position and works similarly to Python lists. The .loc
attribute selects only by index label, which is similar to how Python dictionaries work.
The .loc
and .iloc
attributes are available on both Series and DataFrames. This recipe shows how to select Series data by position with .iloc
and by label with .loc
. These indexers accept scalar values, lists, and slices.
The terminology can get confusing. An index operation is when you put brackets, [],
following a variable. For instance, given a Series s
, you can select data in the following ways: s[item]
and s.loc[item]
. The first performs the index operation directly on the Series. The second performs the index operation on the .loc
attribute.
>>> import pandas as pd
>>> import numpy as np
>>> college = pd.read_csv(
... "data/college.csv", index_col="INSTNM"
... )
>>> city = college["CITY"]
>>> city
INSTNM
Alabama A & M University Normal
University of Alabama at Birmingham Birmingham
Amridge University Montgomery
University of Alabama in Huntsville Huntsville
Alabama State University Montgomery
...
SAE Institute of Technology San Francisco Emeryville
Rasmussen College - Overland Park Overland...
National Personal Training Institute of Cleveland Highland...
Bay Area Medical Academy - San Jose Satellite Location San Jose
Excel Learning Center-San Antonio South San Antonio
Name: CITY, Length: 7535, dtype: object
>>> city["Alabama A & M University"]
'Normal'
.loc
attribute by name:
>>> city.loc["Alabama A & M University"]
'Normal'
.iloc
attribute by position:
>>> city.iloc[0]
'Normal'
>>> city[
... [
... "Alabama A & M University",
... "Alabama State University",
... ]
... ]
INSTNM
Alabama A & M University Normal
Alabama State University Montgomery
Name: CITY, dtype: object
.loc
:
>>> city.loc[
... [
... "Alabama A & M University",
... "Alabama State University",
... ]
... ]
INSTNM
Alabama A & M University Normal
Alabama State University Montgomery
Name: CITY, dtype: object
.iloc
:
>>> city.iloc[[0, 4]]
INSTNM
Alabama A & M University Normal
Alabama State University Montgomery
Name: CITY, dtype: object
>>> city[
... "Alabama A & M University":"Alabama State University"
... ]
INSTNM
Alabama A & M University Normal
University of Alabama at Birmingham Birmingham
Amridge University Montgomery
University of Alabama in Huntsville Huntsville
Alabama State University Montgomery
Name: CITY, dtype: object
>>> city[0:5]
INSTNM
Alabama A & M University Normal
University of Alabama at Birmingham Birmingham
Amridge University Montgomery
University of Alabama in Huntsville Huntsville
Alabama State University Montgomery
Name: CITY, dtype: object
.loc
:
>>> city.loc[
... "Alabama A & M University":"Alabama State University"
... ]
INSTNM
Alabama A & M University Normal
University of Alabama at Birmingham Birmingham
Amridge University Montgomery
University of Alabama in Huntsville Huntsville
Alabama State University Montgomery
Name: CITY, dtype: object
.iloc
:
>>> city.iloc[0:5]
INSTNM
Alabama A & M University Normal
University of Alabama at Birmingham Birmingham
Amridge University Montgomery
University of Alabama in Huntsville Huntsville
Alabama State University Montgomery
Name: CITY, dtype: object
>>> alabama_mask = city.isin(["Birmingham", "Montgomery"])
>>> city[alabama_mask]
INSTNM
University of Alabama at Birmingham Birmingham
Amridge University Montgomery
Alabama State University Montgomery
Auburn University at Montgomery Montgomery
Birmingham Southern College Birmingham
...
Fortis Institute-Birmingham Birmingham
Hair Academy Montgomery
Brown Mackie College-Birmingham Birmingham
Nunation School of Cosmetology Birmingham
Troy University-Montgomery Campus Montgomery
Name: CITY, Length: 26, dtype: object
If you have a Series, you can pull out the data using index operations. Depending on what you index with, you might get different types as output. If you index with a scalar on a Series, you will get back a scalar value. If you index with a list or a slice, you will get back a Series.
Looking at the examples, it appears that indexing directly off of the Series provides the best of both worlds: you can index by position or label. I would caution against using it at all. Remember, the Zen of Python states, "Explicit is better than implicit." Both .iloc
and .loc
are explicit, but indexing directly off of the Series is not explicit; it requires us to think about what we are indexing with and what type of index we have.
Consider this toy Series that uses integer values for the index:
>>> s = pd.Series([10, 20, 35, 28], index=[5, 2, 3, 1])
>>> s
5 10
2 20
3 35
1 28
dtype: int64
>>> s[0:4]
5 10
2 20
3 35
1 28
dtype: int64
>>> s[5]
10
>>> s[1]
28
When you index with a slice directly on a Series, it uses position, but otherwise it goes by label. This is confusing to the future you and future readers of your code. Remember, optimizing for readability is better than optimizing for easy-to-write code. The takeaway is to use the .iloc
and .loc
indexers.
Remember that when you slice by position, pandas uses the half-open interval. This interval is probably something you learned back in high school and promptly forgot. The half-open interval includes the first index, but not the end index. However, when you slice by label, pandas uses the closed interval and includes both the start and end index. This behavior is inconsistent with Python in general, but is practical for labels.
All of the examples in this section could be performed directly on the original DataFrame by using .loc
or .iloc
. We can pass in a tuple (without parentheses) of row and column labels or positions, respectively:
>>> college.loc["Alabama A & M University", "CITY"]
'Normal'
>>> college.iloc[0, 0]
'Normal'
>>> college.loc[
... [
... "Alabama A & M University",
... "Alabama State University",
... ],
... "CITY",
... ]
INSTNM
Alabama A & M University Normal
Alabama State University Montgomery
Name: CITY, dtype: object
>>> college.iloc[[0, 4], 0]
INSTNM
Alabama A & M University Normal
Alabama State University Montgomery
Name: CITY, dtype: object
>>> college.loc[
... "Alabama A & M University":"Alabama State University",
... "CITY",
... ]
INSTNM
Alabama A & M University Normal
University of Alabama at Birmingham Birmingham
Amridge University Montgomery
University of Alabama in Huntsville Huntsville
Alabama State University Montgomery
Name: CITY, dtype: object
>>> college.iloc[0:5, 0]
INSTNM
Alabama A & M University Normal
University of Alabama at Birmingham Birmingham
Amridge University Montgomery
University of Alabama in Huntsville Huntsville
Alabama State University Montgomery
Name: CITY, dtype: object
Care needs to be taken when using slicing off of .loc
. If the start index appears after the stop index, then an empty Series is returned without an exception:
>>> city.loc[
... "Reid State Technical College":"Alabama State University"
... ]
Series([], Name: CITY, dtype: object)
The most explicit and preferred way to select DataFrame rows is with .iloc
and .loc
. They are both capable of selecting by rows or by rows and columns.
This recipe shows you how to select rows from a DataFrame using the .iloc
and .loc
indexers:
>>> college = pd.read_csv(
... "data/college.csv", index_col="INSTNM"
... )
>>> college.sample(5, random_state=42)
CITY STABBR ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM ...
Career Po... San Antonio TX ... 20700 14977
Ner Israe... Baltimore MD ... PrivacyS... PrivacyS...
Reflectio... Decatur IL ... NaN PrivacyS...
Capital A... Baton Rouge LA ... 26400 PrivacyS...
West Virg... Montgomery WV ... 43400 23969
<BLANKLINE>
[5 rows x 26 columns]
.iloc
:
>>> college.iloc[60]
CITY Anchorage
STABBR AK
HBCU 0
MENONLY 0
WOMENONLY 0
...
PCTPELL 0.2385
PCTFLOAN 0.2647
UG25ABV 0.4386
MD_EARN_WNE_P10 42500
GRAD_DEBT_MDN_SUPP 19449.5
Name: University of Alaska Anchorage, Length: 26, dtype: object
Because Python is zero-based, this is actually the 61st row. Note that pandas represents this row as a Series.
.loc
:
>>> college.loc["University of Alaska Anchorage"]
CITY Anchorage
STABBR AK
HBCU 0
MENONLY 0
WOMENONLY 0
...
PCTPELL 0.2385
PCTFLOAN 0.2647
UG25ABV 0.4386
MD_EARN_WNE_P10 42500
GRAD_DEBT_MDN_SUPP 19449.5
Name: University of Alaska Anchorage, Length: 26, dtype: object
.iloc
:
>>> college.iloc[[60, 99, 3]]
CITY STABBR ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM ...
Universit... Anchorage AK ... 42500 19449.5
Internati... Tempe AZ ... 22200 10556
Universit... Huntsville AL ... 45500 24097
<BLANKLINE>
[3 rows x 26 columns]
Because we passed in a list of row positions, this returns a DataFrame.
.loc
by passing it a list of the institution names:
>>> labels = [
... "University of Alaska Anchorage",
... "International Academy of Hair Design",
... "University of Alabama in Huntsville",
... ]
>>> college.loc[labels]
CITY STABBR ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM ...
Universit... Anchorage AK ... 42500 19449.5
Internati... Tempe AZ ... 22200 10556
Universit... Huntsville AL ... 45500 24097
<BLANKLINE>
[3 rows x 26 columns]
.iloc
to select contiguous rows of the data:
>>> college.iloc[99:102]
CITY STABBR ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM ...
Internati... Tempe AZ ... 22200 10556
GateWay C... Phoenix AZ ... 29800 7283
Mesa Comm... Mesa AZ ... 35200 8000
<BLANKLINE>
[3 rows x 26 columns]
.loc
and is a closed interval (it includes both the start label and the stop label):
>>> start = "International Academy of Hair Design"
>>> stop = "Mesa Community College"
>>> college.loc[start:stop]
CITY STABBR ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM ...
Internati... Tempe AZ ... 22200 10556
GateWay C... Phoenix AZ ... 29800 7283
Mesa Comm... Mesa AZ ... 35200 8000
<BLANKLINE>
[3 rows x 26 columns]
When we pass a scalar value, a list of scalars, or a slice to .iloc
or .loc
, this causes pandas to scan the index for the appropriate rows and return them. If a single scalar value is passed, a Series is returned. If a list or slice is passed, then a DataFrame is returned.
In step 5, the list of index labels can be selected directly from the DataFrame returned in step 4 without the need for copying and pasting:
>>> college.iloc[[60, 99, 3]].index.tolist()
['University of Alaska Anchorage', 'International Academy of Hair Design', 'University of Alabama in Huntsville']
There are many ways to select rows and columns. The easiest method to select one or more columns from a DataFrame is to index off of the DataFrame. However, this approach has a limitation. Indexing directly on a DataFrame does not allow you to select both rows and columns simultaneously. To select rows and columns, you will need to pass both valid row and column selections separated by a comma to either .iloc
or .loc
.
The generic form to select rows and columns will look like the following code:
df.iloc[row_idxs, column_idxs]
df.loc[row_names, column_names]
Where row_idxs
and column_idxs
can be scalar integers, lists of integers, or integer slices. While row_names
and column_names
can be the scalar names, lists of names, or names slices, row_names
can also be a Boolean array.
In this recipe, each step shows a simultaneous row and column selection using both .iloc
and .loc
.
>>> college = pd.read_csv(
... "data/college.csv", index_col="INSTNM"
... )
>>> college.iloc[:3, :4]
CITY STABBR HBCU MENONLY
INSTNM
Alabama A... Normal AL 1.0 0.0
Universit... Birmingham AL 0.0 0.0
Amridge U... Montgomery AL 0.0 0.0
>>> college.loc[:"Amridge University", :"MENONLY"]
CITY STABBR HBCU MENONLY
INSTNM
Alabama A... Normal AL 1.0 0.0
Universit... Birmingham AL 0.0 0.0
Amridge U... Montgomery AL 0.0 0.0
>>> college.iloc[:, [4, 6]].head()
WOMENONLY SATVRMID
INSTNM
Alabama A & M University 0.0 424.0
University of Alabama at Birmingham 0.0 570.0
Amridge University 0.0 NaN
University of Alabama in Huntsville 0.0 595.0
Alabama State University 0.0 425.0
>>> college.loc[:, ["WOMENONLY", "SATVRMID"]].head()
WOMENONLY SATVRMID
INSTNM
Alabama A & M University 0.0 424.0
University of Alabama at Birmingham 0.0 570.0
Amridge University 0.0 NaN
University of Alabama in Huntsville 0.0 595.0
Alabama State University 0.0 425.0
>>> college.iloc[[100, 200], [7, 15]]
SATMTMID UGDS_NHPI
INSTNM
GateWay Community College NaN 0.0029
American Baptist Seminary of the West NaN NaN
>>> rows = [
... "GateWay Community College",
... "American Baptist Seminary of the West",
... ]
>>> columns = ["SATMTMID", "UGDS_NHPI"]
>>> college.loc[rows, columns]
SATMTMID UGDS_NHPI
INSTNM
GateWay Community College NaN 0.0029
American Baptist Seminary of the West NaN NaN
>>> college.iloc[5, -4]
0.401
>>> college.loc["The University of Alabama", "PCTFLOAN"]
0.401
>>> college.iloc[90:80:-2, 5]
INSTNM
Empire Beauty School-Flagstaff 0
Charles of Italy Beauty College 0
Central Arizona College 0
University of Arizona 0
Arizona State University-Tempe 0
Name: RELAFFIL, dtype: int64
>>> start = "Empire Beauty School-Flagstaff"
>>> stop = "Arizona State University-Tempe"
>>> college.loc[start:stop:-2, "RELAFFIL"]
INSTNM
Empire Beauty School-Flagstaff 0
Charles of Italy Beauty College 0
Central Arizona College 0
University of Arizona 0
Arizona State University-Tempe 0
Name: RELAFFIL, dtype: int64
One of the keys to selecting rows and columns at the same time is to understand the use of the comma in the brackets. The selection to the left of the comma always selects rows based on the row index. The selection to the right of the comma always selects columns based on the column index.
It is not necessary to make a selection for both rows and columns simultaneously. Step 2 shows how to select all the rows and a subset of columns. The colon (:
) represents a slice object that returns all the values for that dimension.
To select only rows (along with all the columns), it is not necessary to use a colon following a comma. The default behavior is to select all the columns if there is no comma present. The previous recipe selected rows in exactly this manner. You can, however, use a colon to represent a slice of all the columns. The following lines of code are equivalent:
college.iloc[:10]
college.iloc[:10, :]
Sometimes, you want the functionality of both .iloc
and .loc
, to select data by both position and label. In earlier versions of pandas, .ix
was available to select data by both position and label. While this conveniently worked for those specific situations, it was ambiguous by nature and was a source of confusion for many pandas users. The .ix
indexer has subsequently been deprecated and thus should be avoided.
Before the .ix
deprecation, it was possible to select the first five rows and the columns of the college dataset from UGDS_WHITE
through UGDS_UNKN
using college.ix[:5, 'UGDS_WHITE':'UGDS_UNKN']
. This is now impossible to do directly using .loc
or .iloc
. The following recipe shows how to find the integer location of the columns and then use .iloc
to complete the selection.
INSTNM
) as the index:
>>> college = pd.read_csv(
... "data/college.csv", index_col="INSTNM"
... )
.get_loc
to find the integer position of the desired columns:
>>> col_start = college.columns.get_loc("UGDS_WHITE")
>>> col_end = college.columns.get_loc("UGDS_UNKN") + 1
>>> col_start, col_end
(10, 19)
col_start
and col_end
to select columns by position using .iloc
:
>>> college.iloc[:5, col_start:col_end]
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 0.0333 0.9353 ... 0.0059 0.0138
Universit... 0.5922 0.2600 ... 0.0179 0.0100
Amridge U... 0.2990 0.4192 ... 0.0000 0.2715
Universit... 0.6988 0.1255 ... 0.0332 0.0350
Alabama S... 0.0158 0.9208 ... 0.0243 0.0137
<BLANKLINE>
[5 rows x 9 columns]
Step 2 first retrieves the column index through the .columns
attribute. Indexes have a .get_loc
method, which accepts an index label and returns its integer location. We find both the start and end integer locations for the columns that we wish to slice. We add one because slicing with .iloc
uses the half-open interval and is exclusive of the last item. Step 3 uses slice notation with the row and column positions.
We can do a very similar operation to use positions to get the labels for .loc
to work. The following shows how to select the 10th through 15th (inclusive) rows, along with columns UGDS_WHITE
through UGDS_UNKN
:
>>> row_start = college.index[10]
>>> row_end = college.index[15]
>>> college.loc[row_start:row_end, "UGDS_WHITE":"UGDS_UNKN"]
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Birmingha... 0.7983 0.1102 ... 0.0000 0.0051
Chattahoo... 0.4661 0.4372 ... 0.0000 0.0139
Concordia... 0.0280 0.8758 ... 0.0466 0.0000
South Uni... 0.3046 0.6054 ... 0.0019 0.0326
Enterpris... 0.6408 0.2435 ... 0.0012 0.0069
James H F... 0.6979 0.2259 ... 0.0007 0.0009
<BLANKLINE>
[6 rows x 9 columns]
Doing this same operation with .ix
(which is removed from pandas 1.0, so don't do this) would look like this (in versions prior to 1.0):
>>> college.ix[10:16, "UGDS_WHITE":"UGDS_UNKN"]
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Birmingha... 0.7983 0.1102 ... 0.0000 0.0051
Chattahoo... 0.4661 0.4372 ... 0.0000 0.0139
Concordia... 0.0280 0.8758 ... 0.0466 0.0000
South Uni... 0.3046 0.6054 ... 0.0019 0.0326
Enterpris... 0.6408 0.2435 ... 0.0012 0.0069
James H F... 0.6979 0.2259 ... 0.0007 0.0009
<BLANKLINE>
[6 rows x 9 columns]
It is possible to achieve the same results by chaining .loc and .iloc
together, but chaining indexers is typically a bad idea. It can be slower, and it is also undetermined whether it returns a view or a copy (which is not problematic when viewing the data, but can be when updating data. You might see the infamous SettingWithCopyWarning
warning):
>>> college.iloc[10:16].loc[:, "UGDS_WHITE":"UGDS_UNKN"]
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Birmingha... 0.7983 0.1102 ... 0.0000 0.0051
Chattahoo... 0.4661 0.4372 ... 0.0000 0.0139
Concordia... 0.0280 0.8758 ... 0.0466 0.0000
South Uni... 0.3046 0.6054 ... 0.0019 0.0326
Enterpris... 0.6408 0.2435 ... 0.0012 0.0069
James H F... 0.6979 0.2259 ... 0.0007 0.0009
<BLANKLINE>
[6 rows x 9 columns]
The .loc
attribute typically selects data based on the exact string label of the index. However, it also allows you to select data based on the lexicographic order of the values in the index. Specifically, .loc
allows you to select all rows with an index lexicographically using slice notation. This only works if the index is sorted.
In this recipe, you will first sort the index and then use slice notation inside the .loc indexer to select all rows between two strings.
>>> college = pd.read_csv(
... "data/college.csv", index_col="INSTNM"
... )
Sp
and Su
:
>>> college.loc["Sp":"Su"]
Traceback (most recent call last):
...
ValueError: index must be monotonic increasing or decreasing
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
...
KeyError: 'Sp'
>>> college = college.sort_index()
>>> college.loc["Sp":"Su"]
CITY STABBR ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM ...
Spa Tech ... Ipswich MA ... 21500 6333
Spa Tech ... Plymouth MA ... 21500 6333
Spa Tech ... Westboro MA ... 21500 6333
Spa Tech ... Westbrook ME ... 21500 6333
Spalding ... Louisville KY ... 41700 25000
... ... ... ... ... ...
Studio Ac... Chandler AZ ... NaN 6333
Studio Je... New York NY ... PrivacyS... PrivacyS...
Stylemast... Longview WA ... 17000 13320
Styles an... Selmer TN ... PrivacyS... PrivacyS...
Styletren... Rock Hill SC ... PrivacyS... 9495.5
<BLANKLINE>
[201 rows x 26 columns]
The normal behavior of .loc
is to make selections of data based on the exact labels passed to it. It raises a KeyError
when these labels are not found in the index. However, one special exception to this behavior exists whenever the index is lexicographically sorted, and a slice is passed to it. Selection is now possible between the start and stop labels of the slice, even if those values are not found in the index.
With this recipe, it is easy to select colleges between two letters of the alphabet. For instance, to select all colleges that begin with the letters D through S, you would use college.loc['D':'T']
. Slicing like this is still closed and includes the last index, so this would technically return a college with the exact name T
.
This type of slicing also works when the index is sorted in the opposite direction. You can determine in which direction the index is sorted with the index attribute .is_monotonic_increasing
or .is_monotonic_decreasing
. Either of these must be True
in order for lexicographic slicing to work. For instance, the following code lexicographically sorts the index from Z to A:
>>> college = college.sort_index(ascending=False)
>>> college.index.is_monotonic_decreasing
True
>>> college.loc["E":"B"]
CITY ...
INSTNM ...
Dyersburg State Community College Dyersburg ...
Dutchess Community College Poughkeepsie ...
Dutchess BOCES-Practical Nursing Program Poughkeepsie ...
Durham Technical Community College Durham ...
Durham Beauty Academy Durham ...
... ... ...
Bacone College Muskogee ...
Babson College Wellesley ...
BJ's Beauty & Barber College Auburn ...
BIR Training Center Chicago ...
B M Spurr School of Practical Nursing Glen Dale ...