6

Selecting Subsets of Data

Introduction

Every dimension of data in a Series or DataFrame is labeled in the Index object. It is this Index that separates pandas data structures from NumPy's n-dimensional array. Indexes provide meaningful labels for each row and column of data, and pandas users can select data through the use of these labels. Additionally, pandas allows its users to select data according to the position of the rows and columns. This dual selection capability, one using names and the other using the position, makes for powerful yet confusing syntax to select subsets of data.

Selecting data by label or position is not unique to pandas. Python dictionaries and lists are built-in data structures that select their data in exactly one of these ways. Both dictionaries and lists have precise instructions and limited use cases for what you can index with. A dictionary's key (its label) must be an immutable object, such as a string, integer, or tuple. Lists must either use integers (the position) or slice objects for selection. Dictionaries can only select one object at a time by passing the key to the indexing operator. In this way, pandas is combining the ability to select data using integers, as with lists, and labels, as with dictionaries.

Selecting Series data

Series and DataFrames are complex data containers that have multiple attributes that use an index operation to select data in different ways. In addition to the index operator itself, the .iloc and .loc attributes are available and use the index operator in their own unique ways.

Series and DataFrames allow selection by position (like Python lists) and by label (like Python dictionaries). When we index off of the .iloc attribute, pandas selects only by position and works similarly to Python lists. The .loc attribute selects only by index label, which is similar to how Python dictionaries work.

The .loc and .iloc attributes are available on both Series and DataFrames. This recipe shows how to select Series data by position with .iloc and by label with .loc. These indexers accept scalar values, lists, and slices.

The terminology can get confusing. An index operation is when you put brackets, [], following a variable. For instance, given a Series s, you can select data in the following ways: s[item] and s.loc[item]. The first performs the index operation directly on the Series. The second performs the index operation on the .loc attribute.

How to do it…

  1. Read in the college dataset with the institution name as the index, and select a single column as a Series using an index operation:
    >>> import pandas as pd
    >>> import numpy as np
    >>> college = pd.read_csv(
    ...     "data/college.csv", index_col="INSTNM"
    ... )
    >>> city = college["CITY"]
    >>> city
    INSTNM
    Alabama A & M University                                       Normal
    University of Alabama at Birmingham                        Birmingham
    Amridge University                                         Montgomery
    University of Alabama in Huntsville                        Huntsville
    Alabama State University                                   Montgomery
    ...
    SAE Institute of Technology  San Francisco                 Emeryville
    Rasmussen College - Overland Park                         Overland...
    National Personal Training Institute of Cleveland         Highland...
    Bay Area Medical Academy - San Jose Satellite Location       San Jose
    Excel Learning Center-San Antonio South                   San Antonio
    Name: CITY, Length: 7535, dtype: object
    
  2. Pull out a scalar value from the Series directly:
    >>> city["Alabama A & M University"]
    'Normal'
    
  3. Pull out a scalar value using the .loc attribute by name:
    >>> city.loc["Alabama A & M University"]
    'Normal'
    
  4. Pull out a scalar value using the .iloc attribute by position:
    >>> city.iloc[0]
    'Normal'
    
  5. Pull out several values by indexing. Note that if we pass in a list to the index operation, pandas will now return a Series instead of a scalar:
    >>> city[
    ...     [
    ...         "Alabama A & M University",
    ...         "Alabama State University",
    ...     ]
    ... ]
    INSTNM
    Alabama A & M University        Normal
    Alabama State University    Montgomery
    Name: CITY, dtype: object
    
  6. Repeat the above using .loc:
    >>> city.loc[
    ...     [
    ...         "Alabama A & M University",
    ...         "Alabama State University",
    ...     ]
    ... ]
    INSTNM
    Alabama A & M University        Normal
    Alabama State University    Montgomery
    Name: CITY, dtype: object
    
  7. Repeat the above using .iloc:
    >>> city.iloc[[0, 4]]
    INSTNM
    Alabama A & M University        Normal
    Alabama State University    Montgomery
    Name: CITY, dtype: object
    
  8. Use a slice to pull out many values:
    >>> city[
    ...     "Alabama A & M University":"Alabama State University"
    ... ]
    INSTNM
    Alabama A & M University                   Normal
    University of Alabama at Birmingham    Birmingham
    Amridge University                     Montgomery
    University of Alabama in Huntsville    Huntsville
    Alabama State University               Montgomery
    Name: CITY, dtype: object
    
  9. Use a slice to pull out many values by position:
    >>> city[0:5]
    INSTNM
    Alabama A & M University                   Normal
    University of Alabama at Birmingham    Birmingham
    Amridge University                     Montgomery
    University of Alabama in Huntsville    Huntsville
    Alabama State University               Montgomery
    Name: CITY, dtype: object
    
  10. Use a slice to pull out many values with .loc:
    >>> city.loc[
    ...     "Alabama A & M University":"Alabama State University"
    ... ]
    INSTNM
    Alabama A & M University                   Normal
    University of Alabama at Birmingham    Birmingham
    Amridge University                     Montgomery
    University of Alabama in Huntsville    Huntsville
    Alabama State University               Montgomery
    Name: CITY, dtype: object
    
  11. Use a slice to pull out many values with .iloc:
    >>> city.iloc[0:5]
    INSTNM
    Alabama A & M University                   Normal
    University of Alabama at Birmingham    Birmingham
    Amridge University                     Montgomery
    University of Alabama in Huntsville    Huntsville
    Alabama State University               Montgomery
    Name: CITY, dtype: object
    
  12. Use a Boolean array to pull out certain values:
    >>> alabama_mask = city.isin(["Birmingham", "Montgomery"])
    >>> city[alabama_mask]
    INSTNM
    University of Alabama at Birmingham    Birmingham
    Amridge University                     Montgomery
    Alabama State University               Montgomery
    Auburn University at Montgomery        Montgomery
    Birmingham Southern College            Birmingham
                                              ...     
    Fortis Institute-Birmingham            Birmingham
    Hair Academy                           Montgomery
    Brown Mackie College-Birmingham        Birmingham
    Nunation School of Cosmetology         Birmingham
    Troy University-Montgomery Campus      Montgomery
    Name: CITY, Length: 26, dtype: object
    

How it works…

If you have a Series, you can pull out the data using index operations. Depending on what you index with, you might get different types as output. If you index with a scalar on a Series, you will get back a scalar value. If you index with a list or a slice, you will get back a Series.

Looking at the examples, it appears that indexing directly off of the Series provides the best of both worlds: you can index by position or label. I would caution against using it at all. Remember, the Zen of Python states, "Explicit is better than implicit." Both .iloc and .loc are explicit, but indexing directly off of the Series is not explicit; it requires us to think about what we are indexing with and what type of index we have.

Consider this toy Series that uses integer values for the index:

>>> s = pd.Series([10, 20, 35, 28], index=[5, 2, 3, 1])
>>> s
5    10
2    20
3    35
1    28
dtype: int64
>>> s[0:4]
5    10
2    20
3    35
1    28
dtype: int64
>>> s[5]
10
>>> s[1]
28

When you index with a slice directly on a Series, it uses position, but otherwise it goes by label. This is confusing to the future you and future readers of your code. Remember, optimizing for readability is better than optimizing for easy-to-write code. The takeaway is to use the .iloc and .loc indexers.

Remember that when you slice by position, pandas uses the half-open interval. This interval is probably something you learned back in high school and promptly forgot. The half-open interval includes the first index, but not the end index. However, when you slice by label, pandas uses the closed interval and includes both the start and end index. This behavior is inconsistent with Python in general, but is practical for labels.

There's more…

All of the examples in this section could be performed directly on the original DataFrame by using .loc or .iloc. We can pass in a tuple (without parentheses) of row and column labels or positions, respectively:

>>> college.loc["Alabama A & M University", "CITY"]
'Normal'
>>> college.iloc[0, 0]
'Normal'
>>> college.loc[
...     [
...         "Alabama A & M University",
...         "Alabama State University",
...     ],
...     "CITY",
... ]
INSTNM
Alabama A & M University        Normal
Alabama State University    Montgomery
Name: CITY, dtype: object
>>> college.iloc[[0, 4], 0]
INSTNM
Alabama A & M University        Normal
Alabama State University    Montgomery
Name: CITY, dtype: object
>>> college.loc[
...     "Alabama A & M University":"Alabama State University",
...     "CITY",
... ]
INSTNM
Alabama A & M University                   Normal
University of Alabama at Birmingham    Birmingham
Amridge University                     Montgomery
University of Alabama in Huntsville    Huntsville
Alabama State University               Montgomery
Name: CITY, dtype: object
>>> college.iloc[0:5, 0]
INSTNM
Alabama A & M University                   Normal
University of Alabama at Birmingham    Birmingham
Amridge University                     Montgomery
University of Alabama in Huntsville    Huntsville
Alabama State University               Montgomery
Name: CITY, dtype: object

Care needs to be taken when using slicing off of .loc. If the start index appears after the stop index, then an empty Series is returned without an exception:

>>> city.loc[
...     "Reid State Technical College":"Alabama State University"
... ]
Series([], Name: CITY, dtype: object)

Selecting DataFrame rows

The most explicit and preferred way to select DataFrame rows is with .iloc and .loc. They are both capable of selecting by rows or by rows and columns.

This recipe shows you how to select rows from a DataFrame using the .iloc and .loc indexers:

  1. Read in the college dataset, and set the index as the institution name:
    >>> college = pd.read_csv(
    ...     "data/college.csv", index_col="INSTNM"
    ... )
    >>> college.sample(5, random_state=42)
                         CITY STABBR  ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
    INSTNM                            ...
    Career Po...  San Antonio     TX  ...        20700            14977
    Ner Israe...    Baltimore     MD  ...  PrivacyS...      PrivacyS...
    Reflectio...      Decatur     IL  ...          NaN      PrivacyS...
    Capital A...  Baton Rouge     LA  ...        26400      PrivacyS...
    West Virg...   Montgomery     WV  ...        43400            23969
    <BLANKLINE>
    [5 rows x 26 columns]
    
  2. To select an entire row at that position, pass an integer to .iloc:
    >>> college.iloc[60]
    CITY                  Anchorage
    STABBR                       AK
    HBCU                          0
    MENONLY                       0
    WOMENONLY                     0
                            ...
    PCTPELL                  0.2385
    PCTFLOAN                 0.2647
    UG25ABV                  0.4386
    MD_EARN_WNE_P10           42500
    GRAD_DEBT_MDN_SUPP      19449.5
    Name: University of Alaska Anchorage, Length: 26, dtype: object
    

    Because Python is zero-based, this is actually the 61st row. Note that pandas represents this row as a Series.

  3. To get the same row as the preceding step, pass the index label to .loc:
    >>> college.loc["University of Alaska Anchorage"]
    CITY                  Anchorage
    STABBR                       AK
    HBCU                          0
    MENONLY                       0
    WOMENONLY                     0
                            ...
    PCTPELL                  0.2385
    PCTFLOAN                 0.2647
    UG25ABV                  0.4386
    MD_EARN_WNE_P10           42500
    GRAD_DEBT_MDN_SUPP      19449.5
    Name: University of Alaska Anchorage, Length: 26, dtype: object
    
  4. To select a disjointed set of rows as a DataFrame, pass a list of integers to .iloc:
    >>> college.iloc[[60, 99, 3]]
                        CITY STABBR  ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
    INSTNM                           ...
    Universit...   Anchorage     AK  ...        42500          19449.5
    Internati...       Tempe     AZ  ...        22200            10556
    Universit...  Huntsville     AL  ...        45500            24097
    <BLANKLINE>
    [3 rows x 26 columns]
    

    Because we passed in a list of row positions, this returns a DataFrame.

  5. The same DataFrame from step 4 may be reproduced with .loc by passing it a list of the institution names:
    >>> labels = [
    ...     "University of Alaska Anchorage",
    ...     "International Academy of Hair Design",
    ...     "University of Alabama in Huntsville",
    ... ]
    >>> college.loc[labels]
                        CITY STABBR  ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
    INSTNM                           ...
    Universit...   Anchorage     AK  ...        42500          19449.5
    Internati...       Tempe     AZ  ...        22200            10556
    Universit...  Huntsville     AL  ...        45500            24097
    <BLANKLINE>
    [3 rows x 26 columns]
    
  6. Use slice notation with .iloc to select contiguous rows of the data:
    >>> college.iloc[99:102]
                     CITY STABBR  ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
    INSTNM                        ...
    Internati...    Tempe     AZ  ...        22200            10556
    GateWay C...  Phoenix     AZ  ...        29800             7283
    Mesa Comm...     Mesa     AZ  ...        35200             8000
    <BLANKLINE>
    [3 rows x 26 columns]
    
  7. Slice notation also works with .loc and is a closed interval (it includes both the start label and the stop label):
    >>> start = "International Academy of Hair Design"
    >>> stop = "Mesa Community College"
    >>> college.loc[start:stop]
                     CITY STABBR  ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
    INSTNM                        ...
    Internati...    Tempe     AZ  ...        22200            10556
    GateWay C...  Phoenix     AZ  ...        29800             7283
    Mesa Comm...     Mesa     AZ  ...        35200             8000
    <BLANKLINE>
    [3 rows x 26 columns]
    

How it works…

When we pass a scalar value, a list of scalars, or a slice to .iloc or .loc, this causes pandas to scan the index for the appropriate rows and return them. If a single scalar value is passed, a Series is returned. If a list or slice is passed, then a DataFrame is returned.

There's more…

In step 5, the list of index labels can be selected directly from the DataFrame returned in step 4 without the need for copying and pasting:

>>> college.iloc[[60, 99, 3]].index.tolist()
['University of Alaska Anchorage', 'International Academy of Hair Design', 'University of Alabama in Huntsville']

Selecting DataFrame rows and columns simultaneously

There are many ways to select rows and columns. The easiest method to select one or more columns from a DataFrame is to index off of the DataFrame. However, this approach has a limitation. Indexing directly on a DataFrame does not allow you to select both rows and columns simultaneously. To select rows and columns, you will need to pass both valid row and column selections separated by a comma to either .iloc or .loc.

The generic form to select rows and columns will look like the following code:

df.iloc[row_idxs, column_idxs]
df.loc[row_names, column_names]

Where row_idxs and column_idxs can be scalar integers, lists of integers, or integer slices. While row_names and column_names can be the scalar names, lists of names, or names slices, row_names can also be a Boolean array.

In this recipe, each step shows a simultaneous row and column selection using both .iloc and .loc.

How to do it…

  1. Read in the college dataset, and set the index as the institution name. Select the first three rows and the first four columns with slice notation:
    >>> college = pd.read_csv(
    ...     "data/college.csv", index_col="INSTNM"
    ... )
    >>> college.iloc[:3, :4]
                        CITY STABBR  HBCU  MENONLY
    INSTNM                                        
    Alabama A...      Normal     AL   1.0      0.0
    Universit...  Birmingham     AL   0.0      0.0
    Amridge U...  Montgomery     AL   0.0      0.0
    >>> college.loc[:"Amridge University", :"MENONLY"]
                        CITY STABBR  HBCU  MENONLY
    INSTNM                                        
    Alabama A...      Normal     AL   1.0      0.0
    Universit...  Birmingham     AL   0.0      0.0
    Amridge U...  Montgomery     AL   0.0      0.0
    
  2. Select all the rows of two different columns:
    >>> college.iloc[:, [4, 6]].head()
                                         WOMENONLY  SATVRMID
    INSTNM
    Alabama A & M University                   0.0     424.0
    University of Alabama at Birmingham        0.0     570.0
    Amridge University                         0.0       NaN
    University of Alabama in Huntsville        0.0     595.0
    Alabama State University                   0.0     425.0
    >>> college.loc[:, ["WOMENONLY", "SATVRMID"]].head()
                                         WOMENONLY  SATVRMID
    INSTNM
    Alabama A & M University                   0.0     424.0
    University of Alabama at Birmingham        0.0     570.0
    Amridge University                         0.0       NaN
    University of Alabama in Huntsville        0.0     595.0
    Alabama State University                   0.0     425.0
    
  3. Select disjointed rows and columns:
    >>> college.iloc[[100, 200], [7, 15]]
                                           SATMTMID  UGDS_NHPI
    INSTNM
    GateWay Community College                   NaN     0.0029
    American Baptist Seminary of the West       NaN        NaN
    >>> rows = [
    ...     "GateWay Community College",
    ...     "American Baptist Seminary of the West",
    ... ]
    >>> columns = ["SATMTMID", "UGDS_NHPI"]
    >>> college.loc[rows, columns]
                                           SATMTMID  UGDS_NHPI
    INSTNM
    GateWay Community College                   NaN     0.0029
    American Baptist Seminary of the West       NaN        NaN
    
  4. Select a single scalar value:
    >>> college.iloc[5, -4]
    0.401
    >>> college.loc["The University of Alabama", "PCTFLOAN"]
    0.401
    
  5. Slice the rows and select a single column:
    >>> college.iloc[90:80:-2, 5]
    INSTNM                              
    Empire Beauty School-Flagstaff     0
    Charles of Italy Beauty College    0
    Central Arizona College            0
    University of Arizona              0
    Arizona State University-Tempe     0
    Name: RELAFFIL, dtype: int64
    >>> start = "Empire Beauty School-Flagstaff"
    >>> stop = "Arizona State University-Tempe"
    >>> college.loc[start:stop:-2, "RELAFFIL"]
    INSTNM                              
    Empire Beauty School-Flagstaff     0
    Charles of Italy Beauty College    0
    Central Arizona College            0
    University of Arizona              0
    Arizona State University-Tempe     0
    Name: RELAFFIL, dtype: int64
    

How it works…

One of the keys to selecting rows and columns at the same time is to understand the use of the comma in the brackets. The selection to the left of the comma always selects rows based on the row index. The selection to the right of the comma always selects columns based on the column index.

It is not necessary to make a selection for both rows and columns simultaneously. Step 2 shows how to select all the rows and a subset of columns. The colon (:) represents a slice object that returns all the values for that dimension.

There's more…

To select only rows (along with all the columns), it is not necessary to use a colon following a comma. The default behavior is to select all the columns if there is no comma present. The previous recipe selected rows in exactly this manner. You can, however, use a colon to represent a slice of all the columns. The following lines of code are equivalent:

college.iloc[:10]
college.iloc[:10, :]

Selecting data with both integers and labels

Sometimes, you want the functionality of both .iloc and .loc, to select data by both position and label. In earlier versions of pandas, .ix was available to select data by both position and label. While this conveniently worked for those specific situations, it was ambiguous by nature and was a source of confusion for many pandas users. The .ix indexer has subsequently been deprecated and thus should be avoided.

Before the .ix deprecation, it was possible to select the first five rows and the columns of the college dataset from UGDS_WHITE through UGDS_UNKN using college.ix[:5, 'UGDS_WHITE':'UGDS_UNKN']. This is now impossible to do directly using .loc or .iloc. The following recipe shows how to find the integer location of the columns and then use .iloc to complete the selection.

How to do it…

  1. Read in the college dataset and assign the institution name (INSTNM) as the index:
    >>> college = pd.read_csv(
    ...     "data/college.csv", index_col="INSTNM"
    ... )
    
  2. Use the Index method .get_loc to find the integer position of the desired columns:
    >>> col_start = college.columns.get_loc("UGDS_WHITE")
    >>> col_end = college.columns.get_loc("UGDS_UNKN") + 1
    >>> col_start, col_end
    (10, 19)
    
  3. Use col_start and col_end to select columns by position using .iloc:
    >>> college.iloc[:5, col_start:col_end]
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...      0.0333      0.9353  ...    0.0059     0.0138
    Universit...      0.5922      0.2600  ...    0.0179     0.0100
    Amridge U...      0.2990      0.4192  ...    0.0000     0.2715
    Universit...      0.6988      0.1255  ...    0.0332     0.0350
    Alabama S...      0.0158      0.9208  ...    0.0243     0.0137
    <BLANKLINE>
    [5 rows x 9 columns]
    

How it works…

Step 2 first retrieves the column index through the .columns attribute. Indexes have a .get_loc method, which accepts an index label and returns its integer location. We find both the start and end integer locations for the columns that we wish to slice. We add one because slicing with .iloc uses the half-open interval and is exclusive of the last item. Step 3 uses slice notation with the row and column positions.

There's more…

We can do a very similar operation to use positions to get the labels for .loc to work. The following shows how to select the 10th through 15th (inclusive) rows, along with columns UGDS_WHITE through UGDS_UNKN:

>>> row_start = college.index[10]
>>> row_end = college.index[15]
>>> college.loc[row_start:row_end, "UGDS_WHITE":"UGDS_UNKN"]
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Birmingha...      0.7983      0.1102  ...    0.0000     0.0051
Chattahoo...      0.4661      0.4372  ...    0.0000     0.0139
Concordia...      0.0280      0.8758  ...    0.0466     0.0000
South Uni...      0.3046      0.6054  ...    0.0019     0.0326
Enterpris...      0.6408      0.2435  ...    0.0012     0.0069
James H F...      0.6979      0.2259  ...    0.0007     0.0009
<BLANKLINE>
[6 rows x 9 columns]

Doing this same operation with .ix (which is removed from pandas 1.0, so don't do this) would look like this (in versions prior to 1.0):

>>> college.ix[10:16, "UGDS_WHITE":"UGDS_UNKN"]
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Birmingha...      0.7983      0.1102  ...    0.0000     0.0051
Chattahoo...      0.4661      0.4372  ...    0.0000     0.0139
Concordia...      0.0280      0.8758  ...    0.0466     0.0000
South Uni...      0.3046      0.6054  ...    0.0019     0.0326
Enterpris...      0.6408      0.2435  ...    0.0012     0.0069
James H F...      0.6979      0.2259  ...    0.0007     0.0009
<BLANKLINE>
[6 rows x 9 columns]

It is possible to achieve the same results by chaining .loc and .iloc together, but chaining indexers is typically a bad idea. It can be slower, and it is also undetermined whether it returns a view or a copy (which is not problematic when viewing the data, but can be when updating data. You might see the infamous SettingWithCopyWarning warning):

>>> college.iloc[10:16].loc[:, "UGDS_WHITE":"UGDS_UNKN"]
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Birmingha...      0.7983      0.1102  ...    0.0000     0.0051
Chattahoo...      0.4661      0.4372  ...    0.0000     0.0139
Concordia...      0.0280      0.8758  ...    0.0466     0.0000
South Uni...      0.3046      0.6054  ...    0.0019     0.0326
Enterpris...      0.6408      0.2435  ...    0.0012     0.0069
James H F...      0.6979      0.2259  ...    0.0007     0.0009
<BLANKLINE>
[6 rows x 9 columns]

Slicing lexicographically

The .loc attribute typically selects data based on the exact string label of the index. However, it also allows you to select data based on the lexicographic order of the values in the index. Specifically, .loc allows you to select all rows with an index lexicographically using slice notation. This only works if the index is sorted.

In this recipe, you will first sort the index and then use slice notation inside the .loc indexer to select all rows between two strings.

How to do it…

  1. Read in the college dataset, and set the institution name as the index:
    >>> college = pd.read_csv(
    ...     "data/college.csv", index_col="INSTNM"
    ... )
    
  2. Attempt to select all colleges with names lexicographically between Sp and Su:
    >>> college.loc["Sp":"Su"]
    Traceback (most recent call last):
      ...
    ValueError: index must be monotonic increasing or decreasing
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
      ...
    KeyError: 'Sp'
    
  3. As the index is not sorted, the preceding command fails. Let's go ahead and sort the index:
    >>> college = college.sort_index()
    
  4. Now, let's rerun the same command from step 2:
    >>> college.loc["Sp":"Su"]
                        CITY STABBR  ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
    INSTNM                           ...
    Spa Tech ...     Ipswich     MA  ...        21500             6333
    Spa Tech ...    Plymouth     MA  ...        21500             6333
    Spa Tech ...    Westboro     MA  ...        21500             6333
    Spa Tech ...   Westbrook     ME  ...        21500             6333
    Spalding ...  Louisville     KY  ...        41700            25000
    ...                  ...    ...  ...          ...              ...
    Studio Ac...    Chandler     AZ  ...          NaN             6333
    Studio Je...    New York     NY  ...  PrivacyS...      PrivacyS...
    Stylemast...    Longview     WA  ...        17000            13320
    Styles an...      Selmer     TN  ...  PrivacyS...      PrivacyS...
    Styletren...   Rock Hill     SC  ...  PrivacyS...           9495.5
    <BLANKLINE>
    [201 rows x 26 columns]
    

How it works…

The normal behavior of .loc is to make selections of data based on the exact labels passed to it. It raises a KeyError when these labels are not found in the index. However, one special exception to this behavior exists whenever the index is lexicographically sorted, and a slice is passed to it. Selection is now possible between the start and stop labels of the slice, even if those values are not found in the index.

There's more…

With this recipe, it is easy to select colleges between two letters of the alphabet. For instance, to select all colleges that begin with the letters D through S, you would use college.loc['D':'T']. Slicing like this is still closed and includes the last index, so this would technically return a college with the exact name T.

This type of slicing also works when the index is sorted in the opposite direction. You can determine in which direction the index is sorted with the index attribute .is_monotonic_increasing or .is_monotonic_decreasing. Either of these must be True in order for lexicographic slicing to work. For instance, the following code lexicographically sorts the index from Z to A:

>>> college = college.sort_index(ascending=False)
>>> college.index.is_monotonic_decreasing
True
>>> college.loc["E":"B"]
                                                  CITY  ...
INSTNM                                                  ...
Dyersburg State Community College            Dyersburg  ...
Dutchess Community College                Poughkeepsie  ...
Dutchess BOCES-Practical Nursing Program  Poughkeepsie  ...
Durham Technical Community College              Durham  ...
Durham Beauty Academy                           Durham  ...
...                                                ...  ...
Bacone College                                Muskogee  ...
Babson College                               Wellesley  ...
BJ's Beauty & Barber College                    Auburn  ...
BIR Training Center                            Chicago  ...
B M Spurr School of Practical Nursing        Glen Dale  ...
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset