- Read in the college dataset, create a separate DataFrame with STABBR as the index, and check whether the index is sorted:
>>> college = pd.read_csv('data/college.csv')
>>> college2 = college.set_index('STABBR')
>>> college2.index.is_monotonic_increasing
False
- Sort college2 by its index and store the result as another object:
>>> college3 = college2.sort_index()
>>> college3.index.is_monotonic_increasing
True
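The sortedness check can be reproduced without the college file. The following is a minimal sketch on a small made-up DataFrame (the `state` and `enrollment` columns and their values are hypothetical, not taken from the dataset). It uses `is_monotonic_increasing`, which reports whether the index labels never decrease from one row to the next.

```python
import pandas as pd

# Hypothetical toy data standing in for the college dataset
df = pd.DataFrame({
    'state': ['TX', 'CA', 'TX', 'NY', 'CA'],
    'enrollment': [100, 200, 300, 400, 500],
})

# Labels keep their original, arbitrary order after set_index
by_state = df.set_index('state')
print(by_state.index.is_monotonic_increasing)

# sort_index returns a new object; duplicate labels are allowed,
# the check only requires a non-decreasing order
sorted_by_state = by_state.sort_index()
print(sorted_by_state.index.is_monotonic_increasing)

# Label selection works on both objects and returns the same rows
print(sorted_by_state.loc['TX'])
```

Note that `sort_index` does not modify `by_state`; the sorted copy has to be assigned to a new name, as in the step above.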
- Time the selection of the state of Texas (TX) from all three DataFrames:
>>> %timeit college[college['STABBR'] == 'TX']
1.43 ms ± 53.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit college2.loc['TX']
526 µs ± 6.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit college3.loc['TX']
183 µs ± 3.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
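Outside IPython, where the `%timeit` magic is not available, a similar comparison can be sketched with the standard-library `timeit` module. The data here is synthetic and the labels (`S00` to `S49`) are made up for illustration, so the absolute numbers will differ from those shown above and depend on the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical synthetic data: 100,000 rows spread over 50 made-up labels
rng = np.random.default_rng(0)
labels = np.array([f'S{i:02d}' for i in range(50)])
df = pd.DataFrame({
    'state': rng.choice(labels, size=100_000),
    'value': rng.random(100_000),
})

unsorted_idx = df.set_index('state')
sorted_idx = unsorted_idx.sort_index()

# The three selection strategies being compared
candidates = {
    'boolean mask': lambda: df[df['state'] == 'S07'],
    'unsorted index': lambda: unsorted_idx.loc['S07'],
    'sorted index': lambda: sorted_idx.loc['S07'],
}

for name, func in candidates.items():
    # Best of 5 repeats of 20 calls each, reported per call
    best = min(timeit.repeat(func, number=20, repeat=5)) / 20
    print(f'{name}: {best * 1e6:.0f} µs per call')
```

Taking the minimum over several repeats is a common way to reduce noise from other processes; `%timeit` instead reports a mean and standard deviation.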
- The sorted index is nearly eight times faster than boolean selection (183 µs versus 1.43 ms) and roughly three times faster than the unsorted index (526 µs). Let's now turn to unique indexes, using the institution name as the index:
>>> college_unique = college.set_index('INSTNM')
>>> college_unique.index.is_unique
True
- Let's select Stanford University with boolean indexing:
>>> college[college['INSTNM'] == 'Stanford University']
- Let's select Stanford University with index selection:
>>> college_unique.loc['Stanford University']
CITY                  Stanford
STABBR                      CA
HBCU                         0
                        ...
UG25ABV                 0.0401
MD_EARN_WNE_P10          86000
GRAD_DEBT_MDN_SUPP       12782
Name: Stanford University, dtype: object
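- Both approaches return the same data, but as different objects: boolean indexing returns a one-row DataFrame, while index selection on a unique index returns a Series. Let's time each approach: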
>>> %timeit college[college['INSTNM'] == 'Stanford University']
1.3 ms ± 56.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit college_unique.loc['Stanford University']
157 µs ± 682 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
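The difference in returned objects can be checked directly. The sketch below assumes a tiny made-up DataFrame (the institution names, cities, and enrollment figures are hypothetical) and shows one way to reduce the one-row DataFrame to a Series so the two results can be compared.

```python
import pandas as pd

# Hypothetical toy data; names and values are made up for illustration
df = pd.DataFrame({
    'name': ['Alpha College', 'Beta University', 'Gamma Institute'],
    'city': ['Austin', 'Boston', 'Chicago'],
    'enrollment': [1200, 8500, 430],
})
by_name = df.set_index('name')

# Boolean indexing always returns a DataFrame, even for a single match
row_df = df[df['name'] == 'Beta University']

# .loc with a single label on a unique index returns a Series
row_s = by_name.loc['Beta University']

print(type(row_df).__name__, type(row_s).__name__)

# Reduce the one-row DataFrame to a Series and compare field by field
as_series = row_df.set_index('name').iloc[0]
print(as_series.to_dict() == row_s.to_dict())
```

If the index were not unique, `.loc` with a repeated label would return a DataFrame containing every matching row rather than a Series.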