The importance of indexes

Pandas indexes allow efficient lookup of values. If indexes did not exist, a linear search across all of our data would be required. Indexes create optimized shortcuts to specific data items using a direct lookup instead of a search process.

To begin examining the value of indexes we will use the following DataFrame of 10000 random numbers:
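The listing itself is not reproduced here, so the following is a minimal sketch of such a DataFrame. The column names `foo` and `key`, the key range of 100 through 10099, and the seed are assumptions chosen to agree with the lookup of key 10099 described in the text.

```python
import numpy as np
import pandas as pd

# Assumed layout: 10,000 random floats plus one integer key per row.
# Keys run from 100 to 10099, so key 10099 lands in the last row.
rng = np.random.default_rng(123456)  # arbitrary seed, for repeatability only
df = pd.DataFrame({
    "foo": rng.random(10000),
    "key": range(100, 10100),
})
print(df.tail())
```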

Suppose we want to look up the value of the random number where key==10099 (I explicitly picked this value as it is the last row in the DataFrame). We can do this using a Boolean selection.
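A sketch of that Boolean selection might look like the following. The DataFrame is rebuilt with the assumed `foo` and `key` columns so that the example runs on its own.

```python
import numpy as np
import pandas as pd

# Assumed sample data: 10,000 random floats keyed 100..10099.
rng = np.random.default_rng(123456)
df = pd.DataFrame({"foo": rng.random(10000), "key": range(100, 10100)})

# Boolean selection: compare every key to 10099 to build a True/False mask,
# then keep only the rows where the mask is True.
result = df[df.key == 10099]
print(result)
```

Note that the comparison touches all 10,000 rows even though only one of them matches.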

Conceptually, this is simple. But what if we want to do this repeatedly? This can be simulated in IPython using the %timeit magic command. The following code performs the lookup repeatedly and reports on the performance.
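In an IPython session the measurement is a one-liner such as `%timeit df[df.key == 10099]`. The sketch below is a plain-Python approximation that uses the standard `timeit` module instead, with three sets of 1,000 lookups. The sample data is the assumed one described above, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed sample data: 10,000 random floats keyed 100..10099.
rng = np.random.default_rng(123456)
df = pd.DataFrame({"foo": rng.random(10000), "key": range(100, 10100)})

# Run three sets of 1,000 Boolean-selection lookups and keep the fastest set.
times = timeit.repeat(lambda: df[df.key == 10099], repeat=3, number=1000)
best_per_loop = min(times) / 1000
print(f"1000 loops, best of 3: {best_per_loop:.6f} s per loop")
```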

This result states that 1,000 executions were performed three times, and that the fastest of those three runs took 0.00535 seconds per loop on average (a total of 5.35 seconds for that one set of 1,000 loops).

Now let's try this using an index to help us look up the values. The following code sets the index of this DataFrame to match the values of the key column.
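One way to do this is with `set_index()`, sketched below on the assumed sample data. The name `df_with_index` is illustrative only.

```python
import numpy as np
import pandas as pd

# Assumed sample data: 10,000 random floats keyed 100..10099.
rng = np.random.default_rng(123456)
df = pd.DataFrame({"foo": rng.random(10000), "key": range(100, 10100)})

# Move the key column into the index. set_index() returns a new DataFrame
# and leaves df unchanged.
df_with_index = df.set_index("key")
print(df_with_index.head())
```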

And now it is possible to look up this value using .loc[].
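A sketch of the label-based lookup, again on the assumed sample data:

```python
import numpy as np
import pandas as pd

# Assumed sample data, indexed by key (100..10099).
rng = np.random.default_rng(123456)
df = pd.DataFrame({"foo": rng.random(10000), "key": range(100, 10100)})
df_with_index = df.set_index("key")

# .loc[] looks the row up by its index label rather than scanning a column.
# Because the label is unique, the result is a Series holding that row.
value = df_with_index.loc[10099]
print(value)
```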

That was just one lookup. Let's time it with %timeit.
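In IPython this would be `%timeit df_with_index.loc[10099]`. The plain-Python sketch below uses the standard `timeit` module with three sets of 1,000 lookups on the assumed sample data. The absolute numbers depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed sample data, indexed by key (100..10099).
rng = np.random.default_rng(123456)
df = pd.DataFrame({"foo": rng.random(10000), "key": range(100, 10100)})
df_with_index = df.set_index("key")

# Run three sets of 1,000 index lookups and keep the fastest set.
times = timeit.repeat(lambda: df_with_index.loc[10099], repeat=3, number=1000)
best_per_loop = min(times) / 1000
print(f"1000 loops, best of 3: {best_per_loop:.6f} s per loop")
```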

The lookup using the index is roughly five times faster. Because of this greater performance, it is normally a best practice to perform lookup by index whenever possible. The downside of using an index is that it can take time to construct and also consumes more memory.

Many times, you will inherently know what your indexes should be, and you can just create them upfront and get going with exploration. Other times, it will take some exploration first to determine the best index. And sometimes you will not have enough data or the proper fields to create a proper index. In these cases, you may need to use a partial index that returns multiple semi-ambiguous results, and then perform Boolean selection on that set to get to the desired result.
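As an illustration of the partial-index case, the sketch below uses a small hypothetical `sales` table (the table, its column names, and its values are invented for this example). The `store` index is not unique, so the index lookup returns several candidate rows and Boolean selection finishes the job.

```python
import pandas as pd

# Hypothetical data: no single column identifies a row on its own.
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "month": [1, 2, 1, 2],
    "units": [10, 12, 7, 9],
}).set_index("store")

# The non-unique index narrows the search to the rows for store "A"...
candidates = sales.loc["A"]

# ...and Boolean selection picks the wanted row out of that smaller set.
row = candidates[candidates.month == 2]
print(row)
```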

It is a best practice when performing exploratory data analysis to first load the data and explore it using queries / Boolean selection. Then create an index if your data naturally supports one, or if you do require the increased speed.