Credit: Alex Martelli, David Ascher
Use an auxiliary dictionary to do preprocessing of the data, thereby reducing the need for iteration over mostly irrelevant data. For instance, if xs and ys are the two data sets, with matching keys as the first item in each entry, so that x[0] == y[0] defines an “interesting” pair:
auxdict = {}
for y in ys:
    auxdict.setdefault(y[0], []).append(y)
result = [process(x, y) for x in xs for y in auxdict.get(x[0], [])]

Using auxdict.get(x[0], []) rather than auxdict[x[0]] means entries of xs with no match in ys are simply skipped instead of raising KeyError.
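As a concrete, runnable sketch of the pattern (the data sets and the process function here are invented purely for illustration):

```python
# Hypothetical data: the first item of each entry is the matching key.
xs = [("a", 1), ("b", 2), ("c", 3)]
ys = [("a", 10), ("a", 11), ("c", 30)]

def process(x, y):
    # Trivial stand-in for the real per-pair computation.
    return (x[0], x[1] + y[1])

# Preprocess ys once into a dictionary keyed by the matching key.
auxdict = {}
for y in ys:
    auxdict.setdefault(y[0], []).append(y)

# Iterate over only the interesting pairs; .get skips unmatched keys ("b").
result = [process(x, y)
          for x in xs
          for y in auxdict.get(x[0], [])]
print(result)  # [('a', 11), ('a', 12), ('c', 33)]
```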
To make the problem more concrete, let’s look at an example. Say you need to analyze data about visitors to a web site who have purchased something online. This means you need to perform some computation based on data from two log files—one from the web server and one from the credit-card processing framework. Each log file is huge, but only a small number of the web server log entries correspond to credit-card log entries. Let’s assume that cclog is a sequence of records, one for each credit-card transaction, and that weblog is a sequence of records describing each web site hit. Let’s further assume that each record uses the attribute ipaddress to refer to the IP address involved in each event. In this case, a reasonable first approach would be to do something like:
results = [process(webhit, ccinfo)
           for webhit in weblog
           for ccinfo in cclog
           if ccinfo.ipaddress == webhit.ipaddress]
The problem with this approach is that the list comprehension’s nested loops iterate over each entry in the web server log and, for each such entry, also iterate over each entry in the credit-card log. This means the algorithm has O(M x N) performance characteristics—in other words, the time it takes to compute will be proportional to the product of the sizes of the two logs. As the web site becomes more popular and as data accumulates, the performance of the algorithm will rapidly deteriorate.
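To make the naive version concrete and runnable, here is a minimal sketch; the record types, sample data, and process function are all invented for illustration:

```python
from collections import namedtuple

# Hypothetical record types carrying the ipaddress attribute the text assumes.
WebHit = namedtuple("WebHit", "ipaddress url")
CCInfo = namedtuple("CCInfo", "ipaddress amount")

weblog = [WebHit("1.2.3.4", "/home"),
          WebHit("5.6.7.8", "/buy"),
          WebHit("9.9.9.9", "/about")]
cclog = [CCInfo("5.6.7.8", 42.0)]

def process(webhit, ccinfo):
    # Trivial stand-in for the real analysis.
    return (webhit.url, ccinfo.amount)

# O(M x N): every (webhit, ccinfo) combination is examined.
results = [process(webhit, ccinfo)
           for webhit in weblog
           for ccinfo in cclog
           if ccinfo.ipaddress == webhit.ipaddress]
print(results)  # [('/buy', 42.0)]
```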
The key to optimizing this algorithm is to recognize that the computation (process) needs to happen for only a small subset of all of the possible combinations of the two variables (in this case, webhit and ccinfo). If we could search for only the right pairs, we might be able to speed up the process.
As Tim Peters says in the introduction to this chapter, if you need to search, use an auxiliary dictionary. The solution described earlier, rewritten to use our variables, yields:
ipdict = {}
for webhit in weblog:
    ipdict.setdefault(webhit.ipaddress, []).append(webhit)
results = [process(webhit, ccinfo)
           for ccinfo in cclog
           for webhit in ipdict[ccinfo.ipaddress]]
The first two lines create a dictionary mapping IP addresses to lists containing the data for each web hit. Because we’re indexing the dictionary by IP address, we are optimizing the data structures for a particular query: “give me all of the web records for a particular IP address.” The list comprehension now iterates over only the data we require—for each credit-card transaction, we iterate over all of the web hits corresponding to the IP address used in that transaction. Not only did the algorithm go from O(M x N) to O(M+N) from a theoretical point of view, but, because we chose to hold in the auxiliary dictionary the data that is sparser from the point of view of the task at hand, we’ve made the solution faster than the alternative (which would also be O(M+N)).
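A self-contained sketch can check that the dictionary-based version yields the same pairs as the naive double loop; the record types and sample data below are invented for illustration:

```python
from collections import namedtuple

# Hypothetical records with the assumed ipaddress attribute.
WebHit = namedtuple("WebHit", "ipaddress url")
CCInfo = namedtuple("CCInfo", "ipaddress amount")

weblog = [WebHit("1.2.3.4", "/home"),
          WebHit("5.6.7.8", "/buy"),
          WebHit("5.6.7.8", "/confirm"),
          WebHit("9.9.9.9", "/about")]
cclog = [CCInfo("5.6.7.8", 42.0)]

def process(webhit, ccinfo):
    return (webhit.url, ccinfo.amount)

# Build the auxiliary index once: one pass over weblog.
ipdict = {}
for webhit in weblog:
    ipdict.setdefault(webhit.ipaddress, []).append(webhit)

# Visit only matching pairs: one pass over cclog plus dictionary lookups.
fast = [process(webhit, ccinfo)
        for ccinfo in cclog
        for webhit in ipdict[ccinfo.ipaddress]]

# The naive O(M x N) version, for comparison.
slow = [process(webhit, ccinfo)
        for webhit in weblog
        for ccinfo in cclog
        if ccinfo.ipaddress == webhit.ipaddress]

print(fast == slow)  # True
print(fast)          # [('/buy', 42.0), ('/confirm', 42.0)]
```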
Note that the test used to determine whether two records correspond to a pair of interest can be arbitrary. The generic description of the solution uses indexing, while the web example uses attribute-getting. The test you use will depend on your application and your data structures.
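One way to accommodate such variation is to pass key-extraction functions in explicitly; this parameterized helper is a generalization of my own, not part of the recipe itself:

```python
import operator

def index_join(xs, ys, xkey, ykey, process):
    # Generic auxiliary-dictionary join: xkey and ykey extract the
    # matching key from each record, however it is stored.
    aux = {}
    for y in ys:
        aux.setdefault(ykey(y), []).append(y)
    return [process(x, y) for x in xs for y in aux.get(xkey(x), [])]

# Index-based keys, as in the generic description of the solution:
pairs = index_join([("a", 1)],
                   [("a", 10), ("b", 20)],
                   operator.itemgetter(0), operator.itemgetter(0),
                   lambda x, y: x[1] + y[1])
print(pairs)  # [11]
```

For records that expose the key as an attribute, as in the web example, operator.attrgetter('ipaddress') would serve as the key function instead.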
See Recipe 1.6 for more details on the setdefault method of dictionaries, and the introduction to Chapter 2.