Don't overdo comprehensions and generators

We've seen how powerful list comprehensions and generator expressions can be. And they are, don't get me wrong, but the feeling that I have when I deal with them is that their complexity grows exponentially. The more you try to do within a single comprehension or a generator expression, the harder it becomes to read, understand, and therefore maintain or change.

If you check the Zen of Python again, there are a few lines that I think are worth keeping in mind when dealing with optimized code:

>>> import this
...
Explicit is better than implicit.
Simple is better than complex.
...
Readability counts.
...
If the implementation is hard to explain, it's a bad idea.
...

Comprehensions and generator expressions are more implicit than explicit, can be quite difficult to read and understand, and they can be hard to explain. Sometimes you have to break them apart using the inside-out technique, to understand what's going on.

To give you an example, let's talk a bit more about Pythagorean triples. Just to remind you, a Pythagorean triple is a tuple of positive integers (a, b, c) such that a2 + b2 = c2.

We saw how to calculate them in the Filtering a comprehension section, but we did it in a very inefficient way because we were scanning all pairs of numbers below a certain threshold, calculating the hypotenuse, and filtering out those that were not producing a triple.

A better way to get a list of Pythagorean triples is to generate them directly. There are many different formulas you can use to do this, we'll use the Euclidean formula.

This formula says that any triple (a, b, c), where a = m2 - n2, b = 2mn, c = m2 + n2, with m and n positive integers such that m > n, is a Pythagorean triple. For example, when m = 2 and n = 1, we find the smallest triple: (3, 4, 5).

There is one catch though: consider the triple (6, 8, 10) that is just like (3, 4, 5) with all the numbers multiplied by 2. This triple is definitely Pythagorean, since 62 + 82 = 102 , but we can derive it from (3, 4, 5) simply by multiplying each of its elements by 2. Same goes for (9, 12, 15), (12, 16, 20), and in general for all the triples that we can write as (3k, 4k, 5k), with k being a positive integer greater than 1.

A triple that cannot be obtained by multiplying the elements of another one by some factor, k, is called primitive. Another way of stating this is: if the three elements of a triple are coprime, then the triple is primitive. Two numbers are coprime when they don't share any prime factor amongst their divisors, that is, their greatest common divisor (GCD) is 1. For example, 3 and 5 are coprime, while 3 and 6 are not, because they are both divisible by 3.

So, the Euclidean formula tells us that if m and n are coprime, and m - n is odd, the triple they generate is primitive. In the following example, we will write a generator expression to calculate all the primitive Pythagorean triples whose hypotenuse (c) is less than or equal to some integer, N. This means we want all triples for which m2 + n2 ≤ N. When n is 1, the formula looks like this: m2 ≤ N - 1, which means we can approximate the calculation with an upper bound of m ≤ N1/2.

So, to recap: m must be greater than n, they must also be coprime, and their difference m - n must be odd. Moreover, in order to avoid useless calculations, we'll put the upper bound for m at floor(sqrt(N)) + 1.

The floor function for a real number, x, gives the maximum integer, n, such that n < x, for example, floor(3.8) = 3, floor(13.1) = 13. Taking floor(sqrt(N)) + 1 means taking the integer part of the square root of N and adding a minimal margin just to make sure we don't miss any numbers.

Let's put all of this into code, step by step. Let's start by writing a simple gcd function that uses Euclid's algorithm:

# functions.py
def gcd(a, b):
"""Calculate the Greatest Common Divisor of (a, b). """
while b != 0:
a, b = b, a % b
return a

The explanation of Euclid's algorithm is available on the web, so I won't spend any time here talking about it; we need to focus on the generator expression. The next step is to use the knowledge we gathered before to generate a list of primitive Pythagorean triples:

# pythagorean.triple.generation.py
from functions import gcd
N = 50

triples = sorted( # 1
((a, b, c) for a, b, c in ( # 2
((m**2 - n**2), (2 * m * n), (m**2 + n**2)) # 3
for m in range(1, int(N**.5) + 1) # 4
for n in range(1, m) # 5
if (m - n) % 2 and gcd(m, n) == 1 # 6
) if c <= N), key=lambda *triple: sum(*triple) # 7
)

There you go. It's not easy to read, so let's go through it line by line. At #3, we start a generator expression that is creating triples. You can see from #4 and #5 that we're looping on m in [1, M] with M being the integer part of sqrt(N), plus 1. On the other hand, n loops within [1, m), to respect the m > n rule. It's worth noting how I calculated sqrt(N), that is, N**.5, which is just another way to do it that I wanted to show you.

At #6, you can see the filtering conditions to make the triples primitive: (m - n) % 2 evaluates to True when (m - n) is odd, and gcd(m, n) == 1 means m and n are coprime. With these in place, we know the triples will be primitive. This takes care of the innermost generator expression. The outermost one starts at #2, and finishes at #7. We take the triples (a, b, c) in (...innermost generator...) such that c <= N

Finally, at #1 we apply sorting, to present the list in order. At #7, after the outermost generator expression is closed, you can see that we specify the sorting key to be the sum a + b + c. This is just my personal preference, there is no mathematical reason behind it.

So, what do you think? Was it straightforward to read? I don't think so. And believe me, this is still a simple example; I have seen much worse in my career. This kind of code is difficult to understand, debug, and modify. It shouldn't find a place in a professional environment.

So, let's see if we can rewrite this code into something more readable:

# pythagorean.triple.generation.for.py
from functions import gcd

def gen_triples(N):
for m in range(1, int(N**.5) + 1): # 1
for n in range(1, m): # 2
if (m - n) % 2 and gcd(m, n) == 1: # 3
c = m**2 + n**2 # 4
if c <= N: # 5
a = m**2 - n**2 # 6
b = 2 * m * n # 7
yield (a, b, c) # 8

triples = sorted(
gen_triples(50), key=lambda *triple: sum(*triple)) # 9

This is so much better. Let's go through it, line by line. You'll see how much easier it is to understand.

We start looping at #1 and #2, in exactly the same way we were looping in the previous example. On line #3, we have the filtering for primitive triples. On line #4, we deviate a bit from what we were doing before: we calculate c, and on line #5, we filter on c being less than or equal to N. Only when c satisfies that condition, we do calculate a and b, and yield the resulting tuple. It's always good to delay all calculations for as much as possible so that we don't waste time and CPU. On the last line, we apply sorting with the same key we were using in the generator expression example.

I hope you agree, this example is easier to understand. And I promise you, if you have to modify the code one day, you'll find that modifying this one is easy, while to modify the other version will take much longer (and it will be more error-prone).

If you print the results of both examples (they are the same), you will get this:

[(3, 4, 5), (5, 12, 13), (15, 8, 17), (7, 24, 25), (21, 20, 29), (35, 12, 37), (9, 40, 41)]  

The moral of the story is, try and use comprehensions and generator expressions as much as you can, but if the code starts to be complicated to modify or to read, you may want to refactor it into something more readable. Your colleagues will thank you.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset