Generators

Generators are one of the most powerful tools that Python gives us. They are based on the concept of iteration, as we said before, and they allow for coding patterns that combine elegance with efficiency.

Generators are of two types:

  • Generator functions: These are very similar to regular functions, but instead of returning results through return statements, they use yield, which allows them to suspend and resume their state between one call and the next
  • Generator expressions: These are very similar to the list comprehensions we've seen in this chapter, but instead of returning a list they return an object that produces results one by one

Generator functions

Generator functions behave like regular functions in all respects, with one difference: instead of collecting all the results and returning them at once, they can start the computation, yield one value, suspend their state (saving everything they need to be able to resume), and, when asked again, resume and perform another step. Generator functions are automatically turned into their own iterators by Python, so you can call next on them.

This is all very theoretical, so let's make it clear why such a mechanism is so powerful, and then let's see an example.

Say I asked you to count out loud from 1 to a million. You start, and at some point I ask you to stop. After some time, I ask you to resume. At this point, what is the minimum information you need to be able to resume correctly? Well, you need to remember the last number you called. If I stopped you after 31415, you will just go on with 31416, and so on.

The point is, you don't need to remember all the numbers you said before 31415, nor do you need them to be written down somewhere. Well, you may not know it, but you're behaving like a generator already!
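The counting analogy translates almost directly into code. Here is a quick sketch (count_up is just a name I chose for this illustration; yield will be explained properly in a moment):

```python
def count_up(limit):
    n = 1
    while n <= limit:
        yield n  # pause here; all we remember is n
        n += 1

counting = count_up(1_000_000)
for _ in range(31415):     # "count out loud" up to 31415...
    last = next(counting)
print(last)                # prints: 31415
print(next(counting))      # ...then resume: prints 31416
```

Notice that at no point do we store the numbers already said: resuming only requires the current value of n.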

Take a good look at the following code:

first.n.squares.py

def get_squares(n):  # classic function approach
    return [x ** 2 for x in range(n)]
print(get_squares(10))

def get_squares_gen(n):  # generator approach
    for x in range(n):
        yield x ** 2  # we yield, we don't return
print(list(get_squares_gen(10)))

The result of the prints will be the same: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]. But there is a huge difference between the two functions. get_squares is a classic function that collects all the squares of the numbers in [0, n) in a list and returns it. On the other hand, get_squares_gen is a generator, and behaves very differently. Each time the interpreter reaches the yield line, its execution is suspended. The only reason those prints produce the same result is that we fed get_squares_gen to the list constructor, which exhausts the generator completely by asking for the next element until StopIteration is raised. Let's see this in detail:

first.n.squares.manual.py

def get_squares_gen(n):
    for x in range(n):
        yield x ** 2

squares = get_squares_gen(4)  # this creates a generator object
print(squares)  # <generator object get_squares_gen at 0x7f158...>
print(next(squares))  # prints: 0
print(next(squares))  # prints: 1
print(next(squares))  # prints: 4
print(next(squares))  # prints: 9
# the following raises StopIteration, the generator is exhausted,
# any further call to next will keep raising StopIteration
print(next(squares))

In the preceding code, each time we call next on the generator object, we either start it (first next) or make it resume from the last suspension point (any other next).

The first time we call next on it, we get 0, which is the square of 0, then 1, then 4, then 9. Since the for loop stops after that (n is 4), the generator naturally ends. A classic function would at that point just return None, but in order to comply with the iteration protocol, a generator raises a StopIteration exception instead.

This explains how a for loop works, for example. When you write for k in range(n), what happens under the hood is that the for loop gets an iterator out of range(n) and starts calling next on it, until StopIteration is raised, which tells the for loop that the iteration has reached its end.
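That mechanism can be sketched by hand. The following is a rough equivalent of for k in range(3): print(k), written out as a while loop; it shows the idea, not the interpreter's exact code:

```python
# A hand-written equivalent of: for k in range(3): print(k)
iterator = iter(range(3))   # the for loop asks the iterable for an iterator
while True:
    try:
        k = next(iterator)  # ask for the next element
    except StopIteration:
        break               # the iteration is over, exit the loop
    print(k)                # the for loop's body: prints 0, 1, 2
```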

Having this behavior built into every aspect of iteration in Python makes generators even more powerful, because once we write them, we can plug them into whatever iteration mechanism we want.

At this point, you're probably asking yourself why you would want to use a generator instead of a regular function. Well, the title of this chapter should suggest the answer. I'll talk about performance later, so for now let's concentrate on another aspect: sometimes generators allow you to do something that wouldn't be possible with a simple list. For example, say you want to analyze all the permutations of a sequence. If the sequence has length N, then the number of its permutations is N!. This means that if the sequence is 10 elements long, the number of permutations is 3628800. But a sequence of 20 elements would have 2432902008176640000 permutations. They grow factorially.

Now imagine you have a classic function that attempts to calculate all the permutations, put them in a list, and return it to you. With 10 elements, it would probably take a few tens of seconds, but for 20 elements there is simply no way it could be done.

On the other hand, a generator function will be able to start the computation and give you back the first permutation, then the second, and so on. Of course you won't have the time to process them all, there are too many, but at least you'll be able to work with some of them.
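If you want to try this yourself, the standard library's itertools module already exposes permutations lazily, so we can peek at the first few permutations of a 20-element sequence without ever materializing all 20! of them:

```python
from itertools import islice, permutations

# permutations returns a lazy iterator: created instantly, no huge list
perms = permutations(range(20))
first_three = list(islice(perms, 3))  # take only the first 3 of 20!
print(first_three[0][:5])  # prints: (0, 1, 2, 3, 4)
```

The call returns immediately, because no permutation is computed until we ask for it.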

Remember when we were talking about the break statement in for loops? When we found a number dividing a candidate prime we were breaking the loop, no need to go on.

Sometimes it's exactly the same, only the amount of data you have to iterate over is so huge that you cannot keep it all in memory in a list. In this case, generators are invaluable: they make possible what wouldn't be possible otherwise.

So, in order to save memory (and time), use generator functions whenever possible.

It's also worth noting that you can use the return statement in a generator function. It causes a StopIteration exception to be raised, effectively ending the iteration. This is extremely important: if a plain return could hand back a value the way it does in a regular function, it would break the iteration protocol. Python's consistency prevents this (since Python 3.3, a returned value is attached to the StopIteration exception instead, as we'll note shortly), and allows us great ease when coding. Let's see a quick example:

gen.yield.return.py

def geometric_progression(a, q):
    k = 0
    while True:
        result = a * q**k
        if result <= 100000:
            yield result
        else:
            return
        k += 1

for n in geometric_progression(2, 5):
    print(n)

The preceding code yields all terms of the geometric progression a, aq, aq², aq³, .... When the progression produces a term that is greater than 100,000, the generator stops (with a return statement). Running the code produces the following result:

$ python gen.yield.return.py
2
10
50
250
1250
6250
31250

The next term would have been 156250, which is too big.
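As a side note, since Python 3.3 a generator's return statement may even carry a value: the value travels on the StopIteration exception instead of being returned directly, so the iteration protocol stays intact. A quick sketch (first_n is a name I made up for this illustration):

```python
def first_n(iterable, n):
    count = 0
    for item in iterable:
        if count == n:
            return count  # attached to StopIteration as its .value
        yield item
        count += 1
    return count

gen = first_n('abcde', 3)
try:
    while True:
        print(next(gen))  # prints: a, b, c (one per line)
except StopIteration as e:
    print('yielded', e.value, 'items')  # prints: yielded 3 items
```

Constructs such as yield from rely on exactly this mechanism to receive a subgenerator's return value.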

Going beyond next

At the beginning of this chapter, I told you that generator objects are based on the iteration protocol. We'll see in the next chapter a complete example of how to write a custom iterator/iterable object. For now, I just want you to understand how next() works.

What happens when you call next(generator) is that you're calling the generator.__next__() method. Remember, a method is just a function that belongs to an object, and objects in Python can have special methods. Our friend __next__() is just one of these and its purpose is to return the next element of the iteration, or to raise StopIteration when the iteration is over and there are no more elements to return.

Note

In Python, an object's special methods are also called magic methods, or dunder (from "double underscore") methods.

When we write a generator function, Python automatically transforms it into an object that is very similar to an iterator, and when we call next(generator), that call is transformed into generator.__next__(). Let's revisit the previous example about generating squares:

first.n.squares.manual.method.py

def get_squares_gen(n):
    for x in range(n):
        yield x ** 2

squares = get_squares_gen(3)
print(squares.__next__())  # prints: 0
print(squares.__next__())  # prints: 1
print(squares.__next__())  # prints: 4
# the following raises StopIteration, the generator is exhausted,
# any further call to next will keep raising StopIteration
print(squares.__next__())

The result is exactly the same as in the previous example; the only difference is that instead of using the proxy call next(squares), we're calling squares.__next__() directly.

Generator objects also have three other methods that allow us to control their behavior: send, throw, and close. send allows us to communicate a value back to the generator object, while throw and close, respectively, allow us to raise an exception within the generator and to close it. Their use is quite advanced and I won't be covering them here in detail, but I want to spend a few words at least on send, with a simple example.
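Just to give you a taste of one of the other two, here is a tiny sketch of close: it raises a GeneratorExit exception inside the generator at the suspension point, giving it a chance to clean up (reader is a name I made up for this illustration):

```python
def reader():
    try:
        while True:
            yield 'line'
    except GeneratorExit:
        print('cleaning up')  # e.g. close a file handle here
        raise                  # let the generator terminate properly

r = reader()
print(next(r))  # prints: line
r.close()       # prints: cleaning up
```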

Take a look at the following code:

gen.send.preparation.py

def counter(start=0):
    n = start
    while True:
        yield n
        n += 1

c = counter()
print(next(c))  # prints: 0
print(next(c))  # prints: 1
print(next(c))  # prints: 2

The preceding generator function creates a generator object that will run forever. You can keep calling next on it; it will never stop. Alternatively, you can put it in a for loop, for example, for n in counter(): ..., and it will go on forever as well.

Now, what if you wanted to stop it at some point? One solution is to use a variable to control the while loop. Something like this:

gen.send.preparation.stop.py

stop = False
def counter(start=0):
    n = start
    while not stop:
        yield n
        n += 1

c = counter()
print(next(c))  # prints: 0
print(next(c))  # prints: 1
stop = True
print(next(c))  # raises StopIteration

This will do it. We start with stop = False, and until we change it to True, the generator will just keep going, as before. The moment we change stop to True, though, the while loop exits, and the next call raises a StopIteration exception. This trick works, but I don't like it. We depend on an external variable, and this can lead to issues: what if another function changes that stop flag? Moreover, the code is scattered. In a nutshell, this isn't good enough.

We can make it better by using generator.send(). When we call generator.send(), the value that we feed to send is passed to the generator, execution resumes, and the generator receives the value as the result of the yield expression. This is all very complicated to explain with words, so let's see an example:

gen.send.py

def counter(start=0):
    n = start
    while True:
        result = yield n             # A
        print(type(result), result)  # B
        if result == 'Q':
            break
        n += 1

c = counter()
print(next(c))         # C
print(c.send('Wow!'))  # D
print(next(c))         # E
print(c.send('Q'))     # F

Execution of the preceding code produces the following:

$ python gen.send.py
0
<class 'str'> Wow!
1
<class 'NoneType'> None
2
<class 'str'> Q
Traceback (most recent call last):
  File "gen.send.py", line 14, in <module>
    print(c.send('Q'))     # F
StopIteration

I think it's worth going through this code line by line, like if we were executing it, and see if we can understand what's going on.

We start the generator execution with a call to next (#C). Within the generator, n is set to the same value as start. The while loop is entered, execution suspends (#A), and n (0) is yielded back to the caller. 0 is printed on the console.

We then call send (#D), execution resumes and result is set to 'Wow!' (still #A), then its type and value are printed on the console (#B). result is not 'Q', therefore n is incremented by 1 and execution goes back to the while condition, which, being True, evaluates to True (that wasn't hard to guess, right?). Another loop cycle begins, execution stops again (#A), and n (1) is yielded back to the caller. 1 is printed on the console.

At this point, we call next (#E) and execution is resumed again (#A). Because we are not sending anything to the generator explicitly, Python behaves exactly as it does for functions that don't use the return statement: the yield n expression (#A) returns None. result is therefore set to None, and its type and value are yet again printed on the console (#B). Execution continues, result is not 'Q', so n is incremented by 1, and we start another loop cycle. Execution suspends again (#A) and n (2) is yielded back to the caller. 2 is printed on the console.

And now for the grand finale: we call send again (#F), but this time we pass in 'Q', therefore when execution is resumed, result is set to 'Q' (#A). Its type and value are printed on the console (#B), and then finally the if clause evaluates to True and the while loop is stopped by the break statement. The generator naturally terminates and this means a StopIteration exception is raised. You can see the print of its traceback on the last few lines printed on the console.

This is not at all simple to understand at first, so if it's not clear to you, don't be discouraged. You can keep reading on and then you can come back to this example after some time.

Using send allows for interesting patterns, and it's worth noting that send can only be used to resume execution, not to start it. Only next (or the equivalent send(None)) starts the execution of a generator; sending any other value to a generator that hasn't started yet raises a TypeError.
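We can verify that rule quite easily (echo is a name I made up for this illustration):

```python
def echo():
    while True:
        received = yield
        print('got', received)

e = echo()
try:
    e.send('hello')  # too early: nothing is waiting at a yield yet
except TypeError as err:
    print(err)       # explains that a non-None value can't be sent yet

e.send(None)     # same as next(e): run to the first yield
e.send('hello')  # prints: got hello
```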

The yield from expression

Another interesting construct is the yield from expression. This expression allows you to yield values from a subiterator. Its use allows for quite advanced patterns, so let's just see a very quick example of it:

gen.yield.for.py

def print_squares(start, end):
    for n in range(start, end):
        yield n ** 2

for n in print_squares(2, 5):
    print(n)

The previous code prints the numbers 4, 9, and 16 on the console (on separate lines). By now, I expect you to be able to understand it by yourself, but let's quickly recap what happens. The for loop outside the function gets an iterator from print_squares(2, 5) and calls next on it until the iteration is over. Every time next is called, execution is suspended (and later resumed) on yield n ** 2, which returns the square of the current n.

Let's see how we can transform this code benefiting from the yield from expression:

gen.yield.from.py

def print_squares(start, end):
    yield from (n ** 2 for n in range(start, end))

for n in print_squares(2, 5):
    print(n)

This code produces the same result, but as you can see, yield from is actually running a subiterator, (n ** 2 …). The yield from expression forwards to the caller each value the subiterator produces. It's shorter, and it reads better.
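yield from really shines when delegating to several subiterators in turn. Here is a small sketch (chain is a name I chose for this illustration; the standard library offers itertools.chain for the same job):

```python
def chain(*iterables):
    for iterable in iterables:
        yield from iterable  # delegate to each subiterator in turn

print(list(chain('abc', range(3), [10, 20])))
# prints: ['a', 'b', 'c', 0, 1, 2, 10, 20]
```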

Generator expressions

Let's now talk about the other technique for generating values one at a time.

The syntax is exactly the same as that of list comprehensions, only, instead of wrapping the comprehension in square brackets, you wrap it in parentheses. That is called a generator expression.

In general, generator expressions behave like equivalent list comprehensions, but there is one very important thing to remember: generators allow for one iteration only, then they will be exhausted. Let's see an example:

generator.expressions.py

>>> cubes = [k**3 for k in range(10)]  # regular list
>>> cubes
[0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
>>> type(cubes)
<class 'list'>
>>> cubes_gen = (k**3 for k in range(10))  # create as generator
>>> cubes_gen
<generator object <genexpr> at 0x7ff26b5db990>
>>> type(cubes_gen)
<class 'generator'>
>>> list(cubes_gen)  # this will exhaust the generator
[0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
>>> list(cubes_gen)  # nothing more to give
[]

Look at the line in which the generator expression is created and assigned the name cubes_gen. You can see it's a generator object. In order to see its elements, we can use a for loop, a manual set of calls to next, or simply feed it to the list constructor, which is what I did.

Notice how, once the generator has been exhausted, there is no way to recover the same elements from it again. We need to recreate it, if we want to use it from scratch again.
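A common way to get a fresh supply of the same elements is to wrap the generator expression in a function, and call the function whenever a new generator is needed:

```python
def cubes_gen():
    # each call builds a brand new generator object
    return (k**3 for k in range(10))

print(sum(cubes_gen()))  # prints: 2025
print(sum(cubes_gen()))  # prints: 2025 again, on a fresh generator
```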

In the next few examples, let's see how to reproduce map and filter using generator expressions:

gen.map.py

def adder(*n):
    return sum(n)
s1 = sum(map(lambda n: adder(*n), zip(range(100), range(1, 101))))
s2 = sum(adder(*n) for n in zip(range(100), range(1, 101)))

In the previous example, s1 and s2 are exactly the same: they are the sum of adder(0, 1), adder(1, 2), adder(2, 3), and so on, which translates to sum(1, 3, 5, ...). The syntax is different, though; I find the generator expression much more readable:

gen.filter.py

cubes = [x**3 for x in range(10)]
odd_cubes1 = filter(lambda cube: cube % 2, cubes)
odd_cubes2 = (cube for cube in cubes if cube % 2)

In the previous example, odd_cubes1 and odd_cubes2 are the same: they generate a sequence of odd cubes. Yet again, I prefer the generator syntax. This should be evident when things get a little more complicated:

gen.map.filter.py

N = 20
cubes1 = map(
    lambda n: (n, n**3),
    filter(lambda n: n % 3 == 0 or n % 5 == 0, range(N))
)
cubes2 = (
    (n, n**3) for n in range(N) if n % 3 == 0 or n % 5 == 0)

The preceding code creates two generators, cubes1 and cubes2. They are exactly the same, and produce 2-tuples (n, n**3) when n is a multiple of 3 or 5.

If you print list(cubes1), you get: [(0, 0), (3, 27), (5, 125), (6, 216), (9, 729), (10, 1000), (12, 1728), (15, 3375), (18, 5832)].

See how much better the generator expression reads? It may be debatable when things are very simple, but as soon as you start nesting functions a bit, like we did in this example, the superiority of the generator syntax is evident. Shorter, simpler, more elegant.

Now, let me ask you a question: what is the difference between the following lines of code?

sum.example.py

s1 = sum([n**2 for n in range(10**6)])
s2 = sum((n**2 for n in range(10**6)))
s3 = sum(n**2 for n in range(10**6))

Strictly speaking, they all produce the same sum. The expressions that produce s2 and s3 are exactly the same, because the parentheses in s2 are redundant. They are both generator expressions inside the sum function. The expression that produces s1 is different though. Inside sum, we find a list comprehension. This means that in order to calculate s1, the sum function has to iterate over a list, calling next a million times.

Do you see where we're losing time and memory? Before sum can start calling next on that list, the list needs to have been created, which is a waste of time and space. It's much better for sum to call next on a simple generator expression. There is no need to have all the numbers from range(10**6) stored in a list.
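You can get a rough feel for the memory side of this with sys.getsizeof, which reports the size of the container object itself (the exact numbers vary by platform and Python version):

```python
import sys

squares_list = [n**2 for n in range(10**6)]   # a million references held at once
squares_gen = (n**2 for n in range(10**6))    # a tiny object, nothing computed yet

print(sys.getsizeof(squares_list))  # several megabytes' worth of references
print(sys.getsizeof(squares_gen))   # a couple of hundred bytes at most
```

Note that getsizeof doesn't count the integers themselves, only the list's internal array of references; the real gap is even bigger.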

So, watch out for extra square brackets when you write your expressions: sometimes it's easy to overlook these details, and they can make a big difference to our code. Don't believe me?

sum.example.2.py

s = sum([n**2 for n in range(10**8)])  # this is killed
# s = sum(n**2 for n in range(10**8))  # this succeeds
print(s)

Try running the preceding example. If I run the first line, this is what I get:

$ python sum.example.2.py
Killed

On the other hand, if I comment out the first line, and uncomment the second one, this is the result:

$ python sum.example.2.py
333333328333333350000000

Sweet generator expressions. The difference between the two lines is that in the first one, a list with the squares of the first hundred million numbers must be built before the sum can begin. That list is huge, and we ran out of memory (at least, my box did; if yours doesn't, try a bigger number), therefore Python killed the process for us. Sad face.

But when we remove the square brackets, we don't make a list any more. The sum function receives 0, 1, 4, 9, and so on until the last one, and sums them up. No problems, happy face.
