Distributed chunking with execnet

In this recipe, we'll do chunking and tagging over an execnet gateway. This will be very similar to the tagging in the previous recipe, but we'll be sending two objects instead of one, and receiving a Tree instead of a list, which requires pickling and unpickling for serialization.

Getting ready

As in the previous recipe, you must have execnet installed.

How to do it...

The setup code is very similar to the last recipe, and we'll use the same pickled tagger as well. First, we'll pickle the default chunker used by nltk.chunk.ne_chunk(), though any chunker would do. Next, we make a gateway for the remote_chunk module, get a channel, and send the pickled tagger and chunker over. Then, we receive a pickled Tree, which we can unpickle and inspect to see the result. Finally, we exit the gateway:

>>> import execnet, remote_chunk
>>> import nltk.data, nltk.tag, nltk.chunk
>>> import pickle
>>> from nltk.corpus import treebank_chunk
>>> tagger = pickle.dumps(nltk.data.load(nltk.tag._POS_TAGGER))
>>> chunker = pickle.dumps(nltk.data.load(nltk.chunk._MULTICLASS_NE_CHUNKER))
>>> gw = execnet.makegateway()
>>> channel = gw.remote_exec(remote_chunk)
>>> channel.send(tagger)
>>> channel.send(chunker)
>>> channel.send(treebank_chunk.sents()[0])
>>> chunk_tree = pickle.loads(channel.receive())
>>> chunk_tree
Tree('S', [Tree('PERSON', [('Pierre', 'NNP')]), Tree('ORGANIZATION', [('Vinken', 'NNP')]), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')])
>>> gw.exit()

The communication this time is slightly different, as shown in the following diagram:

[Diagram: the local process sends the pickled tagger, the pickled chunker, and a tokenized sentence over the channel, and receives a pickled Tree back.]

How it works...

The remote_chunk.py module is just a little bit more complicated than the remote_tag.py module from the previous recipe. In addition to receiving a pickled tagger, it also expects to receive a pickled chunker that implements the ChunkerI interface. Once it has both a tagger and a chunker, it expects to receive any number of tokenized sentences, which it tags and parses into a Tree. This Tree is then pickled and sent back over the channel:

import pickle

if __name__ == '__channelexec__':
    # Receive the pickled tagger and chunker sent by the gateway starter.
    tagger = pickle.loads(channel.receive())
    chunker = pickle.loads(channel.receive())

    # Tag and chunk each tokenized sentence that arrives, then send the
    # resulting Tree back as pickled bytes.
    for sentence in channel:
        chunk_tree = chunker.parse(tagger.tag(sentence))
        channel.send(pickle.dumps(chunk_tree))
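
On the calling side, the for sentence in channel loop means you can stream any number of sentences through the same channel and collect the trees in the order they were sent. Here is a minimal sketch, assuming the same setup as the interactive session earlier in this recipe, but before calling gw.exit():

>>> sentences = treebank_chunk.sents()[:3]
>>> for sent in sentences:
...     channel.send(sent)
...
>>> trees = [pickle.loads(channel.receive()) for _ in sentences]
>>> len(trees)
3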

Note

The Tree must be pickled before sending because execnet channels can only transfer simple built-in types such as strings, numbers, tuples, lists, and dictionaries; a Tree is not one of these, so it has to cross the channel as pickled bytes.
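
The pickle step is just an ordinary round-trip to bytes and back; the gateway never sees the Tree itself. A minimal sketch of that round-trip, independent of any gateway:

>>> import pickle
>>> from nltk.tree import Tree
>>> tree = Tree('S', [Tree('PERSON', [('Pierre', 'NNP')]), ('joined', 'VBD')])
>>> payload = pickle.dumps(tree)        # bytes: a simple type the channel can carry
>>> isinstance(payload, bytes)
True
>>> pickle.loads(payload) == tree       # reconstructed on the receiving side
True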

There's more...

Note that the remote_chunk module is self-contained: its only import is the pickle module, which is part of the Python standard library. It never needs to import any NLTK modules itself, because the tagger and chunker arrive fully constructed over the channel. As long as you structure your remote code like this, with no extra imports, only the machine that starts the gateway needs the NLTK data and trained models; the remote interpreter just has to be able to unpickle what it is sent, which means the nltk package must be importable wherever that interpreter runs.
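
Because execnet ships the remote_chunk source itself, the same session can run on another machine by changing only the gateway specification. The following is a hedged sketch: worker1 is a placeholder host name, and it assumes the pickled tagger and chunker byte strings from the How to do it... section are still in scope and that the remote Python can import nltk (so the received objects can be unpickled):

>>> import execnet, pickle
>>> import remote_chunk
>>> from nltk.corpus import treebank_chunk
>>> gw = execnet.makegateway('ssh=worker1//python=python3')  # worker1 is a placeholder host
>>> channel = gw.remote_exec(remote_chunk)  # the module's source is sent over ssh
>>> channel.send(tagger)    # pickled tagger bytes from the earlier session
>>> channel.send(chunker)   # pickled chunker bytes
>>> channel.send(treebank_chunk.sents()[0])
>>> tree = pickle.loads(channel.receive())
>>> gw.exit()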

Python subprocesses

If you look at your task/system monitor (or top on *nix) while running the execnet code, you may notice a few extra Python processes. Every gateway spawns a new, self-contained, shared-nothing Python interpreter process, which is killed when you call the exit() method. Unlike with threads, there is no shared memory to worry about, and no global interpreter lock to slow things down. All you have are separate communicating processes. This is true whether the processes are local or remote. Instead of locking and synchronization, all you have to worry about is the order in which the messages are sent and received.
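
You can verify the separate-process claim directly by asking each gateway for its process ID. A quick check, assuming nothing beyond execnet itself:

>>> import execnet, os
>>> gateways = [execnet.makegateway() for _ in range(2)]
>>> pids = set()
>>> for gw in gateways:
...     ch = gw.remote_exec("import os; channel.send(os.getpid())")
...     pids.add(ch.receive())
...
>>> len(pids), os.getpid() in pids
(2, False)
>>> for gw in gateways:
...     gw.exit()
...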

See also

The previous recipe explains execnet gateways and channels in detail. In the next recipe, we'll use execnet to process a list in parallel.
