It's fairly common to find incorrect verb forms in real-world language. For example, the correct form of "is our children learning?" is "are our children learning?" The verb "is" should only be used with singular nouns, while "are" is for plural nouns, such as "children". We can correct these mistakes by creating verb correction mappings that are applied depending on whether the chunk contains a plural or a singular noun.
We first need to define the verb correction mappings in transforms.py. We'll create two mappings, one that converts singular verb forms to plural and another that converts plural verb forms to singular:
plural_verb_forms = {
    ('is', 'VBZ'): ('are', 'VBP'),
    ('was', 'VBD'): ('were', 'VBD')
}

singular_verb_forms = {
    ('are', 'VBP'): ('is', 'VBZ'),
    ('were', 'VBD'): ('was', 'VBD')
}
Each mapping has a tagged verb that maps to another tagged verb. These initial mappings cover the basics of mapping is to are, was to were, and vice versa.
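As a minimal sketch of how such a mapping can be consulted, the following uses dict.get() with the original tagged verb as the fallback, so that verbs without an entry pass through unchanged. The helper name to_plural_form() is hypothetical and only serves this illustration; it is not part of transforms.py.

```python
# Mapping as defined in the recipe.
plural_verb_forms = {
    ('is', 'VBZ'): ('are', 'VBP'),
    ('was', 'VBD'): ('were', 'VBD'),
}

def to_plural_form(tagged_verb):
    """Hypothetical helper: return the plural form, or the input if unmapped."""
    return plural_verb_forms.get(tagged_verb, tagged_verb)

print(to_plural_form(('is', 'VBZ')))     # ('are', 'VBP')
print(to_plural_form(('learn', 'VBP')))  # ('learn', 'VBP'), no entry so unchanged
```

Falling back to the input keeps the lookup safe for any verb the mappings do not yet cover.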
transforms.py also defines a function called correct_verbs(). Pass it a chunk with incorrect verb forms and you'll get a corrected chunk back. It uses a helper function, first_chunk_index(), to search the chunk for the position of the first tagged word where pred returns True. The pred argument should be a callable that takes a (word, tag) tuple and returns True or False. Here's first_chunk_index():
def first_chunk_index(chunk, pred, start=0, step=1):
    l = len(chunk)
    end = l if step > 0 else -1

    for i in range(start, end, step):
        if pred(chunk[i]):
            return i

    return None
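To illustrate the start and step arguments, here is a small sketch that searches the same chunk forward and backward. The predicate is_verb() is a hand-written stand-in with the VB prefix hard-coded, used here only for illustration.

```python
def first_chunk_index(chunk, pred, start=0, step=1):
    end = len(chunk) if step > 0 else -1
    for i in range(start, end, step):
        if pred(chunk[i]):
            return i
    return None

chunk = [('is', 'VBZ'), ('our', 'PRP$'), ('children', 'NNS'), ('learning', 'VBG')]

def is_verb(wt):
    # Illustrative predicate with a hard-coded tag prefix.
    return wt[1].startswith('VB')

# Forward search from the beginning of the chunk.
first_verb = first_chunk_index(chunk, is_verb)
# Backward search starting at the last position.
last_verb = first_chunk_index(chunk, is_verb, start=len(chunk) - 1, step=-1)
# No adjective in the chunk, so the search falls through to None.
no_match = first_chunk_index(chunk, lambda wt: wt[1] == 'JJ')

print(first_verb, last_verb, no_match)  # 0 3 None
```

With a negative step, the end bound of -1 lets the loop reach index 0 before stopping.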
For first_chunk_index() to be useful, we need to use a predicate function. In the case of correct_verbs(), the predicate function we need should return True if the tag in the (word, tag) argument starts with a given tag prefix, and False otherwise.
def tag_startswith(prefix):
    def f(wt):
        return wt[1].startswith(prefix)
    return f
The tag_startswith() function takes a tag prefix, such as NN, and returns a predicate function that will take a (word, tag) tuple and return True if the tag starts with the given prefix. A function that returns another function is called a higher-order function. This is not as complicated as it might sound: just as a function can create and return new values, some programming languages (such as Python) let you define functions inside other functions and return them. In this case, we want a predicate that takes a single argument, a (word, tag) tuple, but that also has access to a prefix variable. Since first_chunk_index() calls the predicate with only that one argument, we cannot add prefix as a second parameter. Instead, tag_startswith() builds an inner function that remembers prefix (a closure) while still accepting the single (word, tag) argument.
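The sketch below shows two predicates generated from the same higher-order function, each carrying its own prefix. For comparison, it also shows functools.partial from the standard library as an alternative way to fix the prefix; the two-argument function tag_has_prefix() is a hypothetical name introduced only for this comparison and is not part of the recipe.

```python
from functools import partial

def tag_startswith(prefix):
    def f(wt):
        return wt[1].startswith(prefix)
    return f

# Each call returns a separate predicate with its own prefix.
is_noun = tag_startswith('NN')
is_verb = tag_startswith('VB')

print(is_noun(('children', 'NNS')))  # True
print(is_verb(('children', 'NNS')))  # False

# Alternative (not used in the recipe): bind the prefix with functools.partial.
def tag_has_prefix(prefix, wt):
    return wt[1].startswith(prefix)

is_noun_partial = partial(tag_has_prefix, 'NN')
print(is_noun_partial(('child', 'NN')))  # True
```

Either approach yields a one-argument callable that can be passed as pred to first_chunk_index().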
Now that we have defined first_chunk_index() and tag_startswith(), we can actually implement correct_verbs(). This may seem like overkill for a single function, but we will be using first_chunk_index() and tag_startswith() in subsequent recipes.
def correct_verbs(chunk):
    vbidx = first_chunk_index(chunk, tag_startswith('VB'))
    # if no verb found, do nothing
    if vbidx is None:
        return chunk

    verb, vbtag = chunk[vbidx]
    nnpred = tag_startswith('NN')
    # find nearest noun to the right of verb
    nnidx = first_chunk_index(chunk, nnpred, start=vbidx+1)
    # if no noun found to right, look to the left
    if nnidx is None:
        nnidx = first_chunk_index(chunk, nnpred, start=vbidx-1, step=-1)
    # if no noun found, do nothing
    if nnidx is None:
        return chunk

    noun, nntag = chunk[nnidx]
    # get correct verb form and insert into chunk
    if nntag.endswith('S'):
        chunk[vbidx] = plural_verb_forms.get((verb, vbtag), (verb, vbtag))
    else:
        chunk[vbidx] = singular_verb_forms.get((verb, vbtag), (verb, vbtag))

    return chunk
When we call the preceding function on a part-of-speech tagged "is our children learning" chunk, we get back the correct form, "are our children learning".
>>> from transforms import correct_verbs
>>> correct_verbs([('is', 'VBZ'), ('our', 'PRP$'), ('children', 'NNS'), ('learning', 'VBG')])
[('are', 'VBP'), ('our', 'PRP$'), ('children', 'NNS'), ('learning', 'VBG')]
We can also try this with a singular noun and an incorrect plural verb:
>>> correct_verbs([('our', 'PRP$'), ('child', 'NN'), ('were', 'VBD'), ('learning', 'VBG')])
[('our', 'PRP$'), ('child', 'NN'), ('was', 'VBD'), ('learning', 'VBG')]
In this case, "were" becomes "was" because "child" is a singular noun.
The correct_verbs() function starts by looking for a verb in the chunk. If no verb is found, the chunk is returned with no changes. Once a verb is found, we keep the verb, its tag, and its index in the chunk. Then, we look on either side of the verb to find the nearest noun, starting on the right and looking to the left only if no noun is found on the right. If no noun is found at all, the chunk is returned as is. But if a noun is found, then we look up the correct verb form depending on whether or not the noun is plural.
Recall from Chapter 4, Part-of-speech Tagging, that plural nouns are tagged with NNS, while singular nouns are tagged with NN. That means we can check the plurality of a noun by looking to see whether its tag ends with S. Once we get the corrected verb form, it is inserted into the chunk to replace the original verb form.
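The plurality check can be sketched in isolation as follows. The helper name is_plural_noun_tag() is hypothetical; the sketch also assumes the Penn Treebank tagset, in which proper nouns are tagged NNP (singular) and NNPS (plural), so the same suffix test covers them too.

```python
def is_plural_noun_tag(tag):
    """Hypothetical helper: True for plural noun tags such as NNS and NNPS."""
    return tag.startswith('NN') and tag.endswith('S')

for tag in ('NN', 'NNS', 'NNP', 'NNPS'):
    print(tag, is_plural_noun_tag(tag))
# NN False
# NNS True
# NNP False
# NNPS True
```

Inside correct_verbs(), the tag is already known to start with NN, which is why the recipe only needs the endswith('S') test.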