In previous chapters, we covered the preprocessing steps needed to work with any text corpus. You should now be comfortable parsing and cleaning any kind of text, and performing preprocessing steps such as tokenization, stemming, and stop word removal, customizing these tools to fit your needs. So far, we have mainly discussed generic preprocessing of text documents. Now let's move on to more intensive NLP preprocessing steps.
In this chapter, we will discuss what part of speech (POS) tagging is, and the significance of POS in the context of NLP applications. We will learn how to use NLTK to extract meaningful information using tagging, and survey the various taggers used in NLP-intensive applications. Lastly, we will learn how NLTK can be used to tag named entities. We will discuss the various taggers in detail, each with a small snippet to help you get going, along with best practices on where to use which kind of tagger. By the end of this chapter, you will have learned:
You may have first heard the term part of speech (POS) in childhood. It can take a good amount of time to get the hang of what adjectives and adverbs actually are, and what exactly the difference is. Now think about building a system that encodes all this knowledge. It may look easy, but for many decades, coding this knowledge into a machine learning model was a very hard NLP problem. Current state-of-the-art POS tagging algorithms can predict the POS of a given word with a high degree of accuracy (approximately 97 percent), but plenty of research is still going on in the area of POS tagging.
Languages like English have many tagged corpora available in the news and other domains, which has resulted in many state-of-the-art algorithms. Some of these taggers are generic enough to be used across different domains and varieties of text, but in specific use cases a POS tagger might not perform as expected. For those use cases, we might need to build a POS tagger from scratch. To understand the internals of a POS tagger, we need a basic understanding of some machine learning techniques. We will talk about some of these in Chapter 6, Text Classification, but we have to discuss the basics here in order to build a custom POS tagger to fit our needs.
First, we will learn about some of the pre-trained POS taggers available in NLTK; given a set of tokens, these return the POS of each individual word as a tuple. We will then move on to the internal workings of some of these taggers, and we will also talk about building a custom tagger from scratch.
When we talk about POS, the most frequently used notation is the Penn Treebank tagset:
| Tag | Description |
|---|---|
| NNP | Proper noun, singular |
| NNPS | Proper noun, plural |
| PDT | Predeterminer |
| POS | Possessive ending |
| PRP | Personal pronoun |
| PRP$ | Possessive pronoun |
| RB | Adverb |
| RBR | Adverb, comparative |
| RBS | Adverb, superlative |
| RP | Particle |
| SYM | Symbol (mathematical or scientific) |
| TO | to |
| UH | Interjection |
| VB | Verb, base form |
| VBD | Verb, past tense |
| VBG | Verb, gerund/present participle |
| VBN | Verb, past participle |
| WP | Wh-pronoun |
| WP$ | Possessive wh-pronoun |
| WRB | Wh-adverb |
| # | Pound sign |
| $ | Dollar sign |
| . | Sentence-final punctuation |
| , | Comma |
| : | Colon, semi-colon |
| ( | Left bracket character |
| ) | Right bracket character |
| " | Straight double quote |
| ` | Left open single quote |
| `` | Left open double quote |
| ' | Right close single quote |
| '' | Right close double quote |
Looks pretty much like what we learned in primary school English class, right? Now that we have an understanding of what these tags mean, we can run an experiment:
>>>import nltk
>>>from nltk import word_tokenize
>>>s = "I was watching TV"
>>>print nltk.pos_tag(word_tokenize(s))
[('I', 'PRP'), ('was', 'VBD'), ('watching', 'VBG'), ('TV', 'NN')]
If you just want to tag a corpus of news or similar text, the preceding few lines of code are all you need. Here, we tokenize a piece of text and use NLTK's pos_tag method to get a list of (word, pos-tag) tuples. This is one of the pre-trained POS taggers that comes with NLTK.
Internally, it uses a model trained with the maxent classifier (we will discuss these classifiers in later chapters) to predict which tag class a particular word belongs to.
To get more details you can use the following link:
https://github.com/nltk/nltk/blob/develop/nltk/tag/__init__.py
NLTK uses Python's powerful data structures efficiently, so we have a lot of flexibility in how we consume the results of NLTK's outputs.
You must be wondering what a typical use of POS could be in a real application. In typical preprocessing, we might want to look for all the nouns. The following code snippet gives us all the nouns in the given sentence:
>>>tagged = nltk.pos_tag(word_tokenize(s))
>>>allnoun = [word for word, pos in tagged if pos in ['NN', 'NNP']]
Try to answer the following questions:
Another awesome feature of NLTK is that it has wrappers around other pre-trained taggers, such as the Stanford tools. A common way to use the Stanford POS tagger is shown here:
>>>from nltk.tag.stanford import POSTagger
>>>import nltk
>>>stan_tagger = POSTagger('models/english-bidirectional-distsim.tagger', 'stanford-postagger.jar')
>>>tokens = nltk.word_tokenize(s)
>>>stan_tagger.tag(tokens)
To use the preceding code, you need to download the Stanford tagger from http://nlp.stanford.edu/software/stanford-postagger-full-2014-08-27.zip, extract both the jar and the model into a folder, and pass absolute paths as arguments to POSTagger.
Summarizing this, there are mainly two ways to achieve any tagging task in NLTK:

- Using NLTK's (or another library's) pre-trained tagger, and applying it to the test data
- Building or training a tagger on our own tagged data, and then using it on the test data
Let's dig deeper into what goes on inside a typical POS tagger.
A typical tagger uses a large amount of training data, consisting of sentences where each word carries the POS tag attached to it. This tagging is a purely manual effort, and looks like this:
Well/UH what/WP do/VBP you/PRP think/VB about/IN the/DT idea/NN of/IN ,/, uh/UH ,/, kids/NNS having/VBG to/TO do/VB public/JJ service/NN work/NN for/IN a/DT year/NN ?/.Do/VBP you/PRP think/VBP it/PRP 's/BES a/DT ,/,
The preceding sample is taken from the Penn Treebank switchboard corpus. People have done a lot of manual work tagging large corpora. The Linguistic Data Consortium (LDC) hosts resources where people have dedicated enormous amounts of time to tagging text in different languages, of different kinds, and with different kinds of annotation, such as POS, dependency parsing, and discourse (we will talk about these later).
You can get all these resources and more information about them at https://www.ldc.upenn.edu/. (LDC provides a fraction of the data for free, but you can also purchase the entire tagged corpus. NLTK ships with approximately 10 percent of the Penn Treebank.)
If we want to train our own POS tagger, we have to carry out a similar tagging exercise for our specific domain. This kind of tagging requires domain experts.
Typically, tagging problems like POS tagging are seen as sequence labeling problems or classification problems, where people have tried both generative and discriminative models to predict the right tag for a given token.
Instead of jumping directly in to more sophisticated examples, let's start with some simple approaches for tagging.
The following snippet gives us the frequency distribution of POS tags in the Brown corpus:
>>>from nltk.corpus import brown
>>>import nltk
>>>tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
>>>print nltk.FreqDist(tags)
<FreqDist: 'NN': 13162, 'IN': 10616, 'AT': 8893, 'NP': 6866, ',': 5133, 'NNS': 5066, '.': 4452, 'JJ': 4392>
We can see that NN is the most frequent tag, so let's start building a very naive POS tagger by assigning NN as the tag to all the test words. NLTK has a DefaultTagger class that can be used for this. DefaultTagger is part of the family of sequential taggers, which will be discussed next. Its evaluate() method gives the fraction of words whose POS was correctly predicted; we use it to benchmark the tagger against the tagged Brown corpus. In the default_tagger case, we get approximately 13 percent of the predictions correct. We will use the same benchmark for all the taggers moving forward.
>>>brown_tagged_sents = brown.tagged_sents(categories='news')
>>>default_tagger = nltk.DefaultTagger('NN')
>>>print default_tagger.evaluate(brown_tagged_sents)
0.130894842572
Not surprisingly, this tagger performs poorly. DefaultTagger derives from the base class SequentialBackoffTagger, which serves tags based on a sequence. A sequential tagger tries to model the tag based on context, and if it is not able to predict the tag, it consults a backoff tagger. Typically, a DefaultTagger is used as the backoff tagger.
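To make the backoff idea concrete, here is a minimal sketch of how a sequential backoff chain behaves, in plain Python. This is illustrative only: the class names and the `tag_word` method are made up for this sketch and are much simpler than NLTK's actual SequentialBackoffTagger API.

```python
class DefaultTag:
    """Always answers with one fixed tag, like nltk.DefaultTagger."""
    def __init__(self, tag):
        self.tag = tag

    def tag_word(self, word):
        return self.tag


class LookupWithBackoff:
    """Answers from a word->tag table; consults a backoff tagger on a miss."""
    def __init__(self, table, backoff):
        self.table = table
        self.backoff = backoff

    def tag_word(self, word):
        if word in self.table:
            return self.table[word]
        # the tagger could not predict, so it consults the backoff
        return self.backoff.tag_word(word)


tagger = LookupWithBackoff({'the': 'AT', 'watching': 'VBG'}, DefaultTag('NN'))
print([(w, tagger.tag_word(w)) for w in ['the', 'watching', 'TV']])
```

The known words get their table tag, while an unseen word like 'TV' falls through to the default NN, which is exactly the role a DefaultTagger plays at the end of a real backoff chain.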
Let's move on to more sophisticated sequential taggers.
An N-gram tagger is a subclass of SequentialTagger, where the tagger takes the previous n words of context to predict the POS tag for the given token. There are variations of these taggers: UnigramTagger, BigramTagger, and TrigramTagger:
>>>from nltk.tag import UnigramTagger
>>>from nltk.tag import DefaultTagger
>>>from nltk.tag import BigramTagger
>>>from nltk.tag import TrigramTagger
# we are dividing the data into a test and train to evaluate our taggers.
>>>train_data = brown_tagged_sents[:int(len(brown_tagged_sents) * 0.9)]
>>>test_data = brown_tagged_sents[int(len(brown_tagged_sents) * 0.9):]
>>>unigram_tagger = UnigramTagger(train_data, backoff=default_tagger)
>>>print unigram_tagger.evaluate(test_data)
0.826195866853
>>>bigram_tagger = BigramTagger(train_data, backoff=unigram_tagger)
>>>print bigram_tagger.evaluate(test_data)
0.835300351655
>>>trigram_tagger = TrigramTagger(train_data, backoff=bigram_tagger)
>>>print trigram_tagger.evaluate(test_data)
0.83327713281
The unigram tagger just considers the conditional frequency of tags and predicts the most frequent tag for every given token. The bigram_tagger considers the given word together with the previous word's tag as a tuple, and uses that to assign the tag for the test word. The TrigramTagger looks at the previous two words with a similar process.
It's evident that the coverage of the TrigramTagger will be lower, while its precision on the instances it does cover will be high. On the other hand, the UnigramTagger will have better coverage. To deal with this tradeoff between precision and recall, we combine the three taggers in the preceding snippet: the chain first looks for the trigram of the given word sequence to predict the tag; if that is not found, it backs off to the BigramTagger, then to the UnigramTagger, and in the end to the NN tag.
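To see what the unigram step is doing internally, here is a sketch of the core idea using only the standard library: count how often each word carries each tag in training data, then always predict the word's most frequent tag. The tiny training sample and the `unigram_tag` helper are made up for illustration; NLTK's UnigramTagger does this bookkeeping (and more) for you.

```python
from collections import Counter, defaultdict

# A tiny, made-up training sample of (word, tag) sentences.
train = [[('I', 'PRP'), ('was', 'VBD'), ('watching', 'VBG'), ('TV', 'NN')],
         [('I', 'PRP'), ('like', 'VBP'), ('watching', 'VBG'), ('TV', 'NN')]]

# Conditional frequency: for each word, count the tags it appeared with.
counts = defaultdict(Counter)
for sent in train:
    for word, tag in sent:
        counts[word][tag] += 1

# Keep only the single most frequent tag per word.
unigram_model = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def unigram_tag(words, default='NN'):
    # Unseen words back off to the default tag, as with DefaultTagger.
    return [(w, unigram_model.get(w, default)) for w in words]

print(unigram_tag(['I', 'was', 'watching', 'movies']))
```

'movies' never occurred in training, so it falls back to NN; every other word gets the tag it was seen with most often.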
There is one more class of sequential tagger: regular-expression-based taggers. Here, instead of looking for an exact word, we define a regular expression along with the corresponding tag for words matching that expression. For example, in the following code we provide some of the most common regex patterns for the different parts of speech. We know some of the patterns for each POS category: we know the articles of English, and we know that anything ending in -ness is likely a noun and anything ending in -able an adjective. Instead of writing a bunch of regexes in pure Python code, NLTK's RegexpTagger provides an elegant way of building a pattern-based POS tagger. This can also be used to capture domain-specific POS patterns.
>>>from nltk.tag.sequential import RegexpTagger
>>>regexp_tagger = RegexpTagger(
        [(r'^-?[0-9]+(.[0-9]+)?$', 'CD'),   # cardinal numbers
         (r'(The|the|A|a|An|an)$', 'AT'),   # articles
         (r'.*able$', 'JJ'),                # adjectives
         (r'.*ness$', 'NN'),                # nouns formed from adjectives
         (r'.*ly$', 'RB'),                  # adverbs
         (r'.*s$', 'NNS'),                  # plural nouns
         (r'.*ing$', 'VBG'),                # gerunds
         (r'.*ed$', 'VBD'),                 # past tense verbs
         (r'.*', 'NN')                      # nouns (default)
        ])
>>>print regexp_tagger.evaluate(test_data)
0.303627342358
We can see that by using just some of the obvious patterns for POS we are able to reach approximately 30 percent accuracy. If we combine the regex tagger with other taggers in a backoff chain, we can improve performance further. Another use case for the regex tagger is in the preprocessing step: instead of using a raw Python call like re.sub(), we can use this tagger to tag date patterns, money patterns, location patterns, and so on.
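As a sketch of that preprocessing use case, the following code tags date and money tokens with regexes rather than substituting them away. The patterns, the `pretag` helper, and the catch-all 'O' tag are all illustrative choices for this sketch, not part of NLTK.

```python
import re

# Illustrative patterns: each pairs a compiled regex with a tag to assign.
patterns = [
    (re.compile(r'^\d{1,2}/\d{1,2}/\d{2,4}$'), 'DATE'),   # e.g. 12/01/2014
    (re.compile(r'^\$\d+(\.\d+)?$'), 'MONEY'),            # e.g. $200
]

def pretag(tokens):
    out = []
    for tok in tokens:
        for pat, tag in patterns:
            if pat.match(tok):
                out.append((tok, tag))
                break
        else:
            out.append((tok, 'O'))  # 'O' marks tokens no pattern matched
    return out

print(pretag(['Paid', '$200', 'on', '12/01/2014']))
```

Unlike an re.sub() pass, this keeps the original tokens intact while marking the spans of interest, so later pipeline stages can still see the raw text.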
The Brill tagger is a transformation-based tagger: the idea is to start with a guess for the given tag and, in subsequent iterations, go back and fix the errors based on a set of rules the tagger has learned. It's also a supervised way of tagging, but unlike N-gram tagging, where we count N-gram patterns in the training data, here we look for transformation rules.
If the tagger starts with a Unigram/Bigram tagger of acceptable accuracy, then the Brill tagger, instead of looking for trigram tuples, looks for rules based on tags, positions, and the words themselves.

An example rule could be:

Replace NN with VB when the previous word is TO.
After we already have some tags assigned by a UnigramTagger, we can refine them with just one simple rule like this. This is an iterative process: with a few iterations and some more optimized rules, the Brill tagger can outperform some of the N-gram taggers. The only piece of advice is to look out for over-fitting of the tagger to the training set.
You can also look at the following work for more example rules: http://stp.lingfil.uu.se/~bea/publ/megyesi-BrillsPoSTagger.pdf
Until now we have used only pre-trained taggers from NLTK or Stanford. While we have used them in the examples in the previous sections, their internals are still a black box to us. For example, pos_tag internally uses a Maximum Entropy Classifier (MEC), and the StanfordTagger also uses a modified version of maximum entropy; both are discriminative models. There are also Hidden Markov Model (HMM) and Conditional Random Field (CRF) based taggers; HMMs are generative models, while CRFs are discriminative.
Covering all of these topics is beyond the scope of this book. I would highly recommend an NLP course for a deeper understanding of these concepts. We will cover some classification techniques in Chapter 6, Text Classification, but some of these are very advanced NLP topics and will need more attention.
In short, one way to frame the POS tagging problem is as a classification problem: given a word and features such as the previous word, the surrounding context, morphological variation, and so on, we classify the word into a POS category. The other approaches model it as a generative process using similar features. Readers are encouraged to explore some of these topics using the links in the tips.
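To illustrate the classification framing, here is a sketch of the kind of feature dictionary such a tagger might build for each token: the word itself, its neighbors, and simple morphological cues. The `pos_features` function and its feature names are made up for this sketch; a MaxEnt or CRF tagger would be trained on feature vectors much like these.

```python
def pos_features(tokens, i):
    """Build an illustrative feature dict for the token at position i."""
    word = tokens[i]
    return {
        'word': word,
        'prev_word': tokens[i - 1] if i > 0 else '<START>',
        'next_word': tokens[i + 1] if i < len(tokens) - 1 else '<END>',
        'suffix_3': word[-3:],               # e.g. 'ing' hints at VBG
        'is_capitalized': word[0].isupper(), # capitalization hints at NNP
    }

tokens = ['I', 'was', 'watching', 'TV']
print(pos_features(tokens, 2))
```

A discriminative model scores each candidate tag given features like these, whereas a generative model such as an HMM would instead model how likely each tag is to emit the observed word.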