POS tagging is also referred to as word category disambiguation or grammatical tagging. POS taggers fall into two broad types: rule-based and stochastic (probabilistic). E. Brill's tagger is an example of the rule-based tagging approach.
A POS classifier takes a document as input and extracts word features. It trains itself on these word features combined with the already available training labels. This type of classifier is referred to as a second-order classifier, and it makes use of a bootstrap classifier in order to generate the tags for words.
A backoff classifier is one in which a backoff procedure is performed: the trigram POS tagger falls back on the bigram POS tagger, which in turn falls back on the unigram POS tagger. In other words, the most specific model is tried first, and a more general model is consulted only when the specific one cannot produce a tag.
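The backoff chain described above can be sketched in pure Python. This is a minimal illustration, not NLTK's implementation: the lookup tables, the `<s>` boundary marker, and the `NN` default are assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_ngram_taggers(tagged_sents):
    """Build simple trigram, bigram, and unigram lookup tables
    from a list of tagged sentences."""
    uni, bi, tri = defaultdict(Counter), defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sents:
        prev2 = prev = "<s>"                      # sentence-boundary marker
        for word, tag in sent:
            uni[word][tag] += 1
            bi[(prev, word)][tag] += 1
            tri[(prev2, prev, word)][tag] += 1
            prev2, prev = prev, tag
    return tri, bi, uni

def backoff_tag(sent, tri, bi, uni, default="NN"):
    """Tag a sentence, backing off trigram -> bigram -> unigram -> default."""
    tags, prev2, prev = [], "<s>", "<s>"
    for word in sent:
        for table, key in ((tri, (prev2, prev, word)),
                           (bi, (prev, word)),
                           (uni, word)):
            if key in table:                      # most specific table that knows the word wins
                tag = table[key].most_common(1)[0][0]
                break
        else:
            tag = default                         # nothing matched: fall back to the default tag
        tags.append((word, tag))
        prev2, prev = prev, tag
    return tags
```

In NLTK itself, the same effect is obtained by passing a `backoff` tagger when constructing `TrigramTagger`, `BigramTagger`, and `UnigramTagger`.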
While training a POS classifier, a feature set is generated. This feature set may include, for example, the word itself, its prefixes and suffixes, and the words or tags in its immediate context.
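As a sketch of such a feature set, here is a minimal feature extractor; the particular feature names and choices are illustrative assumptions, not taken from the original:

```python
def pos_features(sentence, i):
    """Extract illustrative features for the word at position i
    of a tokenized sentence."""
    word = sentence[i]
    return {
        "word": word,                 # the word itself
        "suffix(1)": word[-1:],       # last character
        "suffix(2)": word[-2:],       # last two characters
        "suffix(3)": word[-3:],       # last three characters
        "prev-word": "<START>" if i == 0 else sentence[i - 1],
    }
```

Suffix features are useful because English inflections such as *-ing* or *-ed* are strong cues for the POS tag.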
In NLTK, FastBrillTaggerTrainer builds a Brill tagger on top of a unigram tagger. The unigram tagger makes use of a dictionary of already known words and their POS tag information.
Let's see the code for FastBrillTaggerTrainer used in NLTK:
from nltk.tag import UnigramTagger
from nltk.tag import FastBrillTaggerTrainer
from nltk.tag.brill import SymmetricProximateTokensTemplate
from nltk.tag.brill import ProximateTokensTemplate
from nltk.tag.brill import ProximateTagsRule
from nltk.tag.brill import ProximateWordsRule

# Context = surrounding words and tags.
ctx = [
    SymmetricProximateTokensTemplate(ProximateTagsRule, (1, 1)),
    SymmetricProximateTokensTemplate(ProximateTagsRule, (1, 2)),
    SymmetricProximateTokensTemplate(ProximateTagsRule, (1, 3)),
    SymmetricProximateTokensTemplate(ProximateTagsRule, (2, 2)),
    SymmetricProximateTokensTemplate(ProximateWordsRule, (0, 0)),
    SymmetricProximateTokensTemplate(ProximateWordsRule, (1, 1)),
    SymmetricProximateTokensTemplate(ProximateWordsRule, (1, 2)),
    ProximateTokensTemplate(ProximateTagsRule, (-1, -1), (1, 1)),
]

# 'sentences' is a tagged training corpus, for example treebank.tagged_sents().
tagger = UnigramTagger(sentences)
trainer = FastBrillTaggerTrainer(tagger, ctx, trace=0)
tagger = trainer.train(sentences, max_rules=100)
Classification may be defined as the process of deciding on a POS tag for a given input.
In supervised classification, the training corpus comprises words paired with their correct tags. In unsupervised classification, no such word–tag pairs are available:
In supervised classification, during training, a feature extractor accepts the input and its label and generates a set of features. This feature set, along with the label, acts as input to a machine learning algorithm. During the testing or prediction phase, the same feature extractor generates features from unknown inputs, and these are passed to the classifier model, which produces an output in the form of a label, that is, the POS tag.
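The train/predict pipeline described above can be illustrated with a deliberately tiny sketch. This is not NLTK's classifier: the single suffix feature and the `NN` default are assumptions chosen to keep the example self-contained.

```python
from collections import Counter, defaultdict

def extract_features(word):
    # Toy feature: the last two characters of the word (an illustrative assumption).
    return word[-2:]

def train(tagged_words):
    """Training phase: run the feature extractor over (word, label) pairs
    and learn the most frequent tag for each feature value."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[extract_features(word)][tag] += 1
    return {feat: c.most_common(1)[0][0] for feat, c in counts.items()}

def predict(model, word, default="NN"):
    """Prediction phase: extract features from an unseen word
    and look up the learned label."""
    return model.get(extract_features(word), default)
```

Even this crude model generalizes: having seen *running* and *jumping* tagged `VBG`, it predicts `VBG` for the unseen word *walking* because they share the learned suffix feature.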
The maximum entropy classifier is one that searches the parameter set in order to maximize the total likelihood of the corpus used for training.
It may be defined as follows: