Training a POS tagger

We will now look at training our own POS tagger, using NLTK's tagged corpora and a random forest machine learning (ML) model from sklearn. The complete Jupyter Notebook for this section is available at Chapter02/02_example.ipynb, in the book's code repository. This is a classification task, as we need to predict the POS tag for a given word in a sentence. We will use the NLTK treebank dataset, with its POS tags, as the labeled training data. We will extract word prefixes and suffixes, along with the previous and next words, as features for training. These features are good indicators for categorizing words into different parts of speech. The code that follows shows how we can extract these features:

def sentence_features(st, ix):
    d_ft = {}
    d_ft['word'] = st[ix]
    d_ft['dist_from_first'] = ix - 0
    d_ft['dist_from_last'] = len(st) - ix
    d_ft['capitalized'] = st[ix][0].upper() == st[ix][0]
    d_ft['prefix1'] = st[ix][0]
    d_ft['prefix2'] = st[ix][:2]
    d_ft['prefix3'] = st[ix][:3]
    d_ft['suffix1'] = st[ix][-1]
    d_ft['suffix2'] = st[ix][-2:]
    d_ft['suffix3'] = st[ix][-3:]
    d_ft['prev_word'] = '' if ix == 0 else st[ix - 1]
    d_ft['next_word'] = '' if ix == (len(st) - 1) else st[ix + 1]
    d_ft['numeric'] = st[ix].isdigit()
    return d_ft
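
As a quick sanity check, we can call the function on one word of a short example sentence (this sample sentence is ours, not from the book's Notebook) to see the kind of dictionary it builds:

sample = "The cat sat on the mat".split()
print(sentence_features(sample, 1))
# {'word': 'cat', 'dist_from_first': 1, 'dist_from_last': 5, 'capitalized': False,
#  'prefix1': 'c', 'prefix2': 'ca', 'prefix3': 'cat', 'suffix1': 't', 'suffix2': 'at',
#  'suffix3': 'cat', 'prev_word': 'The', 'next_word': 'sat', 'numeric': False}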

The function sentence_features() converts the text input into a dictionary of features, d_ft. Each sentence, which is a Python list, is passed in along with the corresponding index of the current word, for which the feature is to be extracted. This index, ix, is used to obtain the neighboring word features, as well as the prefixes/suffixes. Later in the example, we will look at the importance of these features after training. We will now use the treebank tagged sentences, with the universal tags that we explained in the previous section, as the labeled or training data:

tagged_sentences = nltk.corpus.treebank.tagged_sents(tagset='universal')
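
Each item in tagged_sentences is one sentence, represented as a list of (word, universal tag) tuples; printing a slice of the first sentence is a quick way to inspect this structure:

# First five (word, tag) pairs of the first treebank sentence
print(tagged_sentences[0][:5])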

We used the universal tags for simplicity, as specified in the tagset named argument passed to the tagged_sents function. Instead of the universal tags, we could also utilize the fine-grained treebank POS tags, which would result in a large number of labels. We will now extract the features for each tagged sentence in the corpus, along with the training labels. The features are stored in the X variable, and the POS tags, or labels, are stored in the y variable. We will use the following code to extract the features:

def ext_ft(tg_sent):
    sent, tag = [], []
    for tg in tg_sent:
        for index in range(len(tg)):
            sent.append(sentence_features(get_untagged_sentence(tg), index))
            tag.append(tg[index][1])
    return sent, tag

X, y = ext_ft(tagged_sentences)
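
The get_untagged_sentence() helper called inside ext_ft() is defined in the book's Notebook; its job is to strip the tags from a tagged sentence so that sentence_features() receives a plain list of words. A minimal sketch of what it presumably looks like is:

def get_untagged_sentence(tagged_sentence):
    # Drop the POS tags and keep only the words
    return [word for word, tag in tagged_sentence]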

In sklearn, we utilize DictVectorizer to convert the feature-value dictionary to training vectors or instances. It should be noted that, for values that are strings, DictVectorizer transforms them to a one-hot encoded vector. For example, if the number of possible values for the suffix3 feature is 50, then there will be 50 features in the output. We will use the following code to apply DictVectorizer:

n_sample = 50000
dict_vectorizer = DictVectorizer(sparse=False)
X_transformed = dict_vectorizer.fit_transform(X[0:n_sample])
y_sampled = y[0:n_sample]
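
To see the one-hot behavior described above in isolation, here is a small standalone illustration with toy dictionaries (not the treebank features); string values become key=value indicator columns, while Boolean and numeric values are passed through as numbers:

from sklearn.feature_extraction import DictVectorizer

toy = [{'suffix3': 'ing', 'capitalized': False},
       {'suffix3': 'hed', 'capitalized': True}]
dv = DictVectorizer(sparse=False)
print(dv.fit_transform(toy))   # [[0. 0. 1.], [1. 1. 0.]]
print(dv.feature_names_)       # ['capitalized', 'suffix3=hed', 'suffix3=ing']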

A sample of the first 50,000 word-level training instances (not sentences) was used, to speed up the training. These instances are further split into an 80% training set and a 20% test set (refer to the Notebook; a sketch of the split follows the classifier code below). An ensemble classifier, using RandomForestClassifier from sklearn, is utilized as the POS tagger model, as shown in the following code:

rf = RandomForestClassifier(n_jobs=4)
rf.fit(X_train,y_train)
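
The X_train, X_test, y_train, and y_test variables come from the 80/20 split mentioned above. The exact call is in the Notebook; a minimal sketch, assuming sklearn's train_test_split (the random_state value here is an arbitrary placeholder), would be:

from sklearn.model_selection import train_test_split

# Hold out 20% of the sampled, vectorized instances for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X_transformed, y_sampled, test_size=0.2, random_state=123)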

After training, we can verify the POS tagger with an example sentence. Before passing it to the predict() function, we will extract the features by using the same function (sentence_features()) that we used for the NLTK labeled data, as shown in the following code:

def predict_pos_tags(sentence):
    features = [sentence_features(sentence, index) for index in range(len(sentence))]
    features = dict_vectorizer.transform(features)
    tags = rf.predict(features)
    return zip(sentence, tags)

We converted the sentence variable, which is a list of words, into its corresponding features using the sentence_features() function. The list of feature dictionaries returned by this function is vectorized using the previously fitted dict_vectorizer:

test_sentence = "This is a simple POS tagger"
for tagged in predict_pos_tags(test_sentence.split()):
    print(tagged)

We pass the test sentence, as a list of words, to the predict_pos_tags function. This will output the tags for each of the words in the sentence. The output that follows shows the POS tags of the sample sentence:

('This', 'DET')
('is', 'VERB')
('a', 'DET')
('simple', 'ADJ')
('POS', 'NOUN')
('tagger', 'NOUN')

The output looks reasonable, as it can identify determiners, verbs, adjectives, and nouns in the sentence. To evaluate the accuracy, we can predict the POS tags for the test data, using the following code:

predictions = rf.predict(X_test)
accuracy_score(y_test,predictions)

Output
0.94520000000000004
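
As an optional extra check (not part of the book's code), sklearn's classification_report can give a per-tag precision, recall, and F1 breakdown that complements the single accuracy number and the confusion matrix below:

from sklearn.metrics import classification_report

# Precision, recall, and F1 score for each universal POS tag
print(classification_report(y_test, predictions))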

The accuracy (of around 94%) looks reasonable. We will also look at the confusion matrix, to observe how well the tagger performs for each of the POS tags. We will utilize the confusion_matrix function from sklearn, as shown in the following code:

conf_matrix = confusion_matrix(y_test,predictions)
plt.figure(figsize=(10,10))
plt.xticks(np.arange(len(rf.classes_)),rf.classes_)
plt.yticks(np.arange(len(rf.classes_)),rf.classes_)
plt.imshow(conf_matrix,cmap=plt.cm.Blues)
plt.colorbar()

In the code for plotting the confusion matrix, we used the classes from the random forest classifier as the x and y tick labels. These labels are the POS tags in the data that we used for training. The plot that follows shows the confusion matrix as a heatmap:

It looks like the tagger performs relatively well for nouns, verbs, and determiners in sentences, which can be seen in the dark regions in the plot. We will now look at the top features of the model, with the help of the following code:

# Note: in scikit-learn 1.0+, use get_feature_names_out() instead of get_feature_names()
feature_list = zip(dict_vectorizer.get_feature_names(), rf.feature_importances_)
sorted_features = sorted(feature_list, key=lambda x: x[1], reverse=True)
print(sorted_features[0:20])

The random forest's feature importances are stored in its feature_importances_ attribute. We paired each importance with its feature name, sorted the pairs in descending order of importance, and printed the top 20, as shown in the preceding code. The output follows:

Output:

[('next_word=', 0.020920214730751722), ('capitalized', 0.01772036411509819), ('prefix1=,', 0.017100349286406635), ('suffix1=,', 0.013300188138108692), ('suffix2=ed', 0.012324641839199037), ('prefix1=*', 0.01184006667636649), ('suffix2=he', 0.010212280707210959), ('prefix2=th', 0.01012750927310713), ('suffix2=to', 0.010110760622078928), ('prefix3=the', 0.0094462675592230805), ('dist_from_first', 0.0093968467476374141), ('suffix1=f', 0.0092678798994399649), ('word=the', 0.0091584437614083847), ('dist_from_last', 0.0087969654754903419), ('prefix2=to', 0.0086095477647111125), ('suffix1=d', 0.0082316431932524976), ('word=a', 0.0077318882551946199), ('prefix2=an', 0.0074132280379715434), ('suffix1=s', 0.0067561700034315057), ('word=and', 0.0065749584774608179)]

You can see that some of the suffix features receive higher importance scores. For example, words ending in ed are usually past-tense verbs. We also find that some punctuation, such as commas, influences the tagging. Although POS tagging is itself a type of text classification, we will now move on to the next common NLP task, which is sentiment classification.
