POS is concerned with identifying the types of components found in a sentence. For example, this sentence has several elements, including the verb "has", several nouns such as "example" and "elements", and adjectives such as "several". Tagging, or more specifically POS tagging, is the process of associating element types with words.
POS tagging is useful as it adds more information about the sentence. We can ascertain the relationship between words and often their relative importance. The results of tagging are often used in later processing steps.
This task can be difficult as we are unable to rely upon a simple dictionary of words to determine their type. For example, the word lead can be used as both a noun and a verb. We might use it in either of the following two sentences:

He took the lead in the play.
Lead the way!
POS tagging will attempt to associate the proper label to each word of a sentence.
To illustrate this process, we will be using OpenNLP (https://opennlp.apache.org/). This is an open source Apache project which supports many other NLP processing tasks.
We will be using the POSModel
class, which can be trained to recognize POS elements. In this example, we will use it with a previously trained model based on the Penn TreeBank
tag-set (http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html). Various pretrained models are found at http://opennlp.sourceforge.net/models-1.5/. We will be using the en-pos-maxent.bin
model. This has been trained on English text using what is called maximum entropy.
Maximum entropy refers to the amount of uncertainty in the model, which the approach maximizes. For a given problem, there is a set of probabilities describing what is known about the data set. These probabilities are used to build a model. For example, we may know that there is a 23 percent chance that one specific event will follow a certain condition. We do not want to make any assumptions about the unknown probabilities, so we avoid adding unjustified information. A maximum entropy approach attempts to preserve as much uncertainty as possible; hence, it maximizes entropy.
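To make the idea of entropy concrete, Shannon entropy, H(p) = -Σ p_i ln p_i, measures the uncertainty in a probability distribution; among distributions over the same outcomes, the uniform distribution has the highest entropy. The following standalone sketch (not part of OpenNLP) illustrates this:

```java
// Standalone illustration of Shannon entropy; not OpenNLP code.
// Shows that a uniform distribution carries more uncertainty (entropy)
// than a skewed one over the same four outcomes.
public class EntropyDemo {

    // Shannon entropy H(p) = -sum(p_i * ln(p_i)), measured in nats
    static double entropy(double[] p) {
        double h = 0.0;
        for (double pi : p) {
            if (pi > 0.0) {
                h -= pi * Math.log(pi);
            }
        }
        return h;
    }

    public static void main(String[] args) {
        double[] uniform = {0.25, 0.25, 0.25, 0.25};
        double[] skewed  = {0.70, 0.10, 0.10, 0.10};
        // Uniform entropy is ln(4), about 1.3863 nats; the skewed
        // distribution's entropy is lower.
        System.out.printf("uniform: %.4f%n", entropy(uniform));
        System.out.printf("skewed:  %.4f%n", entropy(skewed));
    }
}
```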
We will also use the POSTaggerME
class, which is a maximum entropy tagger. This is the class that will make tag predictions. With any sentence, there may be more than one way of classifying, or tagging, its components.
We start with code to acquire the previously trained English tagger model and a simple sentence to be tagged:
try (InputStream input = new FileInputStream(
        new File("en-pos-maxent.bin"))) {
    String sentence = "Let's parse this sentence.";
    ...
} catch (IOException ex) {
    // Handle exceptions
}
The tagger uses an array of strings, where each string is a word. The following sequence takes the previous sentence and creates an array called words
. The first part uses the Scanner
class to parse the sentence string. We could have used other code to read the data from a file if needed. After that, the List
class's toArray
method is used to create the array of strings:
List<String> list = new ArrayList<>();
Scanner scanner = new Scanner(sentence);
while (scanner.hasNext()) {
    list.add(scanner.next());
}
String[] words = list.toArray(new String[0]);
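Since Scanner splits on whitespace, the same array can be produced with a simple String.split call. Note that either way punctuation stays attached to its word (for example, "sentence." keeps its period), which can influence the tags the model assigns. A minimal equivalent sketch:

```java
// Standalone whitespace tokenization equivalent to the Scanner loop.
// Note: punctuation remains attached to words (e.g., "sentence.").
public class SimpleTokenizer {

    // Splits a sentence into words on runs of whitespace
    static String[] tokenize(String sentence) {
        return sentence.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String[] words = tokenize("Let's parse this sentence.");
        for (String word : words) {
            System.out.println(word);
        }
    }
}
```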
The model is then built using the file containing the model:
POSModel posModel = new POSModel(input);
The tagger is then created based on the model:
POSTaggerME posTagger = new POSTaggerME(posModel);
The tag
method does the actual work. It is passed an array of words and returns an array of tags. The words and tags are then displayed:
String[] posTags = posTagger.tag(words);
for (int i = 0; i < posTags.length; i++) {
    out.println(words[i] + " - " + posTags[i]);
}
The output for this example follows:
Let's - NNP
parse - NN
this - DT
sentence. - NN
The analysis has determined that the word let's
is a singular proper noun while the words parse
and sentence
are singular nouns. The word this
is a determiner; that is, a word that modifies a noun and helps identify a phrase as general or specific. A list of tags is provided in the next section.
The POS tags returned are abbreviations. A list of Penn TreeBank POS tags can be found at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html. The following is a shortened version of this list:
Tag | Description | Tag | Description
DT | Determiner | RB | Adverb
JJ | Adjective | RBR | Adverb, comparative
JJR | Adjective, comparative | RBS | Adverb, superlative
JJS | Adjective, superlative | RP | Particle
NN | Noun, singular or mass | SYM | Symbol
NNS | Noun, plural | TOP | Top of the parse tree
NNP | Proper noun, singular | VB | Verb, base form
NNPS | Proper noun, plural | VBD | Verb, past tense
POS | Possessive ending | VBG | Verb, gerund or present participle
PRP | Personal pronoun | VBN | Verb, past participle
PRP$ | Possessive pronoun | VBP | Verb, non-3rd person singular present
S | Simple declarative clause | VBZ | Verb, 3rd person singular present
As mentioned earlier, there may be more than one possible set of POS assignments for a sentence. The topKSequences
method, as shown next, will return various assignment possibilities along with a score. The method returns an array of Sequence
objects whose toString
method returns the score and POS list:
Sequence[] sequences = posTagger.topKSequences(words);
for (Sequence sequence : sequences) {
    out.println(sequence);
}
The output for the previous sentence follows, where a higher (less negative) score indicates a more likely sequence; here, the first sequence is the most probable:
-2.3264880694837213 [NNP, NN, DT, NN]
-2.6610271245387853 [NNP, VBD, DT, NN]
-2.6630142638557217 [NNP, VB, DT, NN]
Each line of output assigns possible tags to each word of the sentence. We can see that only the second word, parse
, is determined to have other possible tags.
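These scores look like log probabilities; assuming natural logarithms (an assumption, not something the output itself states), exponentiating them converts the scores into relative probabilities that are easier to compare. A standalone sketch of this interpretation:

```java
// Standalone sketch: interprets topKSequences scores as natural-log
// probabilities. The natural-log assumption is ours, not OpenNLP's.
public class SequenceScores {

    // Converts a log score to a probability: p = e^score
    static double toProbability(double logScore) {
        return Math.exp(logScore);
    }

    public static void main(String[] args) {
        double[] scores = {
            -2.3264880694837213,
            -2.6610271245387853,
            -2.6630142638557217
        };
        // Since e^x is increasing, a higher (less negative) score
        // always corresponds to a higher probability.
        for (double score : scores) {
            System.out.printf("%.4f%n", toProbability(score));
        }
    }
}
```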
Next, we will demonstrate how to extract relationships from text.