POS is concerned with identifying the types of components found in a sentence. For example, this sentence has several elements, including the verb "has", several nouns such as "example" and "elements", and adjectives such as "several". Tagging, or more specifically POS tagging, is the process of associating element types with words.
POS tagging is useful as it adds more information about the sentence. We can ascertain the relationship between words and often their relative importance. The results of tagging are often used in later processing steps.
This task can be difficult as we are unable to rely upon a simple dictionary of words to determine their type. For example, the word lead can be used as both a noun and a verb. We might use it in either of the following two sentences:

He took the lead in the play.
Lead the way!
POS tagging will attempt to associate the proper label to each word of a sentence.
To illustrate this process, we will be using OpenNLP (https://opennlp.apache.org/). This is an open source Apache project which supports many other NLP processing tasks.
We will be using the POSModel
class, which can be trained to recognize POS elements. In this example, we will use it with a previously trained model based on the Penn TreeBank
tag-set (http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html). Various pretrained models are found at http://opennlp.sourceforge.net/models-1.5/. We will be using the en-pos-maxent.bin
model. This has been trained on English text using what is called maximum entropy.
Maximum entropy refers to the amount of uncertainty in the model, which the approach maximizes. For a given problem, there is a set of probabilities describing what is known about the data set. These probabilities are used to build a model. For example, we may know that there is a 23 percent chance that one specific event will follow a certain condition. We do not want to make any assumptions about the unknown probabilities, so we avoid adding unjustified information. A maximum entropy approach attempts to preserve as much uncertainty as possible; hence, it maximizes entropy.
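To make the idea of entropy concrete, Shannon entropy, H(p) = -Σ p_i ln p_i, measures the uncertainty in a probability distribution; among distributions over the same outcomes, the uniform distribution has the highest entropy. The following standalone sketch (not part of OpenNLP) illustrates this:

```java
// Standalone illustration of Shannon entropy; not OpenNLP code.
// Shows that a uniform distribution carries more uncertainty (entropy)
// than a skewed one over the same four outcomes.
public class EntropyDemo {

    // Shannon entropy H(p) = -sum(p_i * ln(p_i)), measured in nats
    static double entropy(double[] p) {
        double h = 0.0;
        for (double pi : p) {
            if (pi > 0.0) {
                h -= pi * Math.log(pi);
            }
        }
        return h;
    }

    public static void main(String[] args) {
        double[] uniform = {0.25, 0.25, 0.25, 0.25};
        double[] skewed  = {0.70, 0.10, 0.10, 0.10};
        // Uniform entropy is ln(4), about 1.3863 nats; the skewed
        // distribution's entropy is lower.
        System.out.printf("uniform: %.4f%n", entropy(uniform));
        System.out.printf("skewed:  %.4f%n", entropy(skewed));
    }
}
```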
We will also use the POSTaggerME
class, which is a maximum entropy tagger. This is the class that will make tag predictions. With any sentence, there may be more than one way of classifying, or tagging, its components.
We start with code to acquire the previously trained English tagger model and a simple sentence to be tagged:
try (InputStream input = new FileInputStream(
        new File("en-pos-maxent.bin"))) {
    String sentence = "Let's parse this sentence.";
    ...
} catch (IOException ex) {
    // Handle exceptions
}
The tagger uses an array of strings, where each string is a word. The following sequence takes the previous sentence and creates an array called words
. The first part uses the Scanner
class to parse the sentence string. We could have used other code to read the data from a file if needed. After that, the List
class's toArray
method is used to create the array of strings:
List<String> list = new ArrayList<>();
Scanner scanner = new Scanner(sentence);
while (scanner.hasNext()) {
    list.add(scanner.next());
}
String[] words = list.toArray(new String[0]);
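Since Scanner splits on whitespace, the same array can be produced with a simple String.split call. Note that either way punctuation stays attached to its word (for example, "sentence." keeps its period), which can influence the tags the model assigns. A minimal equivalent sketch:

```java
// Standalone whitespace tokenization equivalent to the Scanner loop.
// Note: punctuation remains attached to words (e.g., "sentence.").
public class SimpleTokenizer {

    // Splits a sentence into words on runs of whitespace
    static String[] tokenize(String sentence) {
        return sentence.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String[] words = tokenize("Let's parse this sentence.");
        for (String word : words) {
            System.out.println(word);
        }
    }
}
```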
The model is then built using the file containing the model:
POSModel posModel = new POSModel(input);
The tagger is then created based on the model:
POSTaggerME posTagger = new POSTaggerME(posModel);
The tag
method does the actual work. It is passed an array of words and returns an array of tags. The words and tags are then displayed:
String[] posTags = posTagger.tag(words);
for (int i = 0; i < posTags.length; i++) {
    out.println(words[i] + " - " + posTags[i]);
}
The output for this example follows:
Let's - NNP
parse - NN
this - DT
sentence. - NN
The analysis has determined that the word let's
is a singular proper noun while the words parse
and sentence
are singular nouns. The word this
is a determiner; that is, a word that modifies a noun and helps identify a phrase as general or specific. A list of tags is provided in the next section.
The POS tags returned are abbreviations. A list of Penn TreeBank POS tags can be found at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html. The following is a shortened version of this list:
Tag | Description | Tag | Description
DT | Determiner | RB | Adverb
JJ | Adjective | RBR | Adverb, comparative
JJR | Adjective, comparative | RBS | Adverb, superlative
JJS | Adjective, superlative | RP | Particle
NN | Noun, singular or mass | SYM | Symbol
NNS | Noun, plural | TOP | Top of the parse tree
NNP | Proper noun, singular | VB | Verb, base form
NNPS | Proper noun, plural | VBD | Verb, past tense
POS | Possessive ending | VBG | Verb, gerund or present participle
PRP | Personal pronoun | VBN | Verb, past participle
PRP$ | Possessive pronoun | VBP | Verb, non-3rd person singular present
S | Simple declarative clause | VBZ | Verb, 3rd person singular present
As mentioned earlier, there may be more than one possible set of POS assignments for a sentence. The topKSequences
method, as shown next, will return various assignment possibilities along with a score. The method returns an array of Sequence
objects whose toString
method returns the score and POS list:
Sequence[] sequences = posTagger.topKSequences(words);
for (Sequence sequence : sequences) {
    out.println(sequence);
}
The output for the previous sentence follows, where a higher (less negative) score indicates a more likely sequence; here, the first sequence is the most probable:
-2.3264880694837213 [NNP, NN, DT, NN]
-2.6610271245387853 [NNP, VBD, DT, NN]
-2.6630142638557217 [NNP, VB, DT, NN]
Each line of output assigns possible tags to each word of the sentence. We can see that only the second word, parse
, is determined to have other possible tags.
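These scores look like log probabilities; assuming natural logarithms (an assumption, not something the output itself states), exponentiating them converts the scores into relative probabilities that are easier to compare. A standalone sketch of this interpretation:

```java
// Standalone sketch: interprets topKSequences scores as natural-log
// probabilities. The natural-log assumption is ours, not OpenNLP's.
public class SequenceScores {

    // Converts a log score to a probability: p = e^score
    static double toProbability(double logScore) {
        return Math.exp(logScore);
    }

    public static void main(String[] args) {
        double[] scores = {
            -2.3264880694837213,
            -2.6610271245387853,
            -2.6630142638557217
        };
        // Since e^x is increasing, a higher (less negative) score
        // always corresponds to a higher probability.
        for (double score : scores) {
            System.out.printf("%.4f%n", toProbability(score));
        }
    }
}
```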
Next, we will demonstrate how to extract relationships from text.