Creating POS-tagged corpora

A corpus is a collection of documents. Corpora is the plural of corpus, that is, a collection of several such collections.

Let's see the following code, which creates a data directory inside the home directory and makes sure that NLTK can find it:

>>> import os
>>> import nltk.data
>>> create = os.path.expanduser('~/nltkdoc')
>>> if not os.path.exists(create):
...     os.mkdir(create)
...
>>> os.path.exists(create)
True
>>> # ~/nltkdoc is not on NLTK's default search path, so register it
>>> nltk.data.path.append(create)
>>> create in nltk.data.path
True

This code creates a data directory named ~/nltkdoc inside the home directory. The call to os.path.exists(create) confirms that the directory exists, and the last line checks that the directory is on nltk.data.path, the list of locations that NLTK searches for data. If the last line returns False, NLTK will not look in this directory; add it with nltk.data.path.append(create), after which the check returns True. Within this directory, we can create another directory named nltkcorpora that will hold the whole corpus. The path will be ~/nltkdoc/nltkcorpora. We can also create a subdirectory named important that will hold all the necessary files.

The path will be ~/nltkdoc/nltkcorpora/important.
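The directory layout described above can be sketched with the standard library alone. In this minimal sketch, the helper make_corpus_dir is a hypothetical name, and a temporary directory stands in for ~/nltkdoc (an assumption made so that the example does not touch the real home directory):

```python
import os
import tempfile


def make_corpus_dir(base, *parts):
    """Create base/parts... if it is missing and return the full path."""
    path = os.path.join(base, *parts)
    os.makedirs(path, exist_ok=True)  # also creates intermediate directories
    return path


# A temporary directory stands in for ~/nltkdoc (assumption for this sketch).
base = tempfile.mkdtemp()
important = make_corpus_dir(base, "nltkcorpora", "important")

# Write a one-line sample document into the subdirectory.
doc_path = os.path.join(important, "firstdoc.txt")
with open(doc_path, "w", encoding="utf-8", newline="\n") as handle:
    handle.write("nltk\n")

print(os.path.isdir(important))
```

For NLTK to search this location, the base directory still has to be appended to nltk.data.path.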

Let's see the following code, which loads a text file from this subdirectory:

>>> import nltk.data
>>> nltk.data.load('nltkcorpora/important/firstdoc.txt',format='raw')
'nltk\n'

In the previous code, we passed format='raw' so that nltk.data.load() returns the contents of the file as they are, instead of trying to work out how to interpret a .txt file from its extension.
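The effect of the format argument can be seen by loading the same file twice. The following is a minimal sketch that assumes NLTK 3 on Python 3; it builds its own throwaway data directory (a temporary directory standing in for ~/nltkdoc) so that it does not depend on any existing files:

```python
import os
import tempfile

import nltk.data

# Build a throwaway data directory with the layout nltkcorpora/important.
base = tempfile.mkdtemp()
important = os.path.join(base, "nltkcorpora", "important")
os.makedirs(important)
with open(os.path.join(important, "firstdoc.txt"), "w",
          encoding="utf-8", newline="\n") as handle:
    handle.write("nltk\n")

# Register the directory so that relative resource names can be resolved.
nltk.data.path.append(base)

resource = "nltkcorpora/important/firstdoc.txt"
raw = nltk.data.load(resource, format="raw")    # undecoded file contents
text = nltk.data.load(resource, format="text")  # contents decoded to a string

print(type(raw).__name__, type(text).__name__)
```

Under these assumptions, the raw form is a bytes object, while the text form is an ordinary string.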

There is a word list corpus in NLTK known as the Names corpus. It consists of two files, namely, male.txt and female.txt.

Let's see the code that counts the names in male.txt and female.txt:

>>> import nltk
>>> from nltk.corpus import names
>>> names.fileids()
['female.txt', 'male.txt']
>>> len(names.words('male.txt'))
2943
>>> len(names.words('female.txt'))
5001
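A word list corpus of the same shape can be built from our own files with WordListCorpusReader, which expects one entry per line. The following is a minimal sketch; the file contents are a few made-up sample entries rather than the real Names corpus, and a temporary directory is used as the corpus root:

```python
import os
import tempfile

from nltk.corpus.reader import WordListCorpusReader

# Made-up sample entries, one name per line (not the real Names corpus).
samples = {
    "male.txt": ["Aaron", "Abel"],
    "female.txt": ["Abby", "Ada", "Agnes"],
}

root = tempfile.mkdtemp()
for fileid, entries in samples.items():
    with open(os.path.join(root, fileid), "w", encoding="utf-8") as handle:
        handle.write("\n".join(entries) + "\n")

# The reader treats every non-blank line of each file as one word.
reader = WordListCorpusReader(root, ["female.txt", "male.txt"])
counts = {fileid: len(reader.words(fileid)) for fileid in reader.fileids()}

print(counts)
```

The reader exposes the same fileids() and words() methods that were used with the Names corpus above.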

NLTK also includes a large list of English words. Let's see the code that counts the words present in each of the English word files:

>>> import nltk
>>> from nltk.corpus import words
>>> words.fileids()
['en', 'en-basic']
>>> len(words.words('en'))
235886
>>> len(words.words('en-basic'))
850

Consider the following code, which NLTK uses to define its POS tagging functions on top of the Maxent Treebank POS tagger:

from nltk.data import load

# Pickled Maxent Treebank tagger model, as named in the NLTK releases
# that ship these functions; it is located through nltk.data.path
_POS_TAGGER = 'taggers/maxent_treebank_pos_tagger/english.pickle'

def pos_tag(tok):
    """
    We can use the POS tagger given by NLTK to tag a list of tokens:

        >>> from nltk.tag import pos_tag
        >>> from nltk.tokenize import word_tokenize
        >>> pos_tag(word_tokenize("Papa's favourite hobby is reading."))
        [('Papa', 'NNP'), ("'s", 'POS'), ('favourite', 'JJ'), ('hobby', 'NN'), ('is',
        'VBZ'), ('reading', 'VB'), ('.', '.')]

    :param tok: list of tokens that need to be tagged
    :type tok: list(str)
    :return: The tagged tokens
    :rtype: list(tuple(str, str))
    """
    tagger = load(_POS_TAGGER)
    return tagger.tag(tok)

def batch_pos_tag(sent):
    """
    We can use the POS tagger given by NLTK to tag a list of sentences,
    where each sentence is a list of tokens.
    """
    tagger = load(_POS_TAGGER)
    return tagger.batch_tag(sent)
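Tagged tokens such as those returned by pos_tag can be stored in a plain text file as word/TAG pairs and read back as a small POS-tagged corpus. The following is a minimal sketch using TaggedCorpusReader; the file name sample.pos and the temporary corpus root are assumptions made for illustration, and the tags are copied from the tagger output shown in the docstring above:

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader

root = tempfile.mkdtemp()

# One sentence per line, tokens separated by spaces, each written as word/TAG.
# The file name sample.pos is hypothetical.
with open(os.path.join(root, "sample.pos"), "w", encoding="utf-8") as handle:
    handle.write("Papa/NNP 's/POS favourite/JJ hobby/NN is/VBZ reading/VB ./.\n")

# The second argument is a regular expression selecting the corpus files.
reader = TaggedCorpusReader(root, r".*\.pos")

tagged = list(reader.tagged_words())  # (word, tag) tuples
words = list(reader.words())          # the words without their tags

print(tagged)
```

The reader splits each token at the last "/" by default, so the final token "./." is read back as the pair ('.', '.').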
