Tokenisation

Tokenisation separates a corpus into sentences, words, or other tokens. It is the first step in building an NLP pipeline and prepares raw text for further processing. What counts as a token can vary with the task you are performing or the domain you are working in, so keep an open mind about what you treat as a token!

Know the Code: NLTK is powerful because much of the hard work is already done inside the library. You can read more about NLTK tokenisation at http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.api.TokenizerI.tokenize_sents.
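For instance, NLTK's word tokenizers are classes implementing the TokenizerI interface, so the same object can tokenize a single string with tokenize() or a list of sentences with tokenize_sents(). Here is a minimal sketch using TreebankWordTokenizer (the sample sentences are made up purely for illustration):

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# tokenize() works on a single string ...
print(tokenizer.tokenize("Tokenisation splits text into tokens."))
# ['Tokenisation', 'splits', 'text', 'into', 'tokens', '.']

# ... and tokenize_sents() applies it to each string in a list
print(tokenizer.tokenize_sents(["First sentence.", "Second sentence."]))
# [['First', 'sentence', '.'], ['Second', 'sentence', '.']]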

Let's load a corpus and use the NLTK tokenizers to first split the raw corpus into sentences and then split each sentence into words:

text = u"""
Dealing with textual data is very crucial so to handle these text data we need some
basic text processing steps. Most of the processing steps covered in this section are
commonly used in NLP and involve the combination of several steps into a single
executable flow. This is usually referred to as the NLP pipeline. These flow
can be a combination of tokenization, stemming, word frequency, parts of
speech tagging, etc.
"""

# Sentence tokenization: split the raw text into sentences
sentences = nltk.sent_tokenize(text)

# Word tokenization: split each sentence into word tokens
words = [nltk.word_tokenize(s) for s in sentences]

OUTPUT:
SENTENCES:
[u' Dealing with textual data is very crucial so to handle these text data we need some basic text processing steps.',
u'Most of the processing steps covered in this section are commonly used in NLP and involve the combination of several steps into a single executable flow.',
u'This is usually referred to as the NLP pipeline.',
u'These flow can be a combination of tokenization, stemming, word frequency, parts of speech tagging, etc.']

WORDS:
[[u'Dealing', u'with', u'textual', u'data', u'is', u'very', u'crucial', u'so', u'to', u'handle', u'these', u'text', u'data', u'we', u'need', u'some', u'basic', u'text', u'processing', u'steps', u'.'], [u'Most', u'of', u'the', u'processing', u'steps', u'covered', u'in', u'this', u'section', u'are', u'commonly', u'used', u'in', u'NLP', u'and', u'involve', u'the', u'combination', u'of', u'several', u'steps', u'into', u'a', u'single', u'executable', u'flow', u'.'], [u'This', u'is', u'usually', u'referred', u'to', u'as', u'the', u'NLP', u'pipeline', u'.'], [u'These', u'flow', u'can', u'be', u'a', u'combination', u'of', u'tokenization', u',', u'stemming', u',', u'word', u'frequency', u',', u'parts', u'of', u'speech', u'tagging', u',', u'etc', u'.']]
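If you need a single flat list of word tokens rather than one list per sentence (for example, to compute word frequencies later in the pipeline), you can flatten the nested lists. A small sketch building on the words variable from the example above:

# Flatten the per-sentence token lists into one list of tokens
all_tokens = [token for sentence_tokens in words for token in sentence_tokens]

print(len(all_tokens))   # total number of word tokens in the corpus
print(all_tokens[:5])    # first few tokens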