Tokenisation

Tokenisation separates a corpus into sentences, words, or other tokens. It is the first step in building an NLP pipeline and prepares raw text for further processing. What counts as a token can vary with the task you are performing or the domain you are working in, so keep an open mind about what you treat as a token!

Know the Code: NLTK is powerful because much of the hard work is already done inside the library. You can read more about NLTK tokenisation at http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.api.TokenizerI.tokenize_sents.
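For instance, NLTK's word tokenizers are classes implementing the TokenizerI interface, so the same object can tokenize a single string with tokenize() or a list of sentences with tokenize_sents(). Here is a minimal sketch using TreebankWordTokenizer (the sample sentences are made up purely for illustration):

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# tokenize() works on a single string ...
print(tokenizer.tokenize("Tokenisation splits text into tokens."))
# ['Tokenisation', 'splits', 'text', 'into', 'tokens', '.']

# ... and tokenize_sents() applies it to each string in a list
print(tokenizer.tokenize_sents(["First sentence.", "Second sentence."]))
# [['First', 'sentence', '.'], ['Second', 'sentence', '.']]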

Let's load a corpus and use the NLTK tokenizers to first split the raw corpus into sentences and then split each sentence into words:

text = u"""
Dealing with textual data is very crucial so to handle these text data we need some
basic text processing steps. Most of the processing steps covered in this section are
commonly used in NLP and involve the combination of several steps into a single
executable flow. This is usually referred to as the NLP pipeline. These flow
can be a combination of tokenization, stemming, word frequency, parts of
speech tagging, etc.
"""

# Sentence tokenization: split the raw text into sentences
sentences = nltk.sent_tokenize(text)

# Word tokenization: split each sentence into word tokens
words = [nltk.word_tokenize(s) for s in sentences]

OUTPUT:
SENTENCES:
[u' Dealing with textual data is very crucial so to handle these text data we need some basic text processing steps.',
u'Most of the processing steps covered in this section are commonly used in NLP and involve the combination of several steps into a single executable flow.',
u'This is usually referred to as the NLP pipeline.',
u'These flow can be a combination of tokenization, stemming, word frequency, parts of speech tagging, etc.']

WORDS:
[[u'Dealing', u'with', u'textual', u'data', u'is', u'very', u'crucial', u'so', u'to', u'handle', u'these', u'text', u'data', u'we', u'need', u'some', u'basic', u'text', u'processing', u'steps', u'.'], [u'Most', u'of', u'the', u'processing', u'steps', u'covered', u'in', u'this', u'section', u'are', u'commonly', u'used', u'in', u'NLP', u'and', u'involve', u'the', u'combination', u'of', u'several', u'steps', u'into', u'a', u'single', u'executable', u'flow', u'.'], [u'This', u'is', u'usually', u'referred', u'to', u'as', u'the', u'NLP', u'pipeline', u'.'], [u'These', u'flow', u'can', u'be', u'a', u'combination', u'of', u'tokenization', u',', u'stemming', u',', u'word', u'frequency', u',', u'parts', u'of', u'speech', u'tagging', u',', u'etc', u'.']]
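If you need a single flat list of word tokens rather than one list per sentence (for example, to compute word frequencies later in the pipeline), you can flatten the nested lists. A small sketch building on the words variable from the example above:

# Flatten the per-sentence token lists into one list of tokens
all_tokens = [token for sentence_tokens in words for token in sentence_tokens]

print(len(all_tokens))   # total number of word tokens in the corpus
print(all_tokens[:5])    # first few tokens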