Chapter 1. Tokenizing Text and WordNet Basics

In this chapter, we will cover the following recipes:

  • Tokenizing text into sentences
  • Tokenizing sentences into words
  • Tokenizing sentences using regular expressions
  • Training a sentence tokenizer
  • Filtering stopwords in a tokenized sentence
  • Looking up Synsets for a word in WordNet
  • Looking up lemmas and synonyms in WordNet
  • Calculating WordNet Synset similarity
  • Discovering word collocations


The Natural Language Toolkit (NLTK) is a comprehensive Python library for natural language processing and text analytics. Originally designed for teaching, it has since been adopted in industry for research and development because of its usefulness and breadth of coverage. NLTK is often used for rapid prototyping of text-processing programs and can even be used in production applications. Demos of select NLTK functionality and production-ready APIs are available at

This chapter covers the basics of tokenizing text and using WordNet. Tokenization is the process of breaking a piece of text into smaller pieces, such as sentences and words, and is an essential first step for the recipes in later chapters. WordNet is a dictionary designed for programmatic access by natural language processing systems. It has many different use cases, including:

  • Looking up the definition of a word
  • Finding synonyms and antonyms
  • Exploring word relations and similarity
  • Word sense disambiguation for words that have multiple uses and definitions
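
As a rough preview of the first two use cases, the sketch below looks up the senses of a word along with their definitions and lemma names (synonyms). It assumes NLTK is installed and the WordNet corpus has been fetched with nltk.download('wordnet'); the helper name describe_word is hypothetical and chosen only for this illustration.

```python
def describe_word(word):
    """Return (synset name, definition, lemma names) for each sense of `word`."""
    # Imported inside the function so the helper can be defined
    # even on a machine where NLTK is not installed yet.
    from nltk.corpus import wordnet

    return [
        (syn.name(), syn.definition(), [lemma.name() for lemma in syn.lemmas()])
        for syn in wordnet.synsets(word)
    ]


if __name__ == "__main__":
    try:
        for name, definition, lemma_names in describe_word("cookbook"):
            print(name, "-", definition)
            print("  lemmas:", ", ".join(lemma_names))
    except (ImportError, LookupError):
        # ImportError: NLTK is missing; LookupError: the WordNet data is missing.
        print("Install NLTK and run nltk.download('wordnet') first.")
```

Each tuple corresponds to one sense of the word, so a word with several meanings yields several entries. The WordNet recipes later in this chapter cover these lookups in detail.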

NLTK includes a WordNet corpus reader, which we will use to access and explore WordNet. A corpus is just a body of text, and corpus readers are designed to make accessing a corpus much easier than direct file access. We'll be using WordNet again in later chapters, so it's important to familiarize yourself with the basics first.
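
To make the idea of tokenization from this introduction concrete before the recipes begin, here is a deliberately naive sketch that uses only Python's standard re module. It is not how NLTK's tokenizers work; the function names split_sentences and split_words and the regular expressions are assumptions made purely for illustration.

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence):
    """Return words (keeping internal apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)


text = "Hello World. It's good to see you. Thanks for buying this book."
sentences = split_sentences(text)
words = [split_words(s) for s in sentences]

for sentence, tokens in zip(sentences, words):
    print(sentence, "->", tokens)
```

A splitter this simple breaks on abbreviations such as "Dr. Smith", which is one reason the recipes in this chapter rely on NLTK's tokenizers instead, including ones that can be trained on your own text.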

