Chapter 9. Parsing Specific Data Types

In this chapter, we will cover the following recipes:

  • Parsing dates and times with dateutil
  • Timezone lookup and conversion
  • Extracting URLs from HTML with lxml
  • Cleaning and stripping HTML
  • Converting HTML entities with BeautifulSoup
  • Detecting and converting character encodings

Introduction

This chapter covers parsing specific kinds of data, focusing primarily on dates, times, and HTML. Luckily, there are a number of useful libraries to accomplish this, so we don't have to delve into tricky and overly complicated regular expressions. These libraries can be great complements to NLTK:

  • dateutil provides datetime parsing and timezone conversion
  • lxml and BeautifulSoup can parse, clean, and convert HTML
  • charade and UnicodeDammit can detect and convert text character encoding
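For instance, dateutil can turn a free-form date string into a datetime object with a single call. Here is a minimal sketch; the date string is just an illustration:

```python
from dateutil import parser

# parser.parse() handles many common date/time formats automatically,
# without requiring an explicit format string
dt = parser.parse("Tuesday, August 26, 2014 10:30 AM")
print(dt)  # 2014-08-26 10:30:00
```

The timezone lookup and conversion recipe later in this chapter builds on the same parser.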

These libraries can be useful for preprocessing text before passing it to an NLTK object, or postprocessing text that has been processed and extracted using NLTK. Coming up is an example that ties many of these tools together.

Let's say you need to parse a blog article about a restaurant. You can use lxml or BeautifulSoup to extract the article text, outbound links, and the date and time when the article was written. The date and time can then be parsed into a Python datetime object with dateutil. Once you have the article text, you can use charade to ensure it is UTF-8 encoded before cleaning out the HTML and running it through NLTK-based part-of-speech tagging, chunk extraction, and/or text classification to create additional metadata about the article. Real-world text processing often requires more than just NLTK-based natural language processing, and the functionality covered in this chapter can help with those additional requirements.
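As a sketch of that pipeline, the snippet below pulls the article text, outbound links, and timestamp out of a small hand-written HTML fragment. The markup and URL are made up for illustration, and the charade check and NLTK steps are omitted here; the recipes in this chapter cover each piece in detail:

```python
import lxml.html
from dateutil import parser as dateparser

# A made-up blog article fragment, standing in for a downloaded page
html = b"""<html><body><article>
  <time>August 26, 2014 10:30 AM</time>
  <p>Great food at <a href="http://example.com/menu">this restaurant</a>.</p>
</article></body></html>"""

doc = lxml.html.fromstring(html)
published = dateparser.parse(doc.findtext('.//time'))  # datetime object
links = doc.xpath('//a/@href')                         # outbound links
text = doc.text_content()                              # raw article text
print(published, links)
```

From here, `text` could be cleaned and passed to NLTK taggers, chunkers, or classifiers, and `published` and `links` stored as metadata.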
