Cleaning and stripping HTML

Cleaning up text is one of the unfortunate but entirely necessary aspects of text processing. When it comes to parsing HTML, you probably don't want to deal with any embedded JavaScript or CSS, and are only interested in the tags and text.

Getting ready

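You'll need to install lxml. See the previous recipe or http://lxml.de/installation.html for installation instructions. Note that from lxml 5.2 onward, the lxml.html.clean module is distributed as a separate package, lxml_html_clean, which you can install with pip install lxml_html_clean (or pip install "lxml[html_clean]").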

How to do it...

We can use the clean_html() function in the lxml.html.clean module to remove unnecessary HTML tags and embedded JavaScript from an HTML string:

>>> import lxml.html.clean
>>> lxml.html.clean.clean_html('<html><head></head><body onload=loadfunc()>my text</body></html>')
'<div><body>my text</body></div>'

The onload handler and the empty head element are gone, leaving only the text and a minimal set of tags, which is much easier to deal with.

How it works...

The lxml.html.clean.clean_html() function parses the HTML string into a tree, then walks the tree and removes every node that should not be kept. It also strips unnecessary attributes (such as embedded JavaScript) from the remaining nodes using regular expression matching and substitution.
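
To make the idea concrete, here is a toy sketch that uses only the standard library's html.parser. It is not lxml's implementation: it filters a stream of parse events instead of walking a tree, and the names ToyCleaner, toy_clean, DROP_TAGS, EVENT_ATTR, and JS_URL are hypothetical, with rules far narrower than lxml's defaults. It only illustrates the two kinds of removal described above: dropping whole elements, and dropping attributes that match a regular expression.

```python
import re
from html import escape
from html.parser import HTMLParser

# Hypothetical rules for this sketch; lxml's real defaults are much broader.
DROP_TAGS = {"script", "style"}                       # removed with their content
EVENT_ATTR = re.compile(r"^on", re.IGNORECASE)        # onload, onclick, ...
JS_URL = re.compile(r"^\s*javascript:", re.IGNORECASE)


class ToyCleaner(HTMLParser):
    """Re-emit HTML while skipping unwanted elements and attributes."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        # Keep only attributes that are neither event handlers nor javascript: URLs.
        kept = [(k, v) for k, v in attrs
                if not EVENT_ATTR.match(k) and not JS_URL.match(v or "")]
        rendered = "".join(
            " " + k if v is None else ' %s="%s"' % (k, escape(v, quote=True))
            for k, v in kept)
        self.out.append("<%s%s>" % (tag, rendered))

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def toy_clean(html):
    parser = ToyCleaner()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


print(toy_clean('<html><head></head><body onload=loadfunc()>my text</body></html>'))
print(toy_clean('<p onclick="x()">hi<script>alert(1)</script> '
                '<a href="javascript:evil()">link</a></p>'))
```

Unlike clean_html(), this sketch does not restructure the document or handle malformed markup, so treat it as an explanation aid rather than a sanitizer.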

There's more...

The lxml.html.clean module defines a Cleaner class, and the module-level clean_html() function is simply the clean_html() method of a default Cleaner instance. You can customize the behavior by creating your own instance with different options and calling its clean_html() method. For more details on this class, see http://lxml.de/lxmlhtml.html#cleaning-up-html.
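
As a minimal sketch of such a customization, the following creates a Cleaner that also drops style elements and unwraps b tags. The options and the sample markup are chosen purely for illustration, and the fallback import assumes that on newer lxml releases the lxml_html_clean package is installed.

```python
# On lxml 5.2+ the cleaner lives in the separate lxml_html_clean package,
# so fall back to it if the classic import path is unavailable.
try:
    from lxml.html.clean import Cleaner
except ImportError:
    from lxml_html_clean import Cleaner

# Options picked for illustration only:
#   style=True         drop <style> elements (not done by default)
#   remove_tags=['b']  drop the <b> tag itself but keep its text
cleaner = Cleaner(style=True, remove_tags=['b'])

html = ('<div><style>p {color: red}</style>'
        '<p onclick="evil()">my <b>text</b></p>'
        '<script>alert(1)</script></div>')

cleaned = cleaner.clean_html(html)
print(cleaned)
```

Because clean_html() returns the same type it was given, passing a string returns a string, which can be handed straight to the next step of a text-processing pipeline.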

See also

The lxml.html module was introduced in the previous recipe for parsing HTML and extracting links. In the next recipe, we'll cover unescaping HTML entities.
