Cleaning and stripping HTML

Cleaning up text is one of the unfortunate but entirely necessary aspects of text processing. When it comes to parsing HTML, you probably don't want to deal with any embedded JavaScript or CSS, and are only interested in the tags and text.

Getting ready

You'll need to install lxml. See the previous recipe or http://lxml.de/installation.html for installation instructions.

How to do it...

We can use the clean_html() function in the lxml.html.clean module to remove unnecessary HTML tags and embedded JavaScript from an HTML string:

>>> import lxml.html.clean
>>> lxml.html.clean.clean_html('<html><head></head><body onload=loadfunc()>my text</body></html>')
'<div><body>my text</body></div>'

The result is much cleaner and easier to deal with.

How it works...

The lxml.html.clean.clean_html() function parses the HTML string into a tree, then walks the tree and removes every node that should not be kept. It also strips unwanted attributes (such as embedded JavaScript event handlers) from the remaining nodes using regular expression matching and substitution.
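To make the two steps concrete (dropping unwanted elements, then stripping unwanted attributes), here is a minimal sketch that uses only the standard library's html.parser. It is not lxml's implementation: it is event-based rather than tree-based, and the names SketchCleaner, sketch_clean, DROP_TAGS, and EVENT_ATTR are hypothetical, chosen for this illustration only.

```python
import re
from html import escape
from html.parser import HTMLParser

# Hypothetical rules for this sketch; lxml's real rules are more extensive.
DROP_TAGS = {"script", "style"}
EVENT_ATTR = re.compile(r"^on\w+$", re.IGNORECASE)  # onload, onclick, ...


class SketchCleaner(HTMLParser):
    """Re-emit HTML, skipping DROP_TAGS and event-handler attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        # Keep only attributes whose name does not look like an event handler.
        kept = [(k, v) for k, v in attrs if not EVENT_ATTR.match(k)]
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v)}"' for k, v in kept
        )
        self.parts.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(escape(data, quote=False))


def sketch_clean(html):
    parser = SketchCleaner()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


print(sketch_clean('<html><head></head><body onload=loadfunc()>my text</body></html>'))
# -> <html><head></head><body>my text</body></html>
```

Unlike clean_html(), this sketch does not repair broken markup or handle void elements such as <br> specially, so it is only useful for seeing the idea in isolation.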

There's more...

The lxml.html.clean module defines a Cleaner class, and the module-level clean_html() function uses a default instance of it. You can customize the cleaning behavior by creating your own Cleaner instance and calling its clean_html() method. For more details on this class, see http://lxml.de/lxmlhtml.html#cleaning-up-html.
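As a rough illustration of that customization, the sketch below builds a Cleaner with a few of its documented keyword options and runs it on a small string. The option values and the sample HTML are assumptions made for this example. On recent lxml releases, the lxml.html.clean module may also require the separate lxml_html_clean package to be installed.

```python
from lxml.html.clean import Cleaner

# Illustrative option choices; adjust them to what you want to keep.
cleaner = Cleaner(
    scripts=True,     # remove <script> elements
    javascript=True,  # remove embedded JavaScript such as on* attributes
    style=True,       # remove <style> elements (and, by default, style attributes)
)

dirty = (
    '<html><head><style>p {color: red}</style></head>'
    '<body onload="loadfunc()"><p style="color: red">my text</p></body></html>'
)
cleaned = cleaner.clean_html(dirty)
print(cleaned)
```

The exact wrapper tags in the output can vary between lxml versions, so check the printed result rather than relying on a fixed string.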
