Converting HTML entities with BeautifulSoup

HTML entities are strings such as "&amp;" or "&lt;". These are encodings of normal ASCII characters that have special uses in HTML. For example, "&lt;" is the entity for "<", but you can't just have "<" within HTML tags because it is the beginning character for an HTML tag, hence the need to escape it and define the "&lt;" entity. "&amp;" is the entity code for "&", which, as we've just seen, is the beginning character for an entity code. If you need to process the text within an HTML document, then you'll want to convert these entities back to their normal characters so you can recognize them and handle them appropriately.

Getting ready

You'll need to install BeautifulSoup, which you should be able to do with sudo pip install beautifulsoup4 or sudo easy_install beautifulsoup4. You can read more about BeautifulSoup at http://www.crummy.com/software/BeautifulSoup/.

How to do it...

BeautifulSoup is an HTML parser library that can also be used for entity conversion. It's quite simple: create an instance of BeautifulSoup given a string containing HTML entities, then get the string attribute. Naming the parser explicitly ('html.parser' here) avoids the warning that bs4 emits when it has to guess one:

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('&lt;', 'html.parser').string
'<'
>>> BeautifulSoup('&amp;', 'html.parser').string
'&'

However, the reverse is not true: BeautifulSoup will not turn a bare "<" back into an entity. If you try BeautifulSoup('<').string, you may get a None result, depending on the parser, because a lone "<" is not valid HTML.
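To go the other way and turn special characters into entities, one option outside BeautifulSoup is the standard library's html module. The following is a minimal sketch, assuming Python 3; the sample string is made up for illustration:

```python
from html import escape, unescape

raw = 'if a < b & b > c'  # made-up sample text

# escape() replaces &, < and > with entities; quote=False leaves quotes alone.
encoded = escape(raw, quote=False)
print(encoded)  # if a &lt; b &amp; b &gt; c

# unescape() reverses the conversion, so the round trip returns the input.
print(unescape(encoded) == raw)  # True
```

Escaping this way is useful when you need to write processed text back into an HTML document.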

How it works...

To convert the HTML entities, BeautifulSoup looks for tokens that look like an entity and replaces them with their corresponding value in the htmlentitydefs.name2codepoint dictionary from the Python standard library (in Python 3, this dictionary lives at html.entities.name2codepoint). It can do this whether the entity token is within an HTML tag or in a normal string.
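The lookup described above can be sketched with the standard library alone. This illustrates the idea and is not BeautifulSoup's actual implementation; the helper name replace_named_entities and the regular expression are assumptions made for this example, and the import path assumes Python 3:

```python
import re
from html.entities import name2codepoint

# Matches named entities such as &lt; or &amp;.
# Numeric entities like &#60; are deliberately not handled in this sketch.
ENTITY_RE = re.compile(r'&([A-Za-z][A-Za-z0-9]*);')


def replace_named_entities(text):
    """Replace each known named entity with its character.

    Hypothetical helper for illustration; unknown entity names are left untouched.
    """
    def lookup(match):
        name = match.group(1)
        if name in name2codepoint:
            return chr(name2codepoint[name])
        return match.group(0)

    return ENTITY_RE.sub(lookup, text)


print(replace_named_entities('1 &lt; 2 &amp;&amp; 3 &gt; 2'))  # 1 < 2 && 3 > 2
print(replace_named_entities('&nosuchentity;'))  # &nosuchentity;
```

For real work, html.unescape() from the standard library covers numeric entities as well, so a hand-rolled function like this is mainly useful for seeing how the dictionary lookup works.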

There's more...

BeautifulSoup is an excellent HTML and XML parser in its own right, and can be a great alternative to lxml. It's particularly good at handling malformed HTML. You can read more about how to use it at http://www.crummy.com/software/BeautifulSoup/bs4/doc/.

Extracting URLs with BeautifulSoup

Here's an example of using BeautifulSoup to extract URLs, like we did in the Extracting URLs from HTML with lxml recipe. You first create the soup with an HTML string, call the find_all() method (findAll() is the older spelling of the same method) with 'a' to get all anchor tags, and pull out the 'href' attribute to get the URLs:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('Hello <a href="/world">world</a>', 'html.parser')
>>> [a['href'] for a in soup.find_all('a')]
['/world']
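Extracted links are often relative, and some anchor tags have no href at all. The sketch below shows one way to handle both cases, assuming beautifulsoup4 is installed; BASE_URL and the sample HTML are made up for illustration:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'http://example.com/docs/'  # hypothetical page address

html = (
    'Hello <a href="/world">world</a>, '
    '<a name="top">no link here</a>, '
    '<a href="page.html">page</a>'
)
soup = BeautifulSoup(html, 'html.parser')

# href=True skips anchors without an href attribute, which would
# otherwise raise a KeyError on a['href'].
links = [urljoin(BASE_URL, a['href']) for a in soup.find_all('a', href=True)]
print(links)
# ['http://example.com/world', 'http://example.com/docs/page.html']
```

urljoin() resolves each href against the address of the page it came from, so the results can be fetched directly.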
