Converting HTML entities with BeautifulSoup

HTML entities are strings such as "&amp;" or "&lt;". They encode normal ASCII characters that have special meaning in HTML. For example, "&lt;" is the entity for "<". You can't put a literal "<" in HTML text because it is the opening character of an HTML tag, so it has to be escaped as "&lt;". Similarly, "&amp;" is the entity for "&", which, as we've just seen, is the opening character of an entity. If you need to process the text within an HTML document, you'll want to convert these entities back to their normal characters so you can recognize and handle them appropriately.

Getting ready

You'll need to install BeautifulSoup, which you should be able to do with sudo pip install beautifulsoup4 (or sudo easy_install beautifulsoup4 on older setups where easy_install is still available). You can read more about BeautifulSoup at http://www.crummy.com/software/BeautifulSoup/.

How to do it...

BeautifulSoup is an HTML parser library that can also be used for entity conversion. It's quite simple: create an instance of BeautifulSoup given a string containing HTML entities, then get the string attribute:

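>>> from bs4 import BeautifulSoup
>>> # Naming a parser explicitly ('html.parser' is Python's built-in one)
>>> # avoids the warning that bs4 prints when no parser is specified.
>>> BeautifulSoup('&lt;', 'html.parser').string
'<'
>>> BeautifulSoup('&amp;', 'html.parser').string
'&'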

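However, the reverse is not true: BeautifulSoup will not turn a character back into its entity this way. If you try BeautifulSoup('<'), you will not get "&lt;" back. A bare "<" is not valid HTML, so depending on the parser in use the result may be None.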
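
For the reverse direction, the Python standard library can be used instead of BeautifulSoup. The following is a minimal sketch, assuming Python 3; html.escape() and html.unescape() come from the standard html module and are not part of this recipe:

```python
import html

# Escape characters that have special meaning in HTML.
escaped = html.escape('a < b & c')
print(escaped)  # a &lt; b &amp; c

# Convert the entities back to normal characters.
restored = html.unescape(escaped)
print(restored)  # a < b & c
```

This round trip only covers plain strings. It does not parse tags, so BeautifulSoup remains the better fit when the entities are embedded in a real HTML document.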

How it works...

To convert the HTML entities, BeautifulSoup looks for tokens that look like an entity and replaces them with their corresponding value in the htmlentitydefs.name2codepoint dictionary from the Python standard library (in Python 3, this dictionary lives in the html.entities module). It can do this whether the entity token is within an HTML tag or in a plain string.
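
The lookup described above can be sketched with the standard library alone. This is only an illustration of the idea, assuming Python 3, and not BeautifulSoup's actual implementation; the convert_entities() function and its regular expression are hypothetical names made up for this example:

```python
import re
from html.entities import name2codepoint


def convert_entities(text):
    """Replace named entities such as &lt; with their characters."""
    def replace(match):
        name = match.group(1)
        if name in name2codepoint:
            # name2codepoint maps 'lt' -> 60, 'amp' -> 38, and so on.
            return chr(name2codepoint[name])
        # Leave tokens that are not known entities untouched.
        return match.group(0)

    return re.sub(r'&([A-Za-z][A-Za-z0-9]*);', replace, text)


print(convert_entities('1 &lt; 2 &amp;&amp; 3 &gt; 2'))  # 1 < 2 && 3 > 2
```

The sketch handles only named entities; numeric references such as "&#60;" would need an extra branch.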

There's more...

BeautifulSoup is an excellent HTML and XML parser in its own right, and can be a great alternative to lxml. It's particularly good at handling malformed HTML. You can read more about how to use it at http://www.crummy.com/software/BeautifulSoup/bs4/doc/.

Extracting URLs with BeautifulSoup

Here's an example of using BeautifulSoup to extract URLs, like we did in the Extracting URLs from HTML with lxml recipe. You first create the soup with an HTML string, call the find_all() method (spelled findAll() in older code) with 'a' to get all anchor tags, and pull out the 'href' attribute to get the URLs:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('Hello <a href="/world">world</a>', 'html.parser')
>>> [a['href'] for a in soup.find_all('a')]
['/world']
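
Real pages often contain anchors without an href attribute, as well as relative links. The following sketch shows one way to handle both, assuming Python 3 and the built-in 'html.parser'; the sample markup and the base_url value are hypothetical and chosen only for illustration:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

html_doc = (
    '<a href="/world">world</a> '
    '<a name="top">no link</a> '
    '<a href="http://other.example/x">x</a>'
)
base_url = 'http://example.com/docs/'  # hypothetical page address

soup = BeautifulSoup(html_doc, 'html.parser')

# href=True keeps only the anchors that actually have an href attribute,
# so a['href'] cannot raise KeyError. urljoin() resolves relative links
# against the base URL and leaves absolute links unchanged.
urls = [urljoin(base_url, a['href']) for a in soup.find_all('a', href=True)]
print(urls)  # ['http://example.com/world', 'http://other.example/x']
```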

See also

In the Extracting URLs from HTML with lxml recipe, we covered how to use lxml to extract URLs from an HTML string. That recipe is followed by the Cleaning and stripping HTML recipe.
