HTML entities are strings such as "&amp;" or "&lt;". These are encodings of normal ASCII characters that have special uses in HTML. For example, "&lt;" is the entity for "<", but you can't just have a literal "<" in HTML text because it is the beginning character of an HTML tag, hence the need to escape it with the "&lt;" entity. Similarly, "&amp;" is the entity code for "&", which, as we've just seen, is the beginning character of an entity code. If you need to process the text within an HTML document, you'll want to convert these entities back to their normal characters so you can recognize and handle them appropriately.
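If all you need is entity conversion, without full HTML parsing, the Python 3 standard library can do it on its own; here's a minimal sketch using html.unescape and its counterpart html.escape:

```python
import html

# Convert entities back to their normal characters
print(html.unescape('&lt;hello &amp; goodbye&gt;'))  # <hello & goodbye>

# Escape special characters again for safe inclusion in HTML
print(html.escape('<hello & goodbye>', quote=False))  # &lt;hello &amp; goodbye&gt;
```

Note that html.escape also escapes quote characters by default, which matters if the text will end up inside an HTML attribute value.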
You'll need to install BeautifulSoup, which you should be able to do with sudo pip install beautifulsoup4 or sudo easy_install beautifulsoup4. You can read more about BeautifulSoup at http://www.crummy.com/software/BeautifulSoup/.
BeautifulSoup is an HTML parser library that can also be used for entity conversion. It's quite simple: create an instance of BeautifulSoup given a string containing HTML entities, then get the string attribute:
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('&lt;').string
'<'
>>> BeautifulSoup('&amp;').string
'&'
However, the reverse is not true: if you try BeautifulSoup('<'), you will get a None result, because a bare "<" is not valid HTML.
To convert HTML entities, BeautifulSoup looks for tokens that look like an entity and replaces them with the corresponding character from the name2codepoint dictionary in the Python standard library (found in the html.entities module in Python 3, or htmlentitydefs in Python 2). It can do this whether the entity token appears within an HTML tag or in a normal string.
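You can inspect this mapping yourself; a quick sketch using the Python 3 module name:

```python
from html.entities import name2codepoint

# Entity names map to Unicode code points
print(name2codepoint['lt'])          # 60
print(chr(name2codepoint['lt']))     # <
print(chr(name2codepoint['amp']))    # &
print(chr(name2codepoint['copy']))   # the copyright sign
```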
BeautifulSoup is an excellent HTML and XML parser in its own right, and can be a great alternative to lxml. It's particularly good at handling malformed HTML. You can read more about how to use it at http://www.crummy.com/software/BeautifulSoup/bs4/doc/.
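To see this tolerance in action, here's a small sketch (assuming beautifulsoup4 is installed) that feeds BeautifulSoup some invalid markup; the exact repaired tree can vary between backends, so the standard library's html.parser backend is specified explicitly:

```python
from bs4 import BeautifulSoup

# An unclosed <b> tag and a stray </i> -- not valid HTML
broken = '<p>Hello <b>world<p>more text</i>'
soup = BeautifulSoup(broken, 'html.parser')

# BeautifulSoup still builds a usable tree you can search and extract text from
print(soup.get_text())
print([tag.name for tag in soup.find_all()])
```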
Here's an example of using BeautifulSoup to extract URLs, as we did in the Extracting URLs from HTML with lxml recipe. You first create the soup from an HTML string, call the findAll() method with 'a' to get all the anchor tags, and pull out the 'href' attribute of each to get the URLs:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('Hello <a href="/world">world</a>')
>>> [a['href'] for a in soup.findAll('a')]
['/world']