Extracting URLs from HTML with lxml

A common task when parsing HTML is extracting links. This is one of the core functions of every general web crawler. There are a number of Python libraries for parsing HTML, and lxml is one of the best. As you'll see, it comes with some great helper functions geared specifically towards link extraction.

Getting ready

lxml is a Python binding for the C libraries libxml2 and libxslt, which makes it a very fast XML and HTML parsing library that is still Pythonic. However, it also means you need the C libraries installed for it to work. Installation instructions are available at http://lxml.de/installation.html. If you're running Ubuntu Linux, installation is as easy as sudo apt-get install python-lxml. You can also try pip install lxml. The latest version as of this writing is 3.3.5.
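
Once it's installed, you can sanity check the install from a Python shell. This is a minimal check; the version tuple shown assumes the 3.3.5 release mentioned above, and yours may differ:

>>> from lxml import etree
>>> etree.LXML_VERSION
(3, 3, 5, 0)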

How to do it...

lxml comes with an html module designed specifically for parsing HTML. Using the fromstring() function, we can parse an HTML string and get a list of all the links. The iterlinks() method generates 4-tuples of the form (element, attr, link, pos):

  • element: This is the parsed node of the anchor tag from which the link is extracted. If you're just interested in the link, you can ignore this.
  • attr: This is the attribute the link came from, which is usually 'href'.
  • link: This is the actual URL extracted from the anchor tag.
  • pos: This is the character position at which the link occurs within its attribute or text. For ordinary attribute links such as href, this is always 0; it's nonzero only for links embedded in text, such as CSS url() references inside style tags or style attributes.

Here's some code to demonstrate:

>>> from lxml import html
>>> doc = html.fromstring('Hello <a href="/world">world</a>')
>>> links = list(doc.iterlinks())
>>> len(links)
1
>>> (el, attr, link, pos) = links[0]
>>> attr
'href'
>>> link
'/world'
>>> pos
0

How it works...

lxml parses the HTML into an ElementTree. This is a tree structure of parent and child nodes, where each node represents an HTML tag and contains all the corresponding attributes of that tag. Once the tree is created, it can be iterated over to find elements, such as the a or anchor tag. The core tree handling code is in the lxml.etree module, while the lxml.html module contains only HTML-specific functions for creating and iterating a tree. For complete documentation, see the lxml tutorial at http://lxml.de/tutorial.html.
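
To see this tree structure in action, here's a minimal sketch that walks the parsed tree directly and inspects the anchor node, instead of using iterlinks():

>>> from lxml import html
>>> doc = html.fromstring('Hello <a href="/world">world</a>')
>>> a = doc.find('a')
>>> a.tag
'a'
>>> a.get('href')
'/world'
>>> a.text
'world'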

There's more...

You'll notice that the link mentioned earlier is relative, meaning it's not an absolute URL. We can make it absolute by calling the make_links_absolute() method with a base URL before extracting the links:

>>> doc.make_links_absolute('http://hello')
>>> abslinks = list(doc.iterlinks())
>>> (el, attr, link, pos) = abslinks[0]
>>> link
'http://hello/world'

Extracting links directly

If you don't want to do anything other than extract links, you can call the iterlinks() function with an HTML string:

>>> links = list(html.iterlinks('Hello <a href="/world">world</a>'))
>>> links[0][2]
'/world'

Parsing HTML from URLs or files

Instead of parsing an HTML string using the fromstring() function, you can call the parse() function with a URL or filename; for example, html.parse('http://my/url') or html.parse('/path/to/file'). One difference is that parse() returns an ElementTree instead of an element, so you need to call getroot() on the result to get the same kind of object that fromstring() returns.
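
Here's a short sketch of parsing a file and then extracting its links. The filename is hypothetical; substitute a real HTML file or a URL:

>>> from lxml import html
>>> tree = html.parse('/path/to/file')
>>> doc = tree.getroot()
>>> links = list(doc.iterlinks())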

Extracting links with XPaths

Instead of using the iterlinks() method, you can also get links using the xpath() method, which is a general way to extract whatever you want from HTML or XML parse trees. Note that this particular XPath expression only matches the href attributes of anchor tags, whereas iterlinks() also finds links in other attributes, such as src:

>>> doc.xpath('//a/@href')[0]
'http://hello/world'

For more on XPath syntax, see http://www.w3schools.com/XPath/xpath_syntax.asp.

See also

In the next recipe, we'll cover cleaning and stripping HTML.
