Searching with XPath

To avoid exhaustively iterating over and checking every element, we can use XPath, a query language developed specifically for XML and supported by lxml.

To get started with XPath, use the Python shell from the last section, and do the following:

>>> root.xpath('body')
[<Element body at 0x4477530>]

This is the simplest form of XPath expression; it searches for children of the current element that have tag names that match the specified tag name. The current element is the one we call xpath() on—in this case, root. The root element is the top-level <html> element in the HTML document, and so the returned element is the <body> element.
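Because xpath() is a method on every element, not just the root, we can also search relative to an element we have already matched. Here is a quick sketch against the same document (the exact element addresses in the output will differ):

>>> body = root.xpath('body')[0]
>>> body.xpath('div')    # same result as root.xpath('body/div'); relative to <body>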

XPath expressions can contain multiple levels of elements. The searches start from the node the xpath() call is made on and work down the tree as they match successive elements in the expression. We can use this to find just the <div> child elements of <body>:

>>> root.xpath('body/div')
[<Element div at 0x447a1e8>, <Element div at 0x447a210>, <Element div at 0x447a238>]

The body/div expression means: match the <div> children of the <body> children of the current element. Elements with the same tag can appear more than once at the same level in an XML document, so an XPath expression can match multiple elements; this is why the xpath() function always returns a list.
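Since the result is an ordinary Python list, we can loop over it like any other list. For example, the following sketch prints the tag name and id attribute (if any) of each matched element; the exact output depends on the page:

>>> for div in root.xpath('body/div'):
...     print(div.tag, div.get('id'))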

The preceding queries are relative to the element that we call xpath() on, but we can force a search from the root of the tree by adding a slash to the start of the expression. We can also perform a search over all the descendants of an element with the help of a double slash. To see the latter in action, try the following:

>>> root.xpath('//h1')
[<Element h1 at 0x447aa58>]
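For comparison, here is a sketch of the difference between the two forms, assuming the same document structure: an expression starting with a single slash is evaluated from the document root no matter which element we call xpath() on, while the double slash matches descendants at any depth:

>>> h1 = root.xpath('//h1')[0]
>>> h1.xpath('/html/body/div')   # leading slash: start from the document root
>>> h1.xpath('//div')            # double slash: all <div> descendants of the root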

The real power of XPath lies in applying additional conditions to the elements in the path:

>>> root.xpath('//div[@id="content"]')
[<Element div at 0x3d6d800>]

The square brackets after div, [@id="content"], form a condition that we place on the <div> elements that we're matching. The @ sign before id means that id refers to an attribute, so the condition reads: only elements with an id attribute equal to "content". This is how we can find our content <div> tag.
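Attribute conditions can also be looser than an exact value match, and we can even select the attribute value itself rather than the element. A couple of hedged examples (the results depend on the page's markup):

>>> root.xpath('//div[@id]')                  # <div> elements that have any id attribute
>>> root.xpath('//div[@id="content"]/@id')    # the attribute value itself, as a string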

Before we employ this to extract our information, let's just touch on a couple of useful things that we can do with conditions. We can use just a tag name as a condition, as shown here:

>>> root.xpath('//div[h1]')
[<Element div at 0x3d6d800>]

This returns all the <div> elements that have an <h1> child element. Also try the following:

>>> root.xpath('body/div[2]')
[<Element div at 0x3d6d800>]

Putting a number as a condition will return the element at that position in the matched list. In this case, this is the second <div> child element of <body>. Note that these indexes start at 1, unlike Python indexing which starts at 0. There's a lot more that XPath can do: the full specification is a World Wide Web Consortium (W3C) standard. The latest version can be found at: http://www.w3.org/TR/xpath-3.
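XPath also provides functions that can be used inside these positional conditions, such as last() and position(). A short sketch against the same document:

>>> root.xpath('body/div[last()]')        # the last <div> child of <body>
>>> root.xpath('body/div[position()>1]')  # every <div> child after the first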

Now, let's finish up by writing a script to get our Debian version information.

You can find the following code in the get_debian_version.py file in the lxml folder:

import re
import requests
from lxml.etree import HTML

response = requests.get('https://www.debian.org/releases/stable/index.en.html')
root = HTML(response.content)

title_text = root.find('head').find('title').text

if re.search('\u201c(.*)\u201d', title_text):
    release = re.search('\u201c(.*)\u201d', title_text).group(1)
    p_text = root.xpath('//div[@id="content"]/p[1]')[0].text
    version = p_text.split()[1]
    print('Codename: {} Version: {}'.format(release, version))

Here, we have downloaded and parsed the web page, pulling out the text that we want with the help of XPath. We have used a regular expression to pull out the stretch release codename, and a split to extract the version number, 9.6. Finally, we print them out. So, run it as shown here:

$ python get_debian_version.py
Codename: stretch
Version: 9.6
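As a side note, the chained find() calls in the script could also be replaced with a single XPath expression that returns the title text directly; a minimal, equivalent sketch (assuming root is parsed from the same Debian page):

>>> root.xpath('//head/title/text()')[0]
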
XPath is a language that allows you to select nodes from an XML document and calculate values from their content. There are several XPath versions approved by the W3C. At this URL, you can find the documentation for all XPath versions: https://www.w3.org/TR/xpath/all/.

In this example, we are using XPath expressions to get images and links from a URL. For extracting images, we use the '//img/@src' XPath expression and for extracting links we use the '//a/@href' expression.

You can find the following code in the get_links_images.py file in the lxml folder:

#!/usr/bin/env python3

import os
import requests
from lxml import html


class Scraping:

    def scrapingImages(self, url):
        print("Getting images from url: " + url)
        try:
            response = requests.get(url)
            parsed_body = html.fromstring(response.text)
            # XPath expression to get image sources
            images = parsed_body.xpath('//img/@src')
            print('Found images %s' % len(images))
            # create a directory to save the images in
            os.system("mkdir images")
            for image in images:
                if not image.startswith("http"):
                    download = url + "/" + image
                else:
                    download = image
                print(download)
                # download each image into the images directory
                r = requests.get(download)
                f = open('images/%s' % download.split('/')[-1], 'wb')
                f.write(r.content)
                f.close()
        except Exception as e:
            print("Connection error in " + url)
            pass

In the previous code block, we define the scrapingImages function for extracting images from a URL using the '//img/@src' XPath expression. In the next code block, in a similar way, we define the scrapingLinks function for extracting links from a URL using the '//a/@href' XPath expression:

    def scrapingLinks(self, url):
        print("Getting links from url: " + url)
        try:
            response = requests.get(url)
            parsed_body = html.fromstring(response.text)
            # XPath expression to get links
            links = parsed_body.xpath('//a/@href')
            print('Found links %s' % len(links))
            for link in links:
                print(link)
        except Exception as e:
            print("Connection error in " + url)
            pass

if __name__ == "__main__":
    target = "https://news.ycombinator.com"
    scraping = Scraping()
    scraping.scrapingImages(target)
    scraping.scrapingLinks(target)