Extracting data with XPath

Since HTML is a structured document, you can use a computer program to navigate that structure and extract selected text nodes, attributes, and elements. Most Web extraction tools are based on XPath: an XML standard that can be used to navigate an XML structure and select elements, attributes, and text nodes using path notation. Although HTML is not as strict as XML, it has a similar structure that can be addressed with XPath paths, and XPath is supported by many Web scraping tools.

For example, the first lines of the previous web page have the following structure:

<html>
<head>
<title>Planetary Fact Sheet</title>
</head>
<body bgcolor=FFFFFF>
<p>
<hr>
<H1>Planetary Fact Sheet - Metric</H1>
<hr>
<p>
<table> ...

It's not XML or XHTML, since attribute values are not quoted and some tags are never closed, but you can still use XPath to extract data from it. This path will give you the title:

/html/head/title/text()

Either of these will return the bgcolor attribute (its name and value) from the body tag:

/html/body/@bgcolor
/html/body/attribute::bgcolor

This one will return the contents of the <H1> header:

/html/body/h1/text()

This one is tricky. If this were XML, it would be /html/body/p/hr/H1, because all XML tags must close, but HTML parsers automatically close the <p> and <hr> tags because there can't be an <h1> header inside them. HTML is also case insensitive, so using H1 or h1 doesn't make any difference with these parsers. Even so, some tools may build the tree differently. You can play it safe by using:

/html/body//H1/text()

The // or double slash means that between <body> and <H1> there can be any number of levels. The expression therefore matches whether the tree follows the XML-style or the HTML-style absolute path.
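
As a rough illustration of how the single and the double slash differ, the sketch below evaluates both kinds of steps against a hand-built tree. The tree, the selectPath function, and its tiny grammar are assumptions made for this example only; a real XPath engine supports far more and is not implemented this way.

```javascript
// Hand-built stand-in for the parsed page. The <p> and <hr> tags are already
// closed, so <h1> sits directly under <body>, as an HTML parser would place it.
const page = {
  tag: "html",
  children: [
    { tag: "head", children: [{ tag: "title", text: "Planetary Fact Sheet", children: [] }] },
    {
      tag: "body",
      children: [
        { tag: "p", children: [] },
        { tag: "hr", children: [] },
        { tag: "h1", text: "Planetary Fact Sheet - Metric", children: [] },
        { tag: "hr", children: [] },
      ],
    },
  ],
};

// Every node below the given one: children, grandchildren, and so on.
function descendants(node) {
  return node.children.flatMap((child) => [child, ...descendants(child)]);
}

// Hypothetical mini-evaluator: understands only "/name" and "//name" steps.
function selectPath(root, path) {
  // Virtual document node, so that the first step can match the root element.
  let current = [{ tag: "#document", children: [root] }];
  for (const [, axis, name] of path.matchAll(/(\/\/?)([A-Za-z0-9]+)/g)) {
    const matches = current.flatMap((node) => {
      const candidates = axis === "//" ? descendants(node) : node.children;
      // Tag names are compared in lowercase, mimicking HTML's case insensitivity.
      return candidates.filter((c) => c.tag === name.toLowerCase());
    });
    current = [...new Set(matches)]; // drop duplicates reached through nested contexts
  }
  return current;
}

console.log(selectPath(page, "/html/body/h1").map((n) => n.text));  // direct child
console.log(selectPath(page, "/html/body//H1").map((n) => n.text)); // any depth
console.log(selectPath(page, "/html/body/p/hr/h1").length);         // 0: XML-style path finds nothing here
```

The first two calls find the same header, while the XML-style path finds nothing in this tree, which is why the // form is the safer choice.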

You can experiment with XPath using your browser's JavaScript console, writing XPath expressions inside $x(expression). Let's try it out using the Planetary Fact Sheet page. Open the page in your browser and then open a console window, and type the following:

$x("//table")

This will select all the tables in the document. In this case, there is only one. You can also view the source code or inspect the page to discover the absolute path:

$x("/html/body/p/table")

Enter this command and the console will reveal the HTML fragment corresponding to your selection. Now let's select the row that contains diameters. It's the third row in the table. You can ignore the existing <thead> or <tbody> tags using the //. XPath counts child nodes starting with 1, not 0 as in JavaScript. The command returns a single <tr> element in an array. We can extract it using [0]:

$x("//table//tr[3]")[0]

This will select the following fragment:

<tr>
<td align="left"><b><a href="planetfact_notes.html#diam">Diameter</a>
(km)</b></td>
<td align="center" bgcolor="F5F5F5">4879</td>
<td align="center" bgcolor="FFFFFF">12,104</td>
<td align="center" bgcolor="F5F5F5">12,756</td>
<td align="center" bgcolor="FFFFFF">3475</td>
<td align="center" bgcolor="F5F5F5">6792</td>
<td align="center" bgcolor="FFFFFF">142,984</td>
<td align="center" bgcolor="F5F5F5">120,536</td>
<td align="center" bgcolor="FFFFFF">51,118</td>
<td align="center" bgcolor="F5F5F5">49,528</td>
<td align="center" bgcolor="FFFFFF">2370</td>
</tr>

To select the diameter of Earth, you need to add one more path step:

$x("//table//tr[3]/td[4]")[0]

The result is as follows:

<td align="center" bgcolor="F5F5F5">12,756</td>

To extract the text, you need to add text() at the end of the XPath expression. The $x() function then returns text nodes, and you read the string from a node using its data property:

const result = $x("/html/body/p/table/tbody/tr[3]/td[4]/text()")[0].data

This will return the result as a string. You can then use a regular expression to remove the commas and convert the result to a number:

const value = +result.replace(/,/g, '');
// removes commas and converts to number
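
The same conversion can be wrapped in a small function and applied to a whole row. This is only a minimal sketch: toNumber is a hypothetical helper name, and the strings are copied by hand from the diameter row shown earlier rather than read from the page.

```javascript
// Hypothetical helper: strips thousands separators, then converts to a number.
function toNumber(text) {
  const value = Number(text.trim().replace(/,/g, ""));
  if (Number.isNaN(value)) {
    throw new Error(`Not a number: "${text}"`);
  }
  return value;
}

// Cell values copied from the diameter row of the table.
const cells = ["4879", "12,104", "12,756", "3475", "6792",
               "142,984", "120,536", "51,118", "49,528", "2370"];

const diameters = cells.map((cell) => toNumber(cell));
console.log(diameters[2]); // 12756
```

Throwing on text that cannot be parsed is a design choice for this sketch; it makes a cell with unexpected content visible instead of silently producing NaN.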

You might want to automate that with a programming library if you need to extract lots of data, such as all the planetary diameters. The $x() command only works in the browser console, but many programming languages have XPath libraries and APIs. You can also use tools such as Scrapy (in Python) or testing tools such as Selenium (in several languages), which support XPath selectors for extracting data from HTML.
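
In a script running inside a page, the standard DOM method document.evaluate() can take the place of $x(). The sketch below is one possible wrapper, not a definitive implementation. The selectText name and the doc parameter are assumptions made here so that the function can also be tried outside a browser with a stand-in object.

```javascript
// Numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE in the DOM standard.
// The literal is used so the function does not depend on a browser global.
const ORDERED_NODE_SNAPSHOT_TYPE = 7;

// Hypothetical helper: evaluates an XPath expression that selects text nodes
// and returns their contents as an array of strings, in document order.
function selectText(doc, expression) {
  const snapshot = doc.evaluate(
    expression,                 // XPath expression to evaluate
    doc,                        // context node: the whole document
    null,                       // no namespace resolver
    ORDERED_NODE_SNAPSHOT_TYPE, // ask for a static, ordered list of nodes
    null                        // do not reuse an existing result object
  );
  const texts = [];
  for (let i = 0; i < snapshot.snapshotLength; i++) {
    texts.push(snapshot.snapshotItem(i).data); // .data holds a text node's string
  }
  return texts;
}

// In a browser, assuming the table layout shown above:
// selectText(document, "//table//tr[3]/td[position() > 1]/text()")
```

The returned strings still contain the thousands separators, so they need the same comma removal as before if you want numbers.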

XPath is a very powerful data extraction language, and this was only a very brief introduction. But there are also alternatives, such as XQuery (another XML standard with a query syntax) and CSS selectors (used by jQuery and also supported by Scrapy and Selenium).
