Scraping information from websites

Governments and jurisdictions around the world are increasingly embracing the importance of open data, which aims to increase citizen participation and inform decision making, making policies more open to public scrutiny. Some examples of open data initiatives around the world include data.gov (United States of America), data.gov.uk (United Kingdom), and data.gov.hk (Hong Kong).

These data portals often provide Application Programming Interfaces (APIs; see Chapter 7, Visualizing Online Data, for more details) for programmatic access to data. However, APIs are not available for some datasets; hence, we resort to good old web scraping techniques to extract information from websites.

BeautifulSoup (https://www.crummy.com/software/BeautifulSoup/) is an incredibly useful package for scraping information from websites. Essentially, anything marked up with an HTML tag can be extracted with this wonderful package, from text, links, tables, and styles to images. Scrapy is also a good package for web scraping, but it is more of a framework for writing powerful web crawlers. So, if you just need to fetch a table from a page, BeautifulSoup offers simpler procedures.

We are going to use BeautifulSoup version 4.6 throughout this chapter. To install BeautifulSoup 4, we can once again rely on PyPI:

pip install beautifulsoup4

The data on US unemployment rates and earnings by educational attainment (2016) is available at https://www.bls.gov/emp/ep_table_001.htm. BeautifulSoup does not handle HTTP requests itself, so we need to use the urllib.request or requests package to fetch a web page for us. Of the two options, the requests package is probably easier to use, due to its higher-level HTTP client interface. If requests is not available on your system, you can install it from PyPI:

pip install requests
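To see the difference, here is a minimal sketch of fetching the same page with both packages (the URL is the one we use later in this section, and the snippet assumes the page is UTF-8 encoded); with urllib.request we have to decode the raw bytes ourselves, whereas requests exposes the decoded body directly:

import urllib.request

import requests


url = "https://www.bls.gov/emp/ep_table_001.htm"

# urllib.request returns raw bytes that we must decode manually
with urllib.request.urlopen(url) as f:
    html_urllib = f.read().decode('utf-8')

# requests decodes the response body for us and exposes it as .text
html_requests = requests.get(url).text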

Let's take a look at the web page before we write the web scraping code. If we visit the Bureau of Labor Statistics website in the Google Chrome browser, we can inspect the HTML code behind the table we need by right-clicking on the table and selecting Inspect:

A pop-up window for code inspection will be shown, which allows us to read the code for each of the elements on the page.

More specifically, we can see that the column names are defined in the <thead>...</thead> section, while the table content is defined in the <tbody>...</tbody> section.
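For reference, the skeleton of the table looks something like the following (a simplified sketch; the attributes and formatting on the actual page differ):

<table>
  <thead>
    <tr>
      <th>Educational attainment</th>
      <th>Unemployment rate (%)</th>
      <th>Median usual weekly earnings ($)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Doctoral degree</td>
      <td>1.6</td>
      <td>1,664</td>
    </tr>
    <!-- ...more rows... -->
  </tbody>
</table>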

To instruct BeautifulSoup to scrape the information we need, we have to give it clear directions. We can right-click on the relevant section in the code inspection window and copy its unique identifier in the form of a CSS selector.

Cascading Style Sheets (CSS) selectors were originally designed for applying element-specific styles to a website. For more information, visit the following page: https://www.w3schools.com/cssref/css_selectors.asp.
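As a quick, self-contained sketch of how a few common selector patterns map to BeautifulSoup's select() method (the HTML snippet here is made up for illustration):

from bs4 import BeautifulSoup


snippet = '<div id="main"><p class="note">Hello</p></div>'
soup = BeautifulSoup(snippet, 'html.parser')

soup.select('p')          # all <p> tags
soup.select('.note')      # all elements with class="note"
soup.select('#main')      # the element with id="main"
soup.select('#main > p')  # <p> tags that are direct children of id="main"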

Let's try to get the CSS selectors for thead and tbody, and use the BeautifulSoup.select() method to scrape the respective HTML code:

import requests
from bs4 import BeautifulSoup


# Specify the url
url = "https://www.bls.gov/emp/ep_table_001.htm"

# Query the website and get the html response
response = requests.get(url)

# Parse the returned html using BeautifulSoup
bs = BeautifulSoup(response.text, 'html.parser')

# Select the table header by CSS selector
thead = bs.select("#bodytext > table > thead")[0]

# Select the table body by CSS selector
tbody = bs.select("#bodytext > table > tbody")[0]

# Make sure the code works
print(thead)

We see the following output from the previous code:

<thead> <tr> <th scope="col"><p align="center" valign="top"><strong>Educational attainment</strong></p></th> <th scope="col"><p align="center" valign="top">Unemployment rate (%)</p></th> <th scope="col"><p align="center" valign="top">Median usual weekly earnings ($)</p></th> </tr> </thead>

Next, we are going to find all instances of <th>...</th> in <thead>...</thead>, which contain the column names. We will build a dictionary of lists, with the headers as keys, to hold the data:

# Get the column names
headers = []

# Find all header columns in <thead> as specified by <th> html tags
for col in thead.find_all('th'):
    headers.append(col.text.strip())

# Dictionary of lists for storing parsed data
data = {header: [] for header in headers}

Finally, we parse the remaining rows (<tr>...</tr>) from the body (<tbody>...</tbody>) of the table and convert the data into a pandas DataFrame:

import pandas as pd


# Parse the rows in the table body
for row in tbody.find_all('tr'):
    # Find all columns in a row as specified by <th> or <td> html tags
    cols = row.find_all(['th', 'td'])

    # enumerate() allows us to loop over an iterable,
    # and return each item preceded by a counter
    for i, col in enumerate(cols):
        # Strip whitespace around the text
        value = col.text.strip()

        # Try to convert the columns to float, except the first column
        if i > 0:
            # Remove all commas in the string before conversion
            value = float(value.replace(',', ''))

        # Append the value to the dict of lists
        data[headers[i]].append(value)

# Create a dataframe from the parsed dictionary
df = pd.DataFrame(data)

# Show an excerpt of parsed data
df.head()


  Educational attainment  Median usual weekly earnings ($)  Unemployment rate (%)
0        Doctoral degree                            1664.0                    1.6
1    Professional degree                            1745.0                    1.6
2        Master's degree                            1380.0                    2.4
3      Bachelor's degree                            1156.0                    2.7
4     Associate's degree                             819.0                    3.6

We have now fetched the HTML table and formatted it as a structured pandas DataFrame.
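From here, the usual pandas operations apply. For example, we can sort the rows by earnings (the column name below matches the scraped header shown in the output above):

# Sort the parsed data by weekly earnings, highest first
df.sort_values('Median usual weekly earnings ($)', ascending=False)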
