In previous chapters, we've seen how to use the Requests library to retrieve web pages. As I've said before, it is a fantastic tool, but unfortunately, it won't work for us here. The page we want to scrape is entirely AJAX-based. Asynchronous JavaScript (AJAX) is a method for retrieving data from a server without having to reload the page. What this means for us is that we'll need to use a browser to retrieve the data. While that might sound like it would require an enormous amount of overhead, there are two libraries that, when used together, make it a lightweight task.
The two libraries are Selenium and ChromeDriver. Selenium is a powerful tool for automating web browsers, and ChromeDriver is a browser. Why use ChromeDriver rather than Firefox or Chrome itself? ChromeDriver is what's known as a headless browser. This means it has no user interface. This keeps it lean, making it ideal for what we're trying to do.
We'll also need another library called BeautifulSoup to parse the data from the page. If you don't have that installed, you should pip install that now as well.
With that done, let's get started. We'll start out within the Jupyter Notebook. This works best for exploratory analysis. Later, when we've completed our exploration, we'll move on to working in a text editor for the code we want to deploy. This is done in following steps:
- First, we import our routine libraries, as shown in the following code snippet:
import numpy as np import pandas as pd import matplotlib.pyplot as plt %matplotlib inline
- Next, make sure you have installed BeautifulSoup and Selenium, and downloaded ChromeDriver, as mentioned previously. We'll import those now in a new cell:
from bs4 import BeautifulSoup from selenium import webdriver # replace this with the path of where you downloaded chromedriver chromedriver_path = "/Users/alexcombs/Downloads/chromedriver" browser = webdriver.Chrome(chromedriver_path)
Notice that I have referenced the path on my machine where I have downloaded ChromeDriver. Note that you will have to replace that line with the path on your own machine.