Pulling down listing data

We'll be using the RentHop site, http://www.renthop.com, to source our listing data. The following screenshot of the site shows the layout of the listings we'll be retrieving:

We can see that each listing includes the address, the price, the number of bedrooms, and the number of bathrooms. We'll start by retrieving this information for each listing.

We are going to be using the Python Requests library for this task. Requests is dubbed "HTTP for Humans", and it makes retrieving websites straightforward. If you want an overview of how to use Requests, the quick start guide is available at http://docs.python-requests.org/en/master/user/quickstart/. Follow these steps:

  1. So, the first step is to prepare our Jupyter Notebook with the imports we'll be using for this task. We do that in the following code snippet:
import numpy as np 
import pandas as pd 
import requests 
import matplotlib.pyplot as plt 
%matplotlib inline 

We'll likely need to import more libraries later on, but for now this should get us started.

  2. We are going to use NYC apartment data in our model. The URL for that data is https://www.renthop.com/nyc/apartments-for-rent. Let's run a quick test and make sure we can retrieve that page. We do that in the following code:
r = requests.get('https://www.renthop.com/nyc/apartments-for-rent') 
r.content 
  3. This code makes a call to the site and retrieves the information, storing it in the r object. There are a number of attributes we could retrieve from that r object, but for now, we just want the page content. We can see the output of that in the following screenshot:
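Before relying on r.content, it is worth checking that the request actually succeeded. The following is a minimal sketch of a more defensive fetch; fetch_html is a hypothetical helper name, but status_code, raise_for_status(), and text are standard attributes of a Requests Response object:

```python
import requests

def fetch_html(url: str) -> str:
    """Fetch a page, failing loudly if the server returns an error."""
    r = requests.get(url, timeout=10)
    r.raise_for_status()   # raises requests.HTTPError on 4xx/5xx responses
    return r.text          # decoded text; r.content holds the raw bytes

# html = fetch_html('https://www.renthop.com/nyc/apartments-for-rent')
```

Calling raise_for_status() early means a blocked or moved page surfaces as an exception rather than as mysteriously empty parse results later on.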

  4. Upon inspection, it looks like everything we want is contained in this. To verify that, let's copy all of the HTML and paste it into a text editor, and then open it in a browser. I'm going to do that using Sublime Text, a popular text editor available at https://www.sublimetext.com/.
  5. In the following screenshot, you can see that I have pasted the copied HTML from the Jupyter output into Sublime Text and saved it as test.html:


  6. Next, we click on Open in Browser, and we can see output that resembles the following image:

Notice that although the text doesn't render cleanly (due to the lack of CSS), all the data we are targeting is there. Fortunately for us, that means the RentHop site doesn't use any advanced JavaScript rendering, so that should make our job much easier. If it did, we'd have to use a different tool like Selenium.
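Incidentally, the manual copy-and-paste step above can also be scripted. This is a small sketch (save_page is a hypothetical helper name) that writes the response bytes straight to disk, where r is the Response object from the earlier requests.get call:

```python
def save_page(content: bytes, path: str = 'test.html') -> None:
    """Write raw page bytes to disk so the file can be opened in a browser."""
    with open(path, 'wb') as f:   # 'wb' because r.content is bytes, not str
        f.write(content)

# save_page(r.content)  # then open test.html in your browser
```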

Let's now examine the page elements to see how we can parse the page data:

  1. Open the RentHop site in Chrome and right-click anywhere on the page.
  2. At the bottom of the context menu, you should see Inspect. Click on that. The page should now resemble the following image:

  3. In the tool that just opened, there is an icon in the upper left-hand corner: a square with an arrow. Click that, and then click on the listing data on the page. It should look like the following:

We can see from this that each listing's data is in a table: the first td tag contains the price, the second contains the number of bedrooms, and the third contains the number of bathrooms. We will also want the apartment's address, which can be found in an anchor, or <a>, tag.

Let's now begin building out our code to test our parsing of the data. To do our HTML parsing, we are going to use a library called BeautifulSoup. The documentation for it can be found at https://www.crummy.com/software/BeautifulSoup/. BeautifulSoup is a popular, easy-to-use Python HTML parsing library. It can be pip installed if you don't already have it. We are going to use it to pull out all of the individual specs for our apartment listings:

  1. To get started, we simply need to pass our page content into the BeautifulSoup class. This can be seen in the following code:
from bs4 import BeautifulSoup 
 
soup = BeautifulSoup(r.content, "html5lib") 
  2. We can now use this soup object to begin parsing out our apartment data. The first thing we want to do is retrieve the div tags that contain our listing data on the page. We see that in the following code:
listing_divs = soup.select('div[class*=search-info]') 
listing_divs 

What we've done in the preceding code is to select all div tags whose class attribute contains the string search-info. These are exactly the divs that hold our data.
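To see how that [class*=...] substring selector behaves, here is a self-contained sketch. The markup below is an invented stand-in, not RentHop's actual HTML, and it uses Python's built-in html.parser rather than html5lib so nothing extra needs installing:

```python
from bs4 import BeautifulSoup

# [class*=search-info] matches any element whose class attribute
# contains the substring "search-info", even alongside other classes.
html = """
<div class="search-info px-2">listing one</div>
<div class="search-info px-2">listing two</div>
<div class="other-info">not a listing</div>
"""
soup = BeautifulSoup(html, 'html.parser')
divs = soup.select('div[class*=search-info]')
print(len(divs))  # → 2
```

Note that the third div is skipped because its class lacks the substring; this is why the selector captures exactly the listing divs on the real page.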

  3. Next, we look at the output from this in the following screenshot:

  4. Notice that we have a Python list of all the div tags we were seeking. We know from looking at the page that there should be twenty of these. Let's confirm that:
len(listing_divs) 
  5. We then see the following output, which confirms that we have captured them all as we wanted:
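As a preview of the extraction to come, here is a minimal sketch of pulling the individual fields out of a single listing div. The HTML below is a simplified stand-in based on what we saw in the inspector (the address in an anchor tag, then price, bedrooms, and bathrooms in successive td tags); the real RentHop markup is more deeply nested:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one listing div; real markup differs.
listing_html = """
<div class="search-info">
  <a href="/listing/123">123 Example St, Apt 4</a>
  <table><tr>
    <td>$2,500</td><td>2 Bed</td><td>1 Bath</td>
  </tr></table>
</div>
"""
div = BeautifulSoup(listing_html, 'html.parser').div
address = div.a.get_text()                              # first <a> tag
price, beds, baths = [td.get_text() for td in div.select('td')]
print(address, price, beds, baths)
```

Looping the same logic over each element of listing_divs is essentially what our full parser will do.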
