Pulling out the individual data points

Now that we have all the divs with our listing data, we need to pull out the individual data points for each apartment.

These are the points in each that we want to target:

  • URL of the listing
  • Address of the apartment
  • Neighborhood
  • Number of bedrooms
  • Number of bathrooms

Obviously, we'd love to have far more information, such as the square footage, but we'll have to make do with what we have.

Let's begin by looking at the first listing:

listing_divs[0] 

The preceding code results in the following output:

Notice that this first div contains all of the data points we were looking for. We now just need to parse each of them out individually. Let's look at the first one we want to retrieve, the URL.

We can see that the URL for the listing is inside an anchor, or a, tag. Let's parse that out now. We can do that with another select statement, as can be seen in the following code snippet:

listing_divs[0].select('a[id*=title]')[0]['href'] 

We see the output in the following screenshot:
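The id*=title selector matches any tag whose id attribute contains the substring title. The following self-contained sketch shows the same selector at work on a made-up listing snippet (the ids and URL here are invented for illustration; the real page's markup will differ):

```python
from bs4 import BeautifulSoup

# Made-up, trimmed-down version of a single listing div
html = '''
<div class="search-listing">
  <a id="listing-title-123" href="https://www.renthop.com/listings/123">123 Example St, Apt 4B</a>
  <div id="listing-hood-123">Upper East Side, Manhattan</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# [id*=title] is a CSS substring-match attribute selector
href = soup.select('a[id*=title]')[0]['href']
print(href)
# → https://www.renthop.com/listings/123
```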

This is exactly what we were hoping for. We can now continue to retrieve the other data points for the listing. We do that in the following code:

current_listing = listing_divs[0] 
href = current_listing.select('a[id*=title]')[0]['href'] 
addy = current_listing.select('a[id*=title]')[0].string 
hood = current_listing.select('div[id*=hood]')[0]\
       .string.replace('\n', '') 

Let's now verify this by printing out what we've captured. We do that in the following code:

print(href) 
print(addy) 
print(hood) 

The preceding code results in the following output:

Based on this output, we are getting the data we need. Let's continue on with the last few items we need—the bedrooms, bathrooms, and the price.

These items are presented slightly differently: they sit inside a table tag within our div, with each data point in its own table row, or tr. We will therefore need to iterate over the rows to capture our data. We do that in the following code:

listing_specs = listing_divs[0].select('table[id*=info] tr') 
for spec in listing_specs: 
    spec_data = spec.text.strip().replace(' ', '_').split() 
    print(spec_data) 

The preceding code results in the following output:

Again, this is exactly what we were looking for. We now have all the data that we were seeking. Let's now pull it all together in a loop so that we can pull the data from each listing and save it into a list.
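To make the strip/replace/split chain concrete, here is a self-contained sketch run on a hypothetical row's text (the actual text of each tr on the live page will vary):

```python
# Hypothetical text of one spec row, as spec.text might return it
raw = '  $2,500\n2 Bed\n1 Bath  '

# strip() removes the outer whitespace, replace() glues multi-word
# values like "2 Bed" into single tokens, and split() then breaks
# the string on the remaining newlines
spec_data = raw.strip().replace(' ', '_').split()
print(spec_data)
# → ['$2,500', '2_Bed', '1_Bath']
```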

In the following code, we will pull out all the data points for each listing:

import numpy as np  # needed for np.nan below 
 
listing_list = [] 
for idx in range(len(listing_divs)): 
    indv_listing = [] 
    current_listing = listing_divs[idx] 
    href = current_listing.select('a[id*=title]')[0]['href'] 
    addy = current_listing.select('a[id*=title]')[0].string 
    hood = current_listing.select('div[id*=hood]')[0]\
           .string.replace('\n', '') 
     
    indv_listing.append(href) 
    indv_listing.append(addy) 
    indv_listing.append(hood) 
     
    listing_specs = current_listing.select('table[id*=info] tr') 
    for spec in listing_specs: 
        try: 
            indv_listing.extend(spec.text.strip() 
                                .replace(' ', '_').split()) 
        except Exception: 
            indv_listing.append(np.nan) 
    listing_list.append(indv_listing)     

Let's unpack what we did in the preceding code. We know we have 20 divs containing apartment listings on the page, so we create a for loop that goes through each one, pulls out the data, and adds it to indv_listing. When that is complete, all the data for that individual listing is added to listing_list, which holds the final info for all 20 apartment listings. We verify that with the following code:

listing_list 

The preceding code results in the following output:

Again, we appear to be getting the results we expect, so we will continue on. A check of the number of items in listing_list also confirms we have all 20 apartments on the page.

So far, we have successfully retrieved one page of data. While that is great, we are going to need far more apartments if we want to build any kind of meaningful model. To do this, we will need to iterate over a number of pages, which means we'll need to request the appropriate URLs. At the bottom of the listings, there is a button that says Next. If you right-click on that button and click Copy Link Address, you'll see it looks like the following URL: https://www.renthop.com/search/nyc?max_price=50000&min_price=0&page=2&sort=hopscore&q=&search=0.
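The only part of that URL that needs to change between requests is the page parameter. A minimal sketch of building these URLs for successive pages follows; the parameter names are taken directly from the link above, while build_page_url is a hypothetical helper, not part of RentHop's API:

```python
from urllib.parse import urlencode

def build_page_url(page):
    # Rebuild the query string seen in the Next button's link,
    # swapping in an arbitrary page number
    base = 'https://www.renthop.com/search/nyc'
    params = {'max_price': 50000, 'min_price': 0, 'page': page,
              'sort': 'hopscore', 'q': '', 'search': 0}
    return base + '?' + urlencode(params)

print(build_page_url(2))
# → https://www.renthop.com/search/nyc?max_price=50000&min_price=0&page=2&sort=hopscore&q=&search=0
```

Each page URL could then be fetched in a loop, reusing the parsing code above on every page's listing_divs.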
