Parsing

To begin parsing, we'll use the BeautifulSoup library we mentioned earlier. Since we already imported it, we just need to pass the page source into BeautifulSoup. We do that with the following code:

soup = BeautifulSoup(browser.page_source, "html5lib") 

Notice that the browser object exposes a page_source attribute, which contains all the HTML we retrieved with our get request earlier. The second argument passed into BeautifulSoup is the parser it should use; here, we'll stick with html5lib.
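As an aside, BeautifulSoup supports several parsers, and html5lib must be installed separately (pip install html5lib). The following sketch shows the common alternatives; they can behave slightly differently on malformed HTML:

soup = BeautifulSoup(browser.page_source, "html5lib")       # most lenient, parses like a browser 
# soup = BeautifulSoup(browser.page_source, "html.parser")  # built into Python, no install needed 
# soup = BeautifulSoup(browser.page_source, "lxml")         # fastest, requires the lxml package 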

Now, once the content of the page has been passed to BeautifulSoup, we can start extracting the elements of interest. That's where the div elements with the info-container class come in: each one corresponds to a single destination city, so those are what we'll retrieve.

Let's retrieve them, but we'll just look at the first one:

cards = soup.select('div[class*=info-container]') 
 
cards[0] 

The output for the preceding code is shown as follows:

In the preceding code, we used the select method on our soup object. The select method allows us to use CSS selectors to reference the elements of interest. Here, we have specified that we want divs that have a class attribute that contains somewhere within the class name the string info-container. There is excellent documentation on BeautifulSoup that explains these CSS selectors and other methods, and is available at https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors.
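If you prefer BeautifulSoup's find_all interface, a regular expression match on the class attribute does roughly the same thing as the selector above; here is a minimal sketch:

import re 
 
# roughly equivalent to soup.select('div[class*=info-container]'): 
# match any div whose class contains the string info-container 
cards_alt = soup.find_all('div', class_=re.compile('info-container')) 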

Looking at the cards[0] output above, notice that buried deep within the markup is the name of the destination city (London) and the fare price ($440). Since we just want the data and not all the surrounding markup, we'll need to write code that iterates over each of the info-container divs and pulls out the city and the fare:

for card in cards: 
    print(card.select('h3')[0].text) 
    print(card.select('span[class*=price]')[0].text) 
    print('\n') 

The preceding code results in the following output:

Since it looks as if we were able to successfully retrieve the fares for each city, let's now move on to constructing a full scrape and parse for a large number of fares.
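Before we do, one practical note: the [0] indexing above will raise an IndexError if Google ever changes the markup and an element goes missing. A slightly more defensive version of the loop uses select_one, which returns None instead of raising; this is just a sketch of that pattern:

for card in cards: 
    city_el = card.select_one('h3') 
    fare_el = card.select_one('span[class*=price]') 
    if city_el and fare_el:  # skip cards that are missing either element 
        print(city_el.text) 
        print(fare_el.text) 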

We are now going to attempt to retrieve the lowest-cost, non-stop fares from NYC to Europe for a 26-week period. I'm using a start date of December 01, 2018, but obviously, if you are reading this after that date, make sure to adjust your dates accordingly.

The first thing we'll need is to bring in some additional imports. We do that in the following code:

from datetime import datetime, timedelta 
from time import sleep 
import numpy as np  # used below to randomize our pauses 

Next, we'll construct the remainder of our scraping code:

start_sat = '2018-12-01' 
end_sat = '2018-12-08' 
 
start_sat_date = datetime.strptime(start_sat, '%Y-%m-%d') 
end_sat_date = datetime.strptime(end_sat, '%Y-%m-%d') 
 
fare_dict = {} 
 
for i in range(26):     
    sat_start = str(start_sat_date).split()[0] 
    sat_end = str(end_sat_date).split()[0] 
     
    fare_dict.update({sat_start: {}}) 
     
    sats = ("https://www.google.com/flights/?f=0#f=0&flt=/m/02_286.r/m/02j9z." + 
            sat_start + "*r/m/02j9z./m/02_286." + 
            sat_end + ";c:USD;e:1;s:0*1;sd:1;t:e") 
     
    sleep(np.random.randint(3,7)) 
     
    browser.get(sats) 
     
    soup = BeautifulSoup(browser.page_source, "html5lib") 
     
    cards = soup.select('div[class*=info-container]') 
     
    for card in cards: 
        city = card.select('h3')[0].text 
        fare = card.select('span[class*=price]')[0].text 
        fare_dict[sat_start] = {**fare_dict[sat_start], **{city: fare}} 
         
    start_sat_date = start_sat_date + timedelta(days=7) 
    end_sat_date = end_sat_date + timedelta(days=7) 

That's a fair amount of code, so let's unpack what is going on line by line. The first two lines create the start and end date strings we'll use. The next two lines convert those strings into datetime objects, which we'll need later when we add a week to each using timedelta. The last line before the for loop creates the dictionary that will hold our parsed data.

The next line begins a for loop that will run for 26 iterations. Inside it, we convert our datetime objects back into strings so that we can pass them into the URL we'll call with our browser object. Notice also that on each iteration we add the start date as a new key in our fare dictionary, and then build our URL from the two date strings.

Next, we insert a random pause of a few seconds using numpy's random.randint function and Python's sleep function. This is simply to prevent us from appearing to be a bot and overtaxing the site.

We then retrieve the page with our browser object, pass it into BeautifulSoup for parsing, select the info-container divs, and then parse and update our fare dictionary. Finally, we add one week to our start and end dates so that the next iteration goes one week forward in time.
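To see the date arithmetic in isolation, here is a small self-contained sketch of how strptime and timedelta step through consecutive Saturdays:

from datetime import datetime, timedelta 
 
d = datetime.strptime('2018-12-01', '%Y-%m-%d') 
for _ in range(3): 
    print(str(d).split()[0])  # prints 2018-12-01, then 2018-12-08, then 2018-12-15 
    d += timedelta(days=7) 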

Now, let's look at the data in our fare dictionary:

fare_dict 

The preceding code results in the following output:

As you can see, we have a dictionary with the dates as the top-level keys and subdictionaries of city/fare pairings as the values.
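In other words, the structure looks roughly like the following (the fare values shown here are placeholders, not real output):

# {'2018-12-01': {'London': '$440', 'Berlin': '$525', ...}, 
#  '2018-12-08': {'London': '$452', ...}, 
#  ...} 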

Now, let's dive into one city to examine the data. We'll begin with Berlin:

city_key = 'Berlin' 
for key in fare_dict: 
    print(key, fare_dict[key][city_key]) 

The preceding code results in the following output:

One thing we notice right away is that we'll need to clean up the airfares before we can work with them: we need to remove the dollar sign and the commas and convert the results into integers. We do that in the following code:

city_dict = {} 
for k,v in fare_dict.items(): 
    city_dict.update({k:int(v[city_key].replace(',','').split('$')[1])}) 
 
city_dict 

The preceding code results in the following output:

Remember, the output shown in the preceding code is only for Berlin, as we are just examining one city at the moment.
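Note that the split('$')[1] approach assumes every fare string begins with a dollar sign. If you want something a bit more forgiving, a regular expression that strips everything except digits works just as well; the fare_to_int helper below is our own sketch, not part of the original code:

import re 
 
def fare_to_int(fare): 
    # strip everything except digits, for example '$1,240' -> 1240 
    return int(re.sub(r'[^\d]', '', fare)) 
 
city_dict = {k: fare_to_int(v[city_key]) for k, v in fare_dict.items()} 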

Now, let's plot that data:

prices = [int(x) for x in city_dict.values()] 
dates = city_dict.keys() 
 
fig,ax = plt.subplots(figsize=(10,6)) 
plt.scatter(dates, prices, color='black', s=50) 
ax.set_xticklabels(dates, rotation=-70); 

The preceding code generates the following output:

Notice that we have 26 consecutive weeks of data, in this case, for non-stop flights from NYC to Berlin leaving on a Saturday and returning the following Saturday. There appears to be a fair amount of variation in these fares. Just eyeballing the data, it looks as if there might be two outliers on the high end, toward the beginning and the end of the period.
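To put a rough number on that eyeballing, a quick z-score check flags the weeks that sit more than two standard deviations from the mean. This is just a sanity check, not the outlier detection system we'll build shortly:

import numpy as np 
 
p = np.array(prices) 
z = (p - p.mean()) / p.std()  # standard score for each week's fare 
for d, fare, score in zip(dates, prices, z): 
    if abs(score) > 2:  # a crude two-sigma threshold 
        print(d, fare, round(score, 2)) 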

Now, let's take a look at another city. To do this, we simply need to return to our code and change the city_key variable. We can then rerun the cells below it. We'll do that in the following code:

city_key = 'Milan' 
for key in fare_dict: 
    print(key, fare_dict[key][city_key]) 

This results in the following output:

As before, we'll need to remove the dollar sign and the commas and convert the fares into integers. We do that in the following code:

city_dict = {} 
for k,v in fare_dict.items(): 
    city_dict.update({k:int(v[city_key].replace(',','').split('$')[1])}) 
 
city_dict 

The preceding code results in the following output:

Now, let's plot that data:

prices = [int(x) for x in city_dict.values()] 
dates = city_dict.keys() 
 
fig,ax = plt.subplots(figsize=(10,6)) 
plt.scatter(dates, prices, color='black', s=50) 
ax.set_xticklabels(dates, rotation=-70); 

The preceding code results in the following output:

Here, we can see even wider variations, with fares ranging from under $600 to over $1,200. Those cheap fares on the left are exactly the type of fares we'd like to know about. We are going to want to create an outlier detection system that will tell us about these bargain fares. We'll move on and discuss that now.
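Before we do, one quick aside: since we repeated exactly the same cleanup-and-plot steps for both Berlin and Milan, you could fold them into a small convenience function for further exploration. The plot_city_fares helper below is our own sketch (it assumes matplotlib.pyplot has been imported as plt, as earlier in the chapter), not part of the original code:

def plot_city_fares(fare_dict, city_key): 
    # clean the fare strings, then scatter-plot the weekly series for one city 
    city_dict = {k: int(v[city_key].replace(',', '').split('$')[1]) 
                 for k, v in fare_dict.items()} 
    dates, prices = list(city_dict.keys()), list(city_dict.values()) 
    fig, ax = plt.subplots(figsize=(10, 6)) 
    plt.scatter(dates, prices, color='black', s=50) 
    ax.set_xticklabels(dates, rotation=-70) 
 
plot_city_fares(fare_dict, 'Milan') 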
