How to do it...

  1. Navigate to The American Presidency Project approval page for President Donald Trump (http://www.presidency.ucsb.edu/data/popularity.php?pres=45). You should get a page that contains a time series plot with the data in a table directly following it:
  2. The read_html function is able to scrape tables off web pages and place their data into DataFrames. It works best with simple HTML tables and provides some useful parameters to select the exact table you desire in case there happen to be multiple tables on the same page. Let's go ahead and use read_html with its default values, which will return all the tables as DataFrames in a list:
>>> base_url = 'http://www.presidency.ucsb.edu/data/popularity.php?pres={}'
>>> trump_url = base_url.format(45)
>>> df_list = pd.read_html(trump_url)
>>> len(df_list)
14
  3. The function has returned 14 tables, which seems preposterous at first, as the web page appears to show only a single element that most people would recognize as a table. The read_html function formally searches for HTML table elements that begin with <table. Let's inspect the HTML page by right-clicking on the approval data table and selecting inspect or inspect element:
  4. This opens up the console, which is a very powerful tool for web development. For this recipe, we will only need it for a few tasks. All consoles allow you to search the HTML for a specific word. Let's search for the word table. My browser found 15 different HTML tables, very close to the number returned by read_html:
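The console search can be mimicked in Python: counting occurrences of the opening <table tag in the raw page source approximates how many tables read_html will find. A minimal offline sketch on a made-up HTML snippet (the snippet and its tag count are illustrative, not taken from the actual page):

```python
# Hypothetical HTML snippet standing in for the downloaded page source
html = """
<html><body>
  <table class="layout"><tr><td>nav</td></tr></table>
  <table align="center">
    <tr><th>Start Date</th><th>Approving</th></tr>
    <tr><td>2017-01-22</td><td>45</td></tr>
  </table>
</body></html>
"""

# Each HTML table element begins with '<table'; counting those occurrences
# mirrors searching for the word "table" in the browser console
n_tables = html.count('<table')
print(n_tables)
```

Note that closing tags (</table>) do not match the search string, so the count reflects only opening table elements.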
  5. Let's begin inspecting the DataFrames in df_list:
>>> df0 = df_list[0]
>>> df0.shape
(308, 1794)

>>> df0.head(7)
  6. Looking back at the web page, there is a row in the approval table for nearly every day beginning January 22, 2017, through the day the data was scraped, September 25, 2017. This is a little more than eight months, or roughly 250 rows of data, which is somewhat close to the 308 rows in that first table. Scanning through the rest of the tables, you can see that many empty, meaningless tables were discovered, as well as tables for parts of the web page that don't actually resemble tables. Let's use some of the parameters of the read_html function to help us select the table we desire. We can use the match parameter to search for a specific string in the table. Let's search for a table containing the string Start Date:
>>> df_list = pd.read_html(trump_url, match='Start Date')
>>> len(df_list)
3
  7. By searching for a specific string in the table, we have reduced the number of tables down to just three. Another useful parameter is attrs, which accepts a dictionary of HTML attributes paired with their values. We would like to find some unique attributes for our particular table. To do this, let's right-click again on our data table. This time, make sure to click at the very top, in one of the table headers. For example, right-click on President, and select inspect or inspect element again:
  8. The element that you selected should be highlighted. This is actually not the element we are interested in. Keep looking until you come across an HTML tag beginning with <table. The words to the left of the equals signs are the attributes (attrs) and those to the right are their values. Let's use the align attribute with its value, center, in our search:
>>> df_list = pd.read_html(trump_url, match='Start Date',
                           attrs={'align':'center'})
>>> len(df_list)
1

>>> trump = df_list[0]
>>> trump.shape
(249, 19)

>>> trump.head(8)
  9. We matched only one table, and the number of rows is very close to the total number of days between the first and last dates. Looking at the data, it appears that we have indeed found the table we are looking for. The six column names appear to be on row four. We can go even further and precisely select the rows we want to skip, and the row we would like to use for the column names, with the skiprows and header parameters. We can also make sure that the start and end dates are parsed as the correct data type with the parse_dates parameter:
>>> df_list = pd.read_html(trump_url, match='Start Date',
                           attrs={'align':'center'},
                           header=0, skiprows=[0,1,2,3,5],
                           parse_dates=['Start Date', 'End Date'])
>>> trump = df_list[0]
>>> trump.head()
  10. This is almost exactly what we want, except for the columns with missing values. Let's use the dropna method to drop columns with all values missing:
>>> trump = trump.dropna(axis=1, how='all')
>>> trump.head()
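To see exactly what dropna(axis=1, how='all') does, here is a self-contained sketch on a small invented frame: only columns that are entirely missing are removed, while partially filled columns survive.

```python
import numpy as np
import pandas as pd

# Toy frame: column 'b' is entirely NaN, column 'c' is only partially missing
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [np.nan, np.nan, np.nan],
                   'c': [np.nan, 5, 6]})

# how='all' drops a column only when every value in it is missing
cleaned = df.dropna(axis=1, how='all')
print(list(cleaned.columns))   # ['a', 'c']
```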
  11. Let's fill the missing values in the President column in a forward direction with the ffill method. First, let's check whether there are any missing values in the other columns:
>>> trump.isnull().sum()
President         242
Start Date          0
End Date            0
Approving           0
Disapproving        0
unsure/no data      0
dtype: int64

>>> trump = trump.ffill()
>>> trump.head()
  12. Finally, it is important to check the data types to ensure they are correct:
>>> trump.dtypes
President                 object
Start Date        datetime64[ns]
End Date          datetime64[ns]
Approving                  int64
Disapproving               int64
unsure/no data             int64
dtype: object
  13. Let's build a function with all the steps combined into one to automate the process of retrieving approval data for any President:
>>> def get_pres_appr(pres_num):
        base_url = 'http://www.presidency.ucsb.edu/data/popularity.php?pres={}'
        pres_url = base_url.format(pres_num)
        df_list = pd.read_html(pres_url, match='Start Date',
                               attrs={'align':'center'},
                               header=0, skiprows=[0,1,2,3,5],
                               parse_dates=['Start Date', 'End Date'])
        pres = df_list[0].copy()
        pres = pres.dropna(axis=1, how='all')
        pres['President'] = pres['President'].ffill()
        return pres.sort_values('End Date') \
                   .reset_index(drop=True)
  14. The only parameter, pres_num, denotes the order number of each President. Barack Obama was the 44th President of the United States; pass 44 to the get_pres_appr function to retrieve his approval numbers:
>>> obama = get_pres_appr(44)
>>> obama.head()
  15. There is Presidential approval rating data dating back to 1941, during President Franklin Roosevelt's third term. With our custom function along with the concat function, it is possible to grab all the Presidential approval rating data from this site. For now, let's just grab the approval rating data for the last five Presidents and output the first three rows for each President:
>>> pres_41_45 = pd.concat([get_pres_appr(x) for x in range(41,46)],
                           ignore_index=True)
>>> pres_41_45.groupby('President').head(3)
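The effect of ignore_index=True in the concat call above can be illustrated offline with two tiny invented frames: each frame's original index is discarded and a fresh RangeIndex is built for the combined result.

```python
import pandas as pd

# Invented stand-ins for two Presidents' approval tables
df_a = pd.DataFrame({'President': ['A', 'A'], 'Approving': [55, 57]})
df_b = pd.DataFrame({'President': ['B', 'B'], 'Approving': [48, 50]})

# ignore_index=True renumbers the rows 0..n-1 in the combined frame
combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined.index.tolist())   # [0, 1, 2, 3]
```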
  16. Before continuing, let's determine whether there are any dates with multiple approval ratings:
>>> pres_41_45['End Date'].value_counts().head(8)
1990-08-26    2
1990-03-11    2
1999-02-09    2
2013-10-10    2
1990-08-12    2
1992-11-22    2
1990-05-22    2
1991-09-30    1
Name: End Date, dtype: int64
  17. Only a few of the days have duplicate values. To help simplify our analysis, let's keep only the first row where a duplicate date exists:
>>> pres_41_45 = pres_41_45.drop_duplicates(subset='End Date')
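drop_duplicates keeps the first occurrence by default (keep='first'), which is why the call above retains the first poll for each repeated date. A quick self-contained sketch with invented polling data:

```python
import pandas as pd

# Invented polls: two rows share the same End Date
polls = pd.DataFrame({'End Date': ['1990-08-26', '1990-08-26', '1990-08-27'],
                      'Approving': [60, 62, 61]})

# Default keep='first' retains only the first row for each duplicated date
deduped = polls.drop_duplicates(subset='End Date')
print(deduped['Approving'].tolist())   # [60, 61]
```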
  18. Let's get a few summary statistics on the data:
>>> pres_41_45.shape
(3679, 6)

>>> pres_41_45['President'].value_counts()
Barack Obama          2786
George W. Bush         270
Donald J. Trump        243
William J. Clinton     227
George Bush            153
Name: President, dtype: int64

>>> pres_41_45.groupby('President', sort=False) \
              .median().round(1)
  19. Let's plot each President's approval rating on the same chart. To do this, we will group by each President, iterate through each group, and individually plot the approval rating for each date:
>>> from matplotlib import cm
>>> fig, ax = plt.subplots(figsize=(16,6))

>>> styles = ['-.', '-', ':', '-', ':']
>>> colors = [.9, .3, .7, .3, .9]
>>> groups = pres_41_45.groupby('President', sort=False)

>>> for style, color, (pres, df) in zip(styles, colors, groups):
        df.plot('End Date', 'Approving', ax=ax,
                label=pres, style=style, color=cm.Greys(color),
                title='Presidential Approval Rating')
  20. This chart places all the Presidents sequentially, one after the other. We can compare them on a simpler scale by plotting their approval rating against the number of days in office. Let's create a new variable to represent the number of days in office:
>>> days_func = lambda x: x - x.iloc[0]
>>> pres_41_45['Days in Office'] = pres_41_45.groupby('President') \
                                             ['End Date'] \
                                             .transform(days_func)
>>> pres_41_45.groupby('President').head(3)
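The key to this step is that transform returns a result aligned row-for-row with the original DataFrame, so subtracting each group's first date yields a per-row elapsed time. A self-contained sketch with invented dates:

```python
import pandas as pd

# Invented data: two Presidents, two poll dates each
df = pd.DataFrame({'President': ['A', 'A', 'B', 'B'],
                   'End Date': pd.to_datetime(['2001-01-20', '2001-02-19',
                                               '2009-01-20', '2009-03-01'])})

# Subtract the first date within each group; transform keeps the
# result aligned with the original rows
days_func = lambda x: x - x.iloc[0]
df['Days in Office'] = df.groupby('President')['End Date'].transform(days_func)
print(df['Days in Office'].tolist())
```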
  21. We have successfully given each row a relative number of days since the start of the presidency. Interestingly, the new column, Days in Office, appears to have a string representation of its values. Let's check its data type:
>>> pres_41_45.dtypes
...
Days in Office    timedelta64[ns]
dtype: object
  22. The Days in Office column is a timedelta64 object with nanosecond precision. This is far more precision than is needed. Let's change the data type to integer by getting just the days:
>>> pres_41_45['Days in Office'] = pres_41_45['Days in Office'].dt.days
>>> pres_41_45['Days in Office'].head()
0     0
1    32
2    35
3    43
4    46
Name: Days in Office, dtype: int64
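The .dt accessor exposes the components of a timedelta Series; .dt.days returns just the whole-day component as an integer, discarding any sub-day precision. A quick made-up example:

```python
import pandas as pd

# Invented timedeltas, one of which carries a sub-day component
td = pd.Series(pd.to_timedelta(['0 days', '32 days 06:00:00', '35 days']))

# .dt.days keeps only the integer day count; the 06:00:00 is discarded
print(td.dt.days.tolist())   # [0, 32, 35]
```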
  23. We could plot this data in a similar fashion to what we did in step 19, but there is a completely different method that doesn't involve any looping. By default, when calling the plot method on a DataFrame, pandas attempts to plot each column of data as a line plot and uses the index as the x-axis. Knowing this, let's pivot our data so that each President has his own column for approval rating:
>>> pres_pivot = pres_41_45.pivot(index='Days in Office',
                                  columns='President',
                                  values='Approving')
>>> pres_pivot.head()
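pivot reshapes long data into wide data: the index values become row labels, each distinct President becomes its own column, and the Approving values fill the cells. A minimal invented example:

```python
import pandas as pd

# Invented long-format data: two Presidents measured at days 0 and 7
long_df = pd.DataFrame({'Days in Office': [0, 7, 0, 7],
                        'President': ['A', 'A', 'B', 'B'],
                        'Approving': [55, 57, 48, 50]})

# One row per index value, one column per President
wide = long_df.pivot(index='Days in Office',
                     columns='President',
                     values='Approving')
print(wide.loc[7, 'B'])   # 50
```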
  24. Now that each President has his own column of approval ratings, we can plot each column directly without grouping. To reduce the clutter in the plot, we will only plot Barack Obama and Donald J. Trump:
>>> plot_kwargs = dict(figsize=(16,6), color=cm.gray([.3, .7]),
                       style=['-', '--'], title='Approval Rating')
>>> pres_pivot.loc[:250, ['Donald J. Trump', 'Barack Obama']] \
              .ffill().plot(**plot_kwargs)