Importing online population data in the CSV format

Let's begin by looking at the steps to import an online CSV file as a pandas DataFrame. In this example, we are going to use the annual population summary published by the Department of Economic and Social Affairs, United Nations, in 2015. Projected population figures towards 2100 were also included in the dataset:

import numpy as np # Python scientific computing package
import pandas as pd # Python data analysis package

# URL for Annual Population by Age and Sex - Department of Economic
# and Social Affairs, United Nations
source = "https://github.com/PacktPublishing/Matplotlib-2.x-By-Example/blob/master/WPP2015_DB04_Population_Annual.zip"

# Pandas support both local or online files
data = pd.read_csv(source, header=0, compression='zip', encoding='latin_1')

# Show the first five rows of the DataFrame
data.head()

The expected output of the code is shown here:

LocID Location VarID Variant Time MidPeriod SexID Sex AgeGrp AgeGrpStart AgeGrpSpan Value

 

0 4 Afghanistan 2 Medium 1950 1950.5 1 Male 0-4 0 5 630.044
1 4 Afghanistan 2 Medium 1950 1950.5 1 Male 5-9 5 5 516.205
2 4 Afghanistan 2 Medium 1950 1950.5 1 Male 10-14 10 5 461.378
3 4 Afghanistan 2 Medium 1950 1950.5 1 Male 15-19 15 5 414.368
4 4 Afghanistan 2 Medium 1950 1950.5 1 Male 20-24 20 5 374.110

The pandas.read_csv class is extremely versatile, supporting column headers, custom delimiters, various compressed formats (for example, .gzip, .bz2, .zip, and .xz), different text encodings, and much more. Readers can consult the documentation page (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) for more information.

By calling the .head() function of the Pandas DataFrame object, we can quickly observe the first five rows of the data.

As we progress through this chapter, we are going to integrate this population dataset with other datasets in Quandl. However, Quandl uses three-letter country codes (ISO 3166 alpha-3) to denote geographical locations; therefore we need to reformat the location names accordingly.

The pycountry package is an excellent choice for conversion of country names according to ISO 3166 standards. Similarly, pycountry can be installed through PyPI:

pip install pycountry 

Continuing the previous code example, we are going to add a new country column to the dataframe:

from pycountry import countries


def get_alpha_3(location):
"""Convert full country name to three letter code (ISO 3166 alpha-3)

Args:
location: Full location name
Returns:
three letter code or None if not found"""

try:
return countries.get(name=location).alpha_3
except:
return None

# Add a new country column to the dataframe
population_df['country'] = population_df['Location'].apply(lambda x: get_alpha_3(x))
population_df.head()

The expected output of the code is shown here:

- LocID Location VarID Variant Time MidPeriod SexID Sex AgeGrp AgeGrpStart AgeGrpSpan Value country
0 4 Afghanistan 2 Medium 1950 1950.5 1 Male 0-4 0 5 630.044 AFG
1 4 Afghanistan 2 Medium 1950 1950.5 1 Male 5-9 5 5 516.205 AFG
2 4 Afghanistan 2 Medium 1950 1950.5 1 Male 10-14 10 5 461.378 AFG
3 4 Afghanistan 2 Medium 1950 1950.5 1 Male 15-19 15 5 414.368 AFG
4 4 Afghanistan 2 Medium 1950 1950.5 1 Male 20-24 20 5 374.110 AFG
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset