Let's begin by looking at the steps to import an online CSV file as a pandas DataFrame. In this example, we are going to use the annual population summary published by the Department of Economic and Social Affairs, United Nations, in 2015. Projected population figures towards 2100 were also included in the dataset:
import numpy as np # Python scientific computing package
import pandas as pd # Python data analysis package
# URL for Annual Population by Age and Sex - Department of Economic
# and Social Affairs, United Nations
source = "https://github.com/PacktPublishing/Matplotlib-2.x-By-Example/blob/master/WPP2015_DB04_Population_Annual.zip"
# Pandas support both local or online files
data = pd.read_csv(source, header=0, compression='zip', encoding='latin_1')
# Show the first five rows of the DataFrame
data.head()
The expected output of the code is shown here:
LocID | Location | VarID | Variant | Time | MidPeriod | SexID | Sex | AgeGrp | AgeGrpStart | AgeGrpSpan | Value |
|
0 | 4 | Afghanistan | 2 | Medium | 1950 | 1950.5 | 1 | Male | 0-4 | 0 | 5 | 630.044 |
1 | 4 | Afghanistan | 2 | Medium | 1950 | 1950.5 | 1 | Male | 5-9 | 5 | 5 | 516.205 |
2 | 4 | Afghanistan | 2 | Medium | 1950 | 1950.5 | 1 | Male | 10-14 | 10 | 5 | 461.378 |
3 | 4 | Afghanistan | 2 | Medium | 1950 | 1950.5 | 1 | Male | 15-19 | 15 | 5 | 414.368 |
4 | 4 | Afghanistan | 2 | Medium | 1950 | 1950.5 | 1 | Male | 20-24 | 20 | 5 | 374.110 |
The pandas.read_csv class is extremely versatile, supporting column headers, custom delimiters, various compressed formats (for example, .gzip, .bz2, .zip, and .xz), different text encodings, and much more. Readers can consult the documentation page (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) for more information.
By calling the .head() function of the Pandas DataFrame object, we can quickly observe the first five rows of the data.
As we progress through this chapter, we are going to integrate this population dataset with other datasets in Quandl. However, Quandl uses three-letter country codes (ISO 3166 alpha-3) to denote geographical locations; therefore we need to reformat the location names accordingly.
The pycountry package is an excellent choice for conversion of country names according to ISO 3166 standards. Similarly, pycountry can be installed through PyPI:
pip install pycountry
Continuing the previous code example, we are going to add a new country column to the dataframe:
from pycountry import countries
def get_alpha_3(location):
"""Convert full country name to three letter code (ISO 3166 alpha-3)
Args:
location: Full location name
Returns:
three letter code or None if not found"""
try:
return countries.get(name=location).alpha_3
except:
return None
# Add a new country column to the dataframe
population_df['country'] = population_df['Location'].apply(lambda x: get_alpha_3(x))
population_df.head()
The expected output of the code is shown here:
- | LocID | Location | VarID | Variant | Time | MidPeriod | SexID | Sex | AgeGrp | AgeGrpStart | AgeGrpSpan | Value | country |
0 | 4 | Afghanistan | 2 | Medium | 1950 | 1950.5 | 1 | Male | 0-4 | 0 | 5 | 630.044 | AFG |
1 | 4 | Afghanistan | 2 | Medium | 1950 | 1950.5 | 1 | Male | 5-9 | 5 | 5 | 516.205 | AFG |
2 | 4 | Afghanistan | 2 | Medium | 1950 | 1950.5 | 1 | Male | 10-14 | 10 | 5 | 461.378 | AFG |
3 | 4 | Afghanistan | 2 | Medium | 1950 | 1950.5 | 1 | Male | 15-19 | 15 | 5 | 414.368 | AFG |
4 | 4 | Afghanistan | 2 | Medium | 1950 | 1950.5 | 1 | Male | 20-24 | 20 | 5 | 374.110 | AFG |