Importing online population data in the CSV format

Let's begin by looking at the steps to import an online CSV file as a pandas DataFrame. In this example, we are going to use the annual population summary published by the Department of Economic and Social Affairs, United Nations, in 2015. Projected population figures towards 2100 were also included in the dataset:

import numpy as np # Python scientific computing package
import pandas as pd # Python data analysis package

# URL for Annual Population by Age and Sex - Department of Economic
# and Social Affairs, United Nations
source = "https://github.com/PacktPublishing/Matplotlib-2.x-By-Example/blob/master/WPP2015_DB04_Population_Annual.zip"

# Pandas support both local or online files 
data = pd.read_csv(source, header=0, compression='zip', encoding='latin_1') 

# Show the first five rows of the DataFrame
data.head()

The expected output of the code is shown here:

LocID	Location	VarID	Variant	Time	MidPeriod	SexID	Sex	AgeGrp	AgeGrpStart	AgeGrpSpan	Value
0	4	Afghanistan	2	Medium	1950	1950.5	1	Male	0-4	0	5	630.044
1	4	Afghanistan	2	Medium	1950	1950.5	1	Male	5-9	5	5	516.205
2	4	Afghanistan	2	Medium	1950	1950.5	1	Male	10-14	10	5	461.378
3	4	Afghanistan	2	Medium	1950	1950.5	1	Male	15-19	15	5	414.368
4	4	Afghanistan	2	Medium	1950	1950.5	1	Male	20-24	20	5	374.110

The pandas.read_csv class is extremely versatile, supporting column headers, custom delimiters, various compressed formats (for example, .gzip, .bz2, .zip, and .xz), different text encodings, and much more. Readers can consult the documentation page (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) for more information.

By calling the .head() function of the Pandas DataFrame object, we can quickly observe the first five rows of the data.

As we progress through this chapter, we are going to integrate this population dataset with other datasets in Quandl. However, Quandl uses three-letter country codes (ISO 3166 alpha-3) to denote geographical locations; therefore we need to reformat the location names accordingly.

The pycountry package is an excellent choice for conversion of country names according to ISO 3166 standards. Similarly, pycountry can be installed through PyPI:

pip install pycountry

Continuing the previous code example, we are going to add a new country column to the dataframe:

from pycountry import countries


def get_alpha_3(location):
    """Convert full country name to three letter code (ISO 3166 alpha-3)
 
    Args:
        location: Full location name
    Returns:
        three letter code or None if not found"""

    try:
        return countries.get(name=location).alpha_3
    except:
        return None
 
# Add a new country column to the dataframe
population_df['country'] = population_df['Location'].apply(lambda x: get_alpha_3(x))
population_df.head()

The expected output of the code is shown here:

-	LocID	Location	VarID	Variant	Time	MidPeriod	SexID	Sex	AgeGrp	AgeGrpStart	AgeGrpSpan	Value	country
0	4	Afghanistan	2	Medium	1950	1950.5	1	Male	0-4	0	5	630.044	AFG
1	4	Afghanistan	2	Medium	1950	1950.5	1	Male	5-9	5	5	516.205	AFG
2	4	Afghanistan	2	Medium	1950	1950.5	1	Male	10-14	10	5	461.378	AFG
3	4	Afghanistan	2	Medium	1950	1950.5	1	Male	15-19	15	5	414.368	AFG
4	4	Afghanistan	2	Medium	1950	1950.5	1	Male	20-24	20	5	374.110	AFG

Table of Contents for Importing online population data in the CSV format

Create new playlist

Sign In

Sign Up

Table of Contents for
Importing online population data in the CSV format