Standardizing dates in Pandas

Along with Social Security numbers, there are almost infinite ways to write dates. In order to compare them properly, or use them in your own systems, you need to put them into a single format. This recipe handles commonly seen date formats in the United States, and dates with missing information.

Getting ready

Import Pandas and create a new DataFrame to work with:

import pandas as pd
lc = pd.DataFrame({
'people' : ["cole o'brien", "lise heidenreich", "zilpha skiles", "damion wisozk"],
'age' : [24, 35, 46, 57],
'ssn': ['6439', '689 24 9939', '306-05-2792', '992245832'],
'birth_date': ['2/15/54', '05/07/1958', '19XX-10-23', '01/26/0056'],
'customer_loyalty_level' : ['not at all', 'moderate', 'moderate', 'highly loyal']})

How to do it…

from time import strftime
from datetime import datetime

def standardize_date(the_date):
    """
    Standardizes a date
    :param the_date: the date to standardize
    :return formatted_date
    """

    # Convert what we have to a string, just in case
    the_date = str(the_date)
    
    # Handle missing dates, however pandas should have filled this in as missing
    if not the_date or the_date.lower() == "missing" or the_date == "nan":
        formatted_date = "MISSING"

    # Handle dates that end with 'XXXX', start with 'XX', or are less than 1900
    if the_date.lower().find('x') != -1:
        formatted_date = "Incomplete"

    # Handle dates that start with something like "0056"
    if the_date[0:2] == "00":
        formatted_date = the_date.replace("00", "19")

    # 03/03/15
    try:
        formatted_date = str(datetime.strptime(the_date, '%m/%d/%y').strftime('%m/%d/%y'))
    except:
        pass
    
    # 03/03/2015
    try:
        formatted_date = str(datetime.strptime(the_date, '%m/%d/%Y').strftime('%m/%d/%y'))
    except:
        pass

    # 0000-03-03
    try:
        if int(the_date[0:4]) < 1900:
            formatted_date = "Incomplete"
        else:
            formatted_date = str(datetime.strptime(the_date, '%Y-%m-%d').strftime('%m/%d/%y'))
    except:
        pass

    return formatted_date

# Apply the function to the DataFrame
lc.birth_date = lc.birth_date.apply(standardize_date)

How it works…

With our DataFrame created with a dataset, we import the additional Python libraries that we'll need to standardize dates—time and datetime. Next we define the standardize_date() function, which takes a date as an argument.

The first thing we do is convert the date to a plain string. The reason is that we can handle partial dates and dates containing one or more Xs.

the_date = str(the_date)

Next we check to see if the date passed in is a missing value:

if not the_date or the_date.lower() == "missing" or the_date == "nan":
        formatted_date = "MISSING"

After that, we check if the date contains an 'x'. If so, we return "Incomplete":

if the_date.lower().find('x') != -1:
        formatted_date = "Incomplete"

The next check we perform is for dates that start with 00. I include this check as I've seen a number of dates get backfilled with 0s.

if the_date[0:2] == "00":
        formatted_date = the_date.replace("00", "19")

After those checks are complete, we begin to try and convert our date to our standard format. First we handle the dates in the format 01/01/15:

try:
formatted_date = str(datetime.strptime(the_date, '%m/%d/%y').strftime('%m/%d/%y'))
except:
pass

Next we handle dates with a four-digit year:

try:
        formatted_date = str(datetime.strptime(the_date, '%m/%d/%Y').strftime('%m/%d/%y'))
except:
        pass

Our last and final attempt at converting the date is to handle dates with leading zeros in the format 0000-03-03:

try:
if int(the_date[0:4]) < 1900:
formatted_date = "Incomplete"
else:
            formatted_date = str(datetime.strptime(the_date, '%Y-%m-%d').strftime('%m/%d/%y'))
except:
pass

After all those checks are complete, we return the formatted date:

return formatted_date

With our function complete, we apply it to the birth_date column of our DataFrame:

lc.birth_date = lc.birth_date.apply(standardize_date)

The following is what the DataFrame looks like before standardization:

How it works…

The following image shows what it looks like after standardization:

How it works…
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset