Along with Social Security numbers, there are almost infinite ways to write dates. In order to compare them properly, or use them in your own systems, you need to put them into a single format. This recipe handles commonly seen date formats in the United States, and dates with missing information.
Import Pandas and create a new DataFrame to work with:
import pandas as pd lc = pd.DataFrame({ 'people' : ["cole o'brien", "lise heidenreich", "zilpha skiles", "damion wisozk"], 'age' : [24, 35, 46, 57], 'ssn': ['6439', '689 24 9939', '306-05-2792', '992245832'], 'birth_date': ['2/15/54', '05/07/1958', '19XX-10-23', '01/26/0056'], 'customer_loyalty_level' : ['not at all', 'moderate', 'moderate', 'highly loyal']})
from time import strftime from datetime import datetime def standardize_date(the_date): """ Standardizes a date :param the_date: the date to standardize :return formatted_date """ # Convert what we have to a string, just in case the_date = str(the_date) # Handle missing dates, however pandas should have filled this in as missing if not the_date or the_date.lower() == "missing" or the_date == "nan": formatted_date = "MISSING" # Handle dates that end with 'XXXX', start with 'XX', or are less than 1900 if the_date.lower().find('x') != -1: formatted_date = "Incomplete" # Handle dates that start with something like "0056" if the_date[0:2] == "00": formatted_date = the_date.replace("00", "19") # 03/03/15 try: formatted_date = str(datetime.strptime(the_date, '%m/%d/%y').strftime('%m/%d/%y')) except: pass # 03/03/2015 try: formatted_date = str(datetime.strptime(the_date, '%m/%d/%Y').strftime('%m/%d/%y')) except: pass # 0000-03-03 try: if int(the_date[0:4]) < 1900: formatted_date = "Incomplete" else: formatted_date = str(datetime.strptime(the_date, '%Y-%m-%d').strftime('%m/%d/%y')) except: pass return formatted_date # Apply the function to the DataFrame lc.birth_date = lc.birth_date.apply(standardize_date)
With our DataFrame created with a dataset, we import the additional Python libraries that we'll need to standardize dates—time and datetime. Next we define the standardize_date()
function, which takes a date as an argument.
The first thing we do is convert the date to a plain string. The reason is that we can handle partial dates and dates containing one or more Xs.
the_date = str(the_date)
Next we check to see if the date passed in is a missing value:
if not the_date or the_date.lower() == "missing" or the_date == "nan": formatted_date = "MISSING"
After that, we check if the date contains an 'x'
. If so, we return "Incomplete"
:
if the_date.lower().find('x') != -1: formatted_date = "Incomplete"
The next check we perform is for dates that start with 00
. I include this check as I've seen a number of dates get backfilled with 0s.
if the_date[0:2] == "00": formatted_date = the_date.replace("00", "19")
After those checks are complete, we begin to try and convert our date to our standard format. First we handle the dates in the format 01/01/15:
try: formatted_date = str(datetime.strptime(the_date, '%m/%d/%y').strftime('%m/%d/%y')) except: pass
Next we handle dates with a four-digit year:
try: formatted_date = str(datetime.strptime(the_date, '%m/%d/%Y').strftime('%m/%d/%y')) except: pass
Our last and final attempt at converting the date is to handle dates with leading zeros in the format 0000-03-03:
try: if int(the_date[0:4]) < 1900: formatted_date = "Incomplete" else: formatted_date = str(datetime.strptime(the_date, '%Y-%m-%d').strftime('%m/%d/%y')) except: pass
After all those checks are complete, we return the formatted date:
return formatted_date
With our function complete, we apply it to the birth_date
column of our DataFrame:
lc.birth_date = lc.birth_date.apply(standardize_date)
The following is what the DataFrame looks like before standardization:
The following image shows what it looks like after standardization: