Standardizing a Social Security number in Pandas

When you work with personally identifiable information (PII), whether in medicine, human resources, or any other industry, you'll receive that data in a variety of formats. A Social Security number, for example, may arrive with dashes, with spaces, or as a bare run of digits. In this recipe, you'll learn how to standardize the commonly seen formats.

Getting ready

Import Pandas, and create a new DataFrame to work with:

import pandas as pd

lc = pd.DataFrame({
    'people': ["cole o'brien", "lise heidenreich", "zilpha skiles", "damion wisozk"],
    'age': [24, 35, 46, 57],
    'ssn': ['6439', '689 24 9939', '306-05-2792', '992245832'],
    'birth_date': ['2/15/54', '05/07/1958', '19XX-10-23', '01/26/0056'],
    'customer_loyalty_level': ['not at all', 'moderate', 'moderate', 'highly loyal']})

How to do it…

def right(s, amount):
    """
    Returns a specified number of characters from a string starting on the right side
    :param s: string to extract the characters from
    :param amount: the number of characters to extract from the string
    """
    return s[-amount:]

def standardize_ssn(ssn):
    """
    Standardizes the SSN by removing dashes and whitespace, then left-padding with zeros to nine characters
    :param ssn: ssn to standardize
    :return: formatted_ssn
    """
    try:
        ssn = ssn.replace("-","")
        ssn = "".join(ssn.split())
        if len(ssn) < 9 and ssn != 'Missing':
            ssn = "000000000" + ssn
            ssn = right(ssn, 9)
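    except (AttributeError, TypeError):
        # Non-string values (for example NaN) cannot be cleaned; return them unchanged
        pass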

    return ssn

# Apply the function to the DataFrame
lc.ssn = lc.ssn.apply(standardize_ssn)

How it works…

The first thing we do is get our data into a DataFrame. Here we've created a new one containing a few records for people, along with their Social Security numbers in various formats. Next, we define a right() function, which returns a specified number of characters from a string, starting from the right side. Could we simply use s[-amount:] directly in our main function? Yes. However, having it as its own function means that we can reuse it elsewhere.
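As a quick illustration of the helper on its own, the following sketch pads one of the recipe's values and trims it with right(). The comparison with Python's built-in str.zfill and the amount=0 check are additions for illustration, not part of the recipe.

```python
def right(s, amount):
    """Return the last `amount` characters of `s`."""
    return s[-amount:]

# Pad with nine zeros, then keep the nine right-most characters
padded = "000000000" + "6439"
print(right(padded, 9))   # 000006439

# The built-in str.zfill gives the same result for a short string
print("6439".zfill(9))    # 000006439

# Caveat: s[-0:] is the same as s[0:], so amount=0 returns the whole string
print(right("abc", 0))    # abc
```

The last call is worth keeping in mind if you reuse right() elsewhere: an amount of zero does not produce an empty string.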

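With our right() function in place, we define our standardize_ssn() function, which takes an SSN as input, strips dashes and whitespace, and left-pads the result with zeros to nine characters. The padding is skipped when the cleaned value already has nine or more characters, or when it is the string Missing (a placeholder left behind by filling in missing values); in those cases the cleaned value is returned as is. Non-string values, which raise an exception inside the try block, are returned unchanged.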
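To see these edge cases in isolation, here is a compact restatement of the function with a few inputs that are not in the recipe's DataFrame. The test values are hypothetical, and this sketch assumes that catching AttributeError is enough for the non-string case.

```python
def standardize_ssn(ssn):
    """Compact restatement of the recipe's function, for experimenting in isolation."""
    try:
        ssn = "".join(ssn.replace("-", "").split())
        if len(ssn) < 9 and ssn != "Missing":
            ssn = ("000000000" + ssn)[-9:]
    except AttributeError:
        pass  # non-string input (e.g. None) has no .replace(); return it unchanged
    return ssn

print(standardize_ssn("Missing"))       # Missing (placeholder is not padded)
print(standardize_ssn("1234-5678-90"))  # 1234567890 (longer than nine, not truncated)
print(standardize_ssn(None))            # None (non-string passes through)
```

Note that a value longer than nine characters is cleaned but not shortened, so it may still need separate validation.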

We perform four operations on the SSN:

  1. Remove any dashes.
  2. Remove any whitespace.
  3. If the SSN is less than nine digits in length, zero-pad it by prepending nine zeros.
  4. Take the nine right-most digits of the padded value and return them.
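The steps above can be traced one at a time on the recipe's four sample values. This is only a sketch that spells out the intermediate variables; the variable names are chosen here for illustration.

```python
# The four ssn values from the recipe's DataFrame
samples = ['6439', '689 24 9939', '306-05-2792', '992245832']

results = []
for raw in samples:
    no_dashes = raw.replace("-", "")             # step 1: remove dashes
    no_whitespace = "".join(no_dashes.split())   # step 2: remove whitespace
    if len(no_whitespace) < 9:
        padded = "000000000" + no_whitespace     # step 3: prepend nine zeros
        final = padded[-9:]                      # step 4: keep the nine right-most characters
    else:
        final = no_whitespace                    # already nine or more characters
    results.append(final)
    print(f"{raw!r} -> {final!r}")
```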

Before we apply the new function, the ssn column of our DataFrame holds the raw values 6439, 689 24 9939, 306-05-2792, and 992245832.


Finally, we apply our new function to the ssn column of our DataFrame, which leaves the column holding 000006439, 689249939, 306052792, and 992245832.
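As an alternative to apply(), the same cleanup can be sketched with pandas' vectorized string methods. This is not the recipe's method: the helper name standardize_ssn_column is hypothetical, and the sketch assumes every entry in the column is a string (non-string entries in a mixed-type column would likely come back as missing values rather than unchanged).

```python
import pandas as pd

def standardize_ssn_column(ssn: pd.Series) -> pd.Series:
    """Vectorized sketch of the recipe's logic (hypothetical helper)."""
    cleaned = ssn.str.replace("-", "", regex=False)        # drop dashes
    cleaned = cleaned.str.replace(r"\s+", "", regex=True)  # drop all whitespace
    padded = cleaned.str.zfill(9)                          # left-pad with zeros to nine characters
    # Keep the 'Missing' placeholder unpadded, as the apply-based function does
    return padded.where(cleaned != "Missing", cleaned)

ssn = pd.Series(['6439', '689 24 9939', '306-05-2792', '992245832', 'Missing'])
result = standardize_ssn_column(ssn)
print(result.tolist())
# ['000006439', '689249939', '306052792', '992245832', 'Missing']
```

Like the original function, str.zfill leaves values that are already nine or more characters long untouched.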
