When your data contains text categories, you can often speed up processing considerably by using Pandas categoricals. Categoricals encode each distinct text value as an integer code, which lets Pandas take full advantage of its fast C code. Typical candidates for categoricals are stock symbols, gender, experiment outcomes, states, and, in this case, a customer loyalty level.
Import Pandas, and create a new DataFrame to work with.
import pandas as pd
import numpy as np

lc = pd.DataFrame({
    'people': ["cole o'brien", "lise heidenreich", "zilpha skiles", "damion wisozk"],
    'age': [24, 35, 46, 57],
    'ssn': ['6439', '689 24 9939', '306-05-2792', '992245832'],
    'birth_date': ['2/15/54', '05/07/1958', '19XX-10-23', '01/26/0056'],
    'customer_loyalty_level': ['not at all', 'moderate', 'moderate', 'highly loyal']})
First, convert the customer_loyalty_level column to a category type column:
lc.customer_loyalty_level = lc.customer_loyalty_level.astype('category')
Next, print out the column:
lc.customer_loyalty_level
After creating the DataFrame, we use a single line of code to convert the customer_loyalty_level column to a categorical. When you print out the DataFrame, you still see the original text. So how do you know whether the conversion worked? Print out the dtypes (data types), which show the type of data in each column.
The following are the dtypes in the original DataFrame:
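As a minimal sketch of this check, the snippet below rebuilds the DataFrame from the earlier step so that it runs on its own, then prints the dtypes before any conversion. The variable name original_dtypes is introduced here for illustration only. The exact label shown for the text columns depends on the installed Pandas version, so no output is reproduced here.

```python
import pandas as pd

# Same data as the DataFrame created earlier in this recipe.
lc = pd.DataFrame({
    'people': ["cole o'brien", "lise heidenreich", "zilpha skiles", "damion wisozk"],
    'age': [24, 35, 46, 57],
    'ssn': ['6439', '689 24 9939', '306-05-2792', '992245832'],
    'birth_date': ['2/15/54', '05/07/1958', '19XX-10-23', '01/26/0056'],
    'customer_loyalty_level': ['not at all', 'moderate', 'moderate', 'highly loyal']})

# original_dtypes is an illustrative name, not part of the original recipe.
# Before conversion, customer_loyalty_level is an ordinary text column.
original_dtypes = lc.dtypes
print(original_dtypes)
```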
And the following are the dtypes after we convert the customer_loyalty_level column:
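The same check after the conversion might look like the sketch below. It repeats the DataFrame setup so that it is self-contained, and converted_dtypes is an illustrative name. Only customer_loyalty_level should now report the category dtype; the other columns are unchanged.

```python
import pandas as pd

# Same data as the DataFrame created earlier in this recipe.
lc = pd.DataFrame({
    'people': ["cole o'brien", "lise heidenreich", "zilpha skiles", "damion wisozk"],
    'age': [24, 35, 46, 57],
    'ssn': ['6439', '689 24 9939', '306-05-2792', '992245832'],
    'birth_date': ['2/15/54', '05/07/1958', '19XX-10-23', '01/26/0056'],
    'customer_loyalty_level': ['not at all', 'moderate', 'moderate', 'highly loyal']})

# Convert the text column to a categorical, as in the recipe.
lc['customer_loyalty_level'] = lc['customer_loyalty_level'].astype('category')

# converted_dtypes is an illustrative name; the converted column
# is now reported with the "category" dtype.
converted_dtypes = lc.dtypes
print(converted_dtypes)
```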
We can also print out the column to see how Pandas converted the text:
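One way to look at the encoding is through the column's .cat accessor, sketched below on a standalone Series with the same values (the name loyalty is illustrative). By default, astype('category') sorts the distinct values to form the categories, and each row stores the integer position of its category.

```python
import pandas as pd

# A standalone copy of the loyalty column, converted to a categorical.
loyalty = pd.Series(
    ['not at all', 'moderate', 'moderate', 'highly loyal'],
    name='customer_loyalty_level',
).astype('category')

# Printing the column shows the original text plus the list of categories.
print(loyalty)

# The distinct values, sorted: 'highly loyal', 'moderate', 'not at all'.
print(loyalty.cat.categories)

# The integer code stored for each row: 2, 1, 1, 0.
print(loyalty.cat.codes)
```

Note that these categories are unordered, so the codes reflect alphabetical position, not a ranking of loyalty.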
Finally, we can use the describe() method to get more details on the column:
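A short sketch of that call, again on a standalone Series with the same values (loyalty and summary are illustrative names). For a categorical column, describe() reports the number of values, the number of distinct categories in use, the most frequent category, and how often it occurs.

```python
import pandas as pd

# A standalone copy of the loyalty column, converted to a categorical.
loyalty = pd.Series(
    ['not at all', 'moderate', 'moderate', 'highly loyal'],
    name='customer_loyalty_level',
).astype('category')

# For categoricals, describe() returns count, unique, top, and freq.
summary = loyalty.describe()
print(summary)
```

With this data, the summary has a count of 4, 3 unique values, and 'moderate' as the top value with a frequency of 2.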