In this chapter we introduce a new library called pandas, which offers the DataFrame data structure. The pandas library evolved from the use of arrays in Python’s NumPy package for mathematical and scientific computing applications. The pandas library opens a new world of possibilities for data analysis. Its main structures are the Series and the DataFrame, which organize dissimilar types of data (strings, integers, and floats) into a single data structure and make it easy to apply methods or functions to all or portions of the data.
Provides two array-like data structures, Series and DataFrame
Easy detection and handling of missing data
Easy slicing and subsetting operations
Merging and joining of multiple DataFrames
Automatic index objects for both rows and columns with multi-level/hierarchical indexing
Date range capabilities for custom date frequencies used in time series analysis
Loaders for input/output accepting a wide range of tabular data sources
Wide range of built-in analysis methods
Moving window statistics for rolling averages, rolling standard deviations, and so on
pandas and SAS Nomenclature
| pandas | SAS |
|---|---|
| DataFrame | Dataset |
| Row | Observation |
| Column | Variable |
| GroupBy | BY-Group |
| NaN | . (period) |
| Slice | Subset |
Importing Libraries
We begin by examining the Series data structure followed by the detailed examination of the DataFrame. The Series is a one-dimensional array container for a collection of either numerics or strings (but not both together). In a Series the numerics must be the same data type. A DataFrame is a container for a Series. DataFrames may contain strings and numerics with different data types.
Column Types
Common pandas Types
| pandas Data Type | Python Data Type | Usage |
|---|---|---|
| object | str | Sequence of text |
| int64 | int | Integer value |
| float64 | float | Floating-point value |
| bool | bool | True/False |
| datetime64 | N/A | Date and time values |
| timedelta | N/A | Difference between two datetimes |
| category | N/A | Corresponds to statistical categorical variables, e.g., gender |
In general, we do not need to concern ourselves with types. Occasionally a program raises an error, perhaps as a result of an illegal operation using mismatched types. An example is summing numerical values from two different columns where, unbeknownst to you, both columns have a type of “object”. Rather than summing the values, the operation concatenates them, since the plus (+) symbol is used for string concatenation. Clearly not the results you are expecting. Later in the book, we encounter the DataFrame astype() method for converting types.
In the following examples, we encounter the types object, int64, and float64. In Chapter 7, “Date and Time,” we discuss the datetime64 and timedelta types. The pandas type object is similar to a SAS character variable used to manipulate string values. The int64 and float64 are pandas types for integers and floating-point numbers respectively. Both of these types are similar to a SAS numeric.
Series
From a SAS context, a Series can be thought of as a one-dimensional array with labels. A Series includes a default index used as a key to locate element values. Let’s begin by constructing a Series with ten randomly generated numeric elements. Listing 3-2 illustrates the construction of a Series by calling the numpy.random function randn(). The randn() function draws a sample from the “standard normal” distribution. In order to make the Series reproducible, the random number generator is seeded with an initial value of 54321.
In this example, s1 is the Series name containing ten numeric elements with a type of float64. Notice the assignment statement on the second line. pd.Series() is called using pd, the identifier assigned after the keyword as in the import statement in Listing 3-1.
Create Series of Random Values
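A minimal sketch of the construction described above (the name s1 and the seed 54321 follow the text; the exact values printed depend on the NumPy version):

```python
import numpy as np
import pandas as pd

np.random.seed(54321)                 # seed the generator for reproducibility
s1 = pd.Series(np.random.randn(10))  # ten draws from the standard normal
print(s1)
print(s1.dtype)                      # float64
```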
SAS Array of Random Values
Unlike most other programming languages, an array in the SAS language is not a data structure. Rather, an array in SAS is used as a convenience to iterate over groups of similar variables or values assigned to the array elements. While our analogy is imperfect, it remains useful for contrasting the two languages.
In Listing 3-3 we use a _NULL_ Data Step with a DO/END loop to generate ten random values from the normal random number generator. Similar to the preceding Python example, the call STREAMINIT function is used to supply a seed value to enable a reproducible stream of random numbers.
defines a one-dimensional array called s2 containing ten elements.
iterates over the array where i is the loop index variable calling the SAS rand("Normal") function and loads the generated values into the ten array elements.
Listing 3-4 illustrates the creation of a Series with a user-defined index. Elements from a Series can be returned by their index values as illustrated here.
Series Index Labels
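A short sketch of a Series with a user-defined index; the labels and values here are assumptions for illustration:

```python
import pandas as pd

# Series with user-defined index labels instead of the default 0..n-1
s2 = pd.Series([1.5, -0.3, 2.7, 0.8], index=["a", "b", "c", "d"])
print(s2["c"])   # return an element by its index label
```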
Returning Series Elements
SAS Array Indexed Retrieval
The DO/END block executes once based on the IF statement and writes the array element value to the SAS log.
Series Returning First Three Elements
The value to the left of the colon (:) separator is the start position for the Series’ index location, and values to the right identify the stop position for the element location. An empty value at the start position defaults to the first element (position 0), and an empty value for the stop value defaults to the last element (Series length –1).
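The start/stop behavior described above can be sketched as follows (the values are assumptions):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])
print(s[:3])   # empty start defaults to position 0; stop position 3 is excluded
print(s[2:])   # empty stop defaults to the end of the Series
```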
Array Returning First Three Elements
Series Mathematical Operation
In the preceding example, the Boolean < operator is used to return any element in the Series whose value is less than the arithmetic mean value for all the Series elements.
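A small sketch of this Boolean comparison (values are assumptions chosen so the mean is easy to verify):

```python
import pandas as pd

s = pd.Series([1.0, 4.0, 2.0, 8.0, 5.0])
print(s.mean())         # arithmetic mean: 4.0
print(s[s < s.mean()])  # elements whose value is below the mean
```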
DataFrames
You can think of a pandas DataFrame as a collection of Series organized into a relational-like structure with labels. There are a number of different constructor methods for creating DataFrames. DataFrames have a number of different readers to load values from a wide range of input sources including .csv files, DBMS tables, web scraping, REST APIs, and SAS datasets (.sas7bdat files), to name a few. These DataFrame readers are similar to the SAS/ACCESS product line.
The following is a basic introduction to the DataFrame reader read_csv() method. Chapter 6, “pandas Readers and Writers,” covers this topic in more detail. This example uses the publicly available UK Accidents Report Data from January 1, 2015, to December 31, 2015, available from
https://data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data
Listing 3-10 illustrates the syntax.
Note
Since the time of this writing, the data available at this web site has changed, and the remaining examples utilize a copy of this data at
https://raw.githubusercontent.com/RandyBetancourt/PythonForSASUsers/master/data/uk_accidents.csv
DataFrame read_csv Method, Example 1
Line 2 from this example defines the file_loc object to identify the path location to read the .csv file from the local filesystem. Note the double backslashes (\\) in the pathname used to escape the Windows path separators. Without the double backslashes, a syntax error is raised. Line 3 constructs the DataFrame df by calling the read_csv() method. Unlike SAS, the read operation is silent, with no status information returned after the read is completed unless an error is raised.
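The read can be sketched with an in-memory stand-in for the .csv file; the column names and values below are assumptions for illustration only:

```python
import io
import pandas as pd

# In-memory stand-in for the uk_accidents.csv file (hypothetical columns/values)
csv_data = io.StringIO(
    "Accident_Index,Date,Sex_of_Driver\n"
    "A1,01/01/2015,1\n"
    "A2,02/01/2015,2\n")
df = pd.read_csv(csv_data)   # silent on success, raises on error
print(df.shape)              # (rows, columns)
```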
DataFrame read_csv Method, Example 2
The remainder of the examples in this chapter are dependent on executing Listing 3-11.
The print() function displays the value from the DataFrame shape attribute to indicate the number of rows and columns in the DataFrame.
SAS PROC IMPORT
reads the .csv file downloaded by PROC HTTP.
DataFrame Validation
DataFrame Characteristics
In this example, the DataFrame df has 266,777 rows and 27 columns.
In order to understand how the data is physically organized, use the info() method to display information such as column names, column count, index structure, and so on.
Examining the DataFrame’s Organization
The first line of output identifies the DataFrame as belonging to the object class pandas.core.frame.DataFrame.
Next, information about the RangeIndex object is presented. The RangeIndex object is used as a row label. Like a SAS dataset, DataFrames can be subset by rows, by columns, or some combination of both. DataFrame row labels are identifiers for performing certain row-based subsetting operations similar to the behavior of the SAS automatic variable _N_. Chapter 5, “Data Management,” covers this topic in more detail.
Further, the DataFrame index can be altered dynamically. The DataFrame method set_index() is used to designate any column or collection of columns as the index after the DataFrame is created.
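A minimal sketch of set_index(); the column names are assumptions:

```python
import pandas as pd

# set_index() designates a column as the row index after construction
df = pd.DataFrame({"Accident_Index": ["A1", "A2"], "Casualties": [1, 2]})
df = df.set_index("Accident_Index")
print(df.index.name)            # Accident_Index
print(df.loc["A2", "Casualties"])  # rows are now located by the new labels
```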
The bulk of the output from the info() method displays the column attribute information. The last column from the info() method output displays the DataFrame column types. In this example, there are 25 numeric columns with type int64 and 2 columns with type object, similar to a character variable in a SAS dataset.
DataFrame dtype Attribute
This rather sparse output indicates the Date column in the df DataFrame is type object; in other words the column values are strings. Of course, in order to do any meaningful time-based analysis with the Date column, it must have datetime64 as its type.
In Listing 3-10 and Listing 3-11 examples, the read_csv() method uses a default type mappings when constructing the DataFrame from the .csv file. In Chapter 6, “pandas Readers and Writers,” we will illustrate in detail how to take advantage of date parsers in order to load datetime values stored as strings directly into DataFrame columns with a datetime64 type.
uses the parse_dates= list data structure to designate the .csv column named Date is read using the default date parser and results in a datetime64 DataFrame column type without the need for an intermediate conversion step.
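A sketch of the parse_dates= behavior using an in-memory stand-in (column names and values are assumptions):

```python
import io
import pandas as pd

csv_data = io.StringIO("Date,Casualties\n2015-01-01,1\n2015-01-02,2\n")
df = pd.read_csv(csv_data, parse_dates=["Date"])  # parse the Date column on read
print(df["Date"].dtype)   # datetime64[ns], no intermediate conversion needed
```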
DataFrame Inspection
displays the last 24 rows of the df DataFrame.
DataFrame tail() method
Notice the first column of output from the tail(10) function displays the RangeIndex values. As an analogy, SAS uses the FIRSTOBS= and OBS= dataset options, set to values of _N_, with most procedures as a means to determine which observations to include for processing.
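The tail() call can be sketched on a small stand-in DataFrame (the column is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"x": range(20)})
print(df.tail(10))   # last ten rows, labeled with their RangeIndex values
```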
PROC PRINT Firstobs =
Notice that the last row label for the DataFrame in Listing 3-16, DataFrame tail function is 266775, while the last observation number in the SAS output is 266776. Recall the default row index value for a DataFrame starts at 0, and the SAS automatic variable _N_ starts at 1.
DataFrame head() Function
DataFrame describe() Function
PROC MEANS
DataFrames have a variety of methods to slice and dice (subset) data. We will review these features in detail in Chapter 4, “Indexing and GroupBy.” For now we introduce the slicing operator [[ ]] to select a set of rows and/or columns from a DataFrame.
DataFrame Column Slicing
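A minimal sketch of the [[ ]] column-slicing operator; the column names are assumptions:

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2015-01-01"],
                   "Sex_of_Driver": [1],
                   "Speed_limit": [30]})
subset = df[["Date", "Sex_of_Driver"]]  # double brackets return a DataFrame
print(type(subset).__name__)            # DataFrame
```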
As one would expect, there are a large number of statistical and visualization techniques that can be applied to all or portions of a DataFrame to derive meaningful information and insights. One such example is the hist() method to render a histogram.
Histogram of Accident Rates by Gender
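A sketch of the hist() call, assuming the Sex_of_Driver codes described below (the data values are invented and matplotlib is assumed available):

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend for scripted runs
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical codes: 1 = male, 2 = female, 3 = not known
df = pd.DataFrame({"Sex_of_Driver": [1, 1, 1, 1, 2, 2, 3]})
ax = df["Sex_of_Driver"].hist(bins=3)   # render a histogram of the column
plt.savefig("sex_of_driver_hist.png")
```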
From the supplied metadata, we know the variable Sex_of_Driver value of 1 maps to males, 2 to females, 3 to not known and –1 to data missing or out of range.
From this simple histogram, we see males have an accident rate over twice that of females.
Missing Data
One of the more challenging aspects for data analysis is the treatment of missing data. Data may be missing for a wide range of reasons. It may be the data is not available or the data was not collected. Sometime values are recorded inaccurately. Whatever the reasons for encountering missing data, we need a good strategy for dealing with this challenge.
pandas uses two built-in sentinel values for handling missing data: the Python None object and NaN (not a number). The None object is often used to indicate missing values in Python code and serves as the missing value indicator for DataFrame columns with a type of object (character strings). In contrast, NaN is a special floating-point value used to indicate missing values for the float64 type. This means NaN is specifically for floating-point values, with no equivalent for strings, integers, or other pandas types.
However, this is generally not an issue. When missing data is introduced into an integer column, the column is upcast to type float64. pandas performs this conversion transparently for you if you assign the None object into a column of type int64.
Construct df1 DataFrame
Listing 3-23 creates the df1 DataFrame with two columns labeled Strings and Integers. The type for the Strings column is object and the type for the column Integers is int64.
The first print() function displays the df1 DataFrame. The second print() function returns the column types. Since all values for the Integers column are integer values, the type for the column is naturally int64. On row 2 the value for the Strings column is missing as indicated by the None object for its value.
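A sketch of the df1 construction described above; the column names follow the text and the values are assumptions, with None on row 2 of the Strings column:

```python
import pandas as pd

df1 = pd.DataFrame({"Strings": ["row 0", "row 1", None, "row 3"],
                    "Integers": [0, 1, 2, 3]})
print(df1)          # None displays as None in an object column
print(df1.dtypes)   # Strings: object, Integers: int64
```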
Update df1 DataFrame
In Listing 3-23 the type for the Integers column is int64. As a result of updating the value 9 in place with the Python object None, the type for this column is now float64. Also notice how the output from the print() function in Listing 3-24 displays the updated value as NaN, the missing value indicator for float64, even though the None object was assigned. This is a result of the column type being automatically upcast to accommodate missing values and of NaN being used as the missing data indicator for columns with type float64.
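The upcast can be sketched as follows (the value 9 follows the text; the other values are assumptions):

```python
import pandas as pd

df1 = pd.DataFrame({"Integers": [7, 8, 9]})
df1.loc[2, "Integers"] = None   # assigning None forces an upcast of the column
print(df1)                      # the missing value displays as NaN
print(df1["Integers"].dtype)    # float64
```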
pandas Missing Values
In this example, the Python None object is used to indicate NA’s or missing values for columns with type object (character data) and columns with the type float64 (numeric data). Even though we use the Python None object to indicate a missing value for columns Measure1–Measure4, the sentinel value NaN is used to represent these missing values. That is because columns Measure1–Measure4 have a type of float64.
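A small stand-in for the df2 DataFrame described above; the column names follow the text, the values are assumptions. Note that None placed in a float64 column is stored as NaN:

```python
import pandas as pd

df2 = pd.DataFrame({"Temp": ["cold", None, "hot"],
                    "Speed": ["slow", "fast", None],
                    "Measure1": [1.0, None, 3.0],
                    "Measure2": [None, 2.0, 4.0],
                    "Measure3": [5.0, None, 6.8],
                    "Measure4": [None, 7.0, 6.8]})
print(df2)          # None in the Measure columns displays as NaN
print(df2.dtypes)   # Temp/Speed: object, Measure1-Measure4: float64
```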
Missing Value Detection
pandas Missing Replacement Functions
| Function | Action Taken |
|---|---|
| isnull() | Generates a Boolean mask indicating missing values |
| notnull() | Opposite of isnull() |
| dropna() | Returns a filtered copy of the original DataFrame |
| fillna() | Returns a copy of the original DataFrame with missing values filled or imputed |
Each of these functions is examined in detail in the following text.
In SAS there are a number of different approaches to use for missing value detection and replacement. One approach with SAS is to write a Data Step to traverse along variables and use IF/THEN logic to test for and replace missing values reading the entire dataset observation by observation.
Count Missing Values Iteratively
In this example, a Python for loop iterates over the columns in the df2 DataFrame (created in Listing 3-25) and uses the sum() method combined with the isnull() method to return a count of missing values for each column. While this example returns the correct results, there is a more Pythonic approach for achieving the same results.
As an aside, if you find yourself devising programming algorithms in Python using iterative methods, stop and take time to do research. Chances are a method or function for the object, in this case, a DataFrame, already exists.
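The iterative approach described above can be sketched as follows (the df2 stand-in uses invented values):

```python
import pandas as pd

df2 = pd.DataFrame({"Measure1": [1.0, None, 3.0],
                    "Measure2": [None, None, 4.0]})
# Iterative approach: visit each column and count its missing values
for col in df2.columns:
    print(col, df2[col].isnull().sum())
```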
isnull Returning Missing Count
Only three lines of Python code in Listing 3-26 are used to iterate over the df2 columns and call the isnull() method combined with the sum() method. In contrast, Listing 3-27 requires only one line of code.
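The one-line form can be sketched as (same invented stand-in values):

```python
import pandas as pd

df2 = pd.DataFrame({"Measure1": [1.0, None, 3.0],
                    "Measure2": [None, None, 4.0]})
print(df2.isnull().sum())   # per-column missing counts in a single line
```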
PROC FORMAT to Identify Missing Values
The user-defined $missfmt. format bins character variable values into two buckets: those with a blank value (ASCII 32) to indicate a missing value labeled “Missing” and all other values labeled “Not Missing”. Similarly, the user-defined missfmt. format bins numeric variables into two buckets: those with a period (.) labeled “Missing” and all other values labeled “Not Missing”.
isnull() Method
isnull() Boolean Mask
The isnull() method returns a DataFrame of Boolean values with None objects and NaN’s returned as True while non-missing values are returned as False. Unlike SAS, empty or blank values for DataFrame columns with an object type are not considered missing.
The notnull() method is the inverse of the isnull() method. It produces a Boolean mask where null or missing values are returned as False and non-missing values are returned as True.
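Both masks can be sketched on a small stand-in (values are assumptions):

```python
import pandas as pd

df2 = pd.DataFrame({"Temp": ["cold", None], "Measure1": [1.0, None]})
print(df2.isnull())    # True marks None objects and NaN's
print(df2.notnull())   # the inverse: True marks non-missing values
```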
Row-wise operations on missing data propagate missing values.
For methods and functions, missing values are treated as zero.
If all data values are missing, the results from methods and functions will be 0.
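The three behaviors above can be sketched as:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
print(s + 1)        # row-wise arithmetic propagates the NaN
print(s.sum())      # 4.0: sum() treats the NaN as zero
print(pd.Series([np.nan, np.nan]).sum())   # 0.0 when all values are missing
```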
Addition Data with NaN’s
is a form of DataFrame slicing [[ ]], in this case, columns Measure3, Measure4, and Sum_M3_M4 are returned as a DataFrame, then passed to the print() method.
Also recall that variable names are case-sensitive in Python and case-insensitive in SAS.
Sum Function with NaN’s, Example 1
In this example, the sum() method is applied to the column df2['Sum_M3_M4'], returning the value 70.6. Strictly speaking, df2['Sum_M3_M4'] is a Series.
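The pattern can be sketched on a small stand-in; the values are assumptions, so the total here differs from the book's 70.6:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"Measure3": [5.0, np.nan, 6.8],
                    "Measure4": [np.nan, 7.0, 6.8]})
df2["Sum_M3_M4"] = df2["Measure3"] + df2["Measure4"]  # NaN propagates row-wise
print(df2[["Measure3", "Measure4", "Sum_M3_M4"]])
print(df2["Sum_M3_M4"].sum())   # sum() skips the NaN's
```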
SAS Sum Function with Missing
as the second variable in the SELECT statement propagates a missing value. For observation 3, measure3 is missing and measure4 has a value of 6.8 resulting in a missing value for the variable Sum_M3_M4 displayed as the first piece of output in Figure 3-6.
Sum Method with NaN’s, Example 2
The parameter skipna=False forces the sum() method to return a NaN if a value of NaN is encountered in the operation.
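A minimal sketch of the skipna= parameter (values are assumptions):

```python
import numpy as np
import pandas as pd

s = pd.Series([5.0, np.nan, 6.8])
print(s.sum())               # 11.8: NaN is skipped by default
print(s.sum(skipna=False))   # nan: any NaN forces a NaN result
```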
Sum Method with All Missing Values
SAS Sum Function with All Missing Values
illustrates overriding this default behavior and having the SUM function return zero (0) when all input values to the function are missing.
Dropping Missing Values
dropna() Method
Calling the dropna() method without any parameters drops those rows where one or more values are missing regardless of the column’s type. The default behavior for dropna() is to operate along rows or axis 0 and return a new DataFrame with the original DataFrame unmodified. In this example, DataFrame df3 is derived from DataFrame df2 and excludes dropped values from DataFrame df2. Alternatively, the default inplace=False parameter can be set to True in order to perform an in-place update to the original DataFrame.
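A sketch of the default behavior described above (the stand-in values are assumptions):

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"Temp": ["cold", None, "hot"],
                    "Measure1": [1.0, 2.0, np.nan]})
df3 = df2.dropna()   # new DataFrame; rows with any missing value are removed
print(df3)
```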
dropna() Update in Place
Unlike Listing 3-34, this example does not make an assignment to create a new DataFrame. Instead, the original DataFrame df4 is updated in place with the inplace=True parameter.
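The in-place form can be sketched as (values are assumptions):

```python
import pandas as pd

df4 = pd.DataFrame({"Temp": ["cold", None], "Measure1": [1.0, 2.0]})
df4.dropna(inplace=True)   # updates df4 itself; no new DataFrame is returned
print(df4)
```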
CMISS Delete Observation with Any Missing Values
The dropna() method also works along a column axis. DataFrames refer to rows as axis 0 and columns as axis 1. The default behavior for the dropna() method is to operate along axis 0 or rows.
dropna() Along Axis 1
The parameter axis = 'columns' is an alternative to axis = 1.
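Both spellings of the column-axis form can be sketched as (values are assumptions):

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"Temp": ["cold", "hot"],
                    "Measure1": [1.0, np.nan]})
print(df2.dropna(axis=1))           # drop any column containing a missing value
print(df2.dropna(axis="columns"))   # equivalent spelling
```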
Create an ODS output table using PROC FREQ NLEVELS option to identify those variables in the SAS dataset containing one or more missing values. Figure 3-8 displays the results of this step.
Execute the PROC SQL SELECT INTO : to create the Macro variable &drop containing the variable names with missing values separated by a blank (ASCII 32). The FROM clause identifies the output dataset created by the ODS Table Output created with PROC FREQ.
A SAS Data Step creates the output dataset df2, in this case, by reading the input dataset df using the SAS dataset option DROP followed by the Macro variable &drop containing the DROP list created in the PROC SQL step. Figure 3-9 displays the resulting SAS dataset.
Dropping SAS Variables with Missing Values
opens a SAS dataset called nlvs. PROC FREQ with the NLEVELS option executes to produce the desired output. By default, PROC FREQ produces a number of different pieces of tabular output. The statement ODS SELECT NLEVELS selects the NLEVELS object from PROC FREQ and outputs its values as a Series of observations and variables into the SAS dataset nlvs.
selects those variable names containing one or more missing values. The variable tablevar holds the names of the variables in the SAS dataset df.
drop_duplicates()
This example creates the DataFrame df6 by applying the drop_duplicates() method to the df2 DataFrame. In DataFrame df2 rows 5 and 6 have identical values.
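A sketch of drop_duplicates() on a stand-in frame whose last two rows are identical (values are assumptions):

```python
import pandas as pd

df2 = pd.DataFrame({"Temp": ["cold", "hot", "hot"],
                    "Measure1": [1.0, 2.0, 2.0]})
df6 = df2.drop_duplicates()   # keeps the first of each duplicate pair
print(df6)
```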
Drop Duplicate Records with SAS
The NODUPRECS option identifies observations with identical values for all variables and removes them from the output dataset.
dropna(thresh=4)
The example creates the DataFrame df7 by setting the thresh= parameter to 4. The thresh=4 parameter iterates through all rows and keeps each row that has at least four non-missing values. Row 3 is dropped since it contains only three non-missing values.
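The thresh= behavior can be sketched as follows; the columns and values are assumptions chosen so that one row has four non-missing values and the other has fewer:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"A": [1.0, np.nan], "B": [2.0, np.nan],
                    "C": [3.0, 5.0],   "D": [4.0, np.nan],
                    "E": [np.nan, 6.0]})
df7 = df2.dropna(thresh=4)   # keep rows with at least four non-missing values
print(df7)                   # only the first row survives
```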
Imputation
Replacing NaN’s with Zeros
In this example, the fillna() method is applied to the entire DataFrame. The None objects in columns Temp and Speed are replaced with the string zero (‘0’). The NaN’s in columns Measure1–Measure4 are replaced with the numeric value zero (0).
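A sketch of fillna() applied to the whole DataFrame (the stand-in values are assumptions):

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"Temp": ["cold", None],
                    "Measure1": [np.nan, 2.0]})
print(df2.fillna(0))   # every None and NaN is replaced with 0
```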
fillna() Method Over Column Subset
The fillna(0) method with a parameter value of zero (0) replaces NaN’s in columns Measure1–Measure4 with a numeric zero (0).
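The column-subset form can be sketched as (values are assumptions):

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"Temp": ["cold", None],
                    "Measure1": [np.nan, 2.0],
                    "Measure2": [3.0, np.nan]})
cols = ["Measure1", "Measure2"]
df2[cols] = df2[cols].fillna(0)   # fill only the measure columns
print(df2)                        # the Temp column still holds its None
```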
Listing 3-46 passes different replacement values for each column of a single DataFrame to handle missing values. The fillna() method accepts a Dictionary of values as a parameter.
fillna() Method Using a Dictionary
In the Temp column, the string 'cold' replaces the None object, and in the Speed column, the string 'slow' replaces the None object. Columns Measure1–Measure4 map the NaN’s to zero (0).
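The replacement values named in the text can be sketched with a Dictionary (the stand-in data values are assumptions):

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"Temp": [None, "hot"],
                    "Speed": [None, "fast"],
                    "Measure1": [np.nan, 2.0]})
# A Dictionary maps each column name to its own replacement value
filled = df2.fillna({"Temp": "cold", "Speed": "slow", "Measure1": 0})
print(filled)
```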
fillna() with Arithmetic Mean
In this example, NaN’s in columns Measure1–Measure3 are replaced with the arithmetic mean value derived from the column Measure4 which is 5.0.
Replace Missing Values with Arithmetic Mean
If any observation value in the variables Measure1–Measure3 equals missing, the observation value of the variable is assigned the value from the Macro variable &M4_mean which is 5.0.
Summary
In this chapter we introduced the pandas library and the role of the DataFrame, which is essential to using Python for data analysis tasks. We covered critical topics such as column types and how pandas handles missing values, along with methods for detecting and replacing missing values. With this foundation set, we can begin exploring new methods to expand the repertoire of data exploration and analysis.