In this chapter we discuss methods for reading data from a range of input data sources such as comma-separated values (.csv) files, database tables, JSON, and other sources of input to create DataFrames. The pandas readers are a collection of input/output methods for writing and loading values into DataFrames. These input/output methods are analogous to the family of SAS/Access Software used by SAS to read data values into SAS datasets and write SAS datasets into target output formats.
The main pandas facilities for input and output to DataFrames are the readers, such as read_csv(), read_json(), and read_sql_table(), among others, which we cover in detail. These readers use similar syntax: the first argument is the path or location of the target, which can be a fully qualified filename, a relative filename (relative to the current working directory executing Python), a database table, or a URL (including HTTPS, SFTP, and S3 access methods), and so on. In many cases, the readers' argument defaults are sufficient for read/write operations.
The reader methods have arguments to specify:
Input and output locations
Column and index location and names
Parsing rules for handling incoming data
Missing data rules
Datetime handling
Quoting rules
Compression and file formats
Error handling
We start by examining a series of use cases for reading .csv files, beginning with simple examples followed by additional examples combining various arguments to enable a more complex set of parsing rules.
In this example, the call to the read_csv() reader uses two arguments. The URL object holds the string value for the input location to be read followed by the skiprows= argument. As you can see from Figure 6-1, the column headers are on row 3.
Also observe how the ID values for column “A” are left aligned which is the default for character strings in Excel. Column “F” is missing its header label and there are several missing values.
Listing 6-1 displays the column types. These types are inferred from the input data read from the .csv file. The read_csv() reader in this example maps the ID column to an int64 type despite the fact that these values are stored as strings in the .csv file. In subsequent examples we will see how to explicitly control type mappings when reading values into DataFrames or unloading values from DataFrames.
Listing 6-1 also displays the type for the Date column as object, indicating these values are stored in the df1 DataFrame as strings.
The read_csv() reader provisions the na_values= argument to define how missing values are handled on input. The na_values= argument allows the definition of any value to represent missing for both character and numeric column types. This feature is illustrated in Listing 6-2.
creates the miss object as a Dictionary key:value pair where the key Amount is the DataFrame column name and the values blank ('') and NA designate both as missing values. The na_values= parameter uses this miss Dictionary to map values to missing when they are encountered in the Amount column.
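A minimal sketch of this pattern, using an inline io.StringIO buffer as a stand-in for the chapter's .csv file (the sample values here are hypothetical):

```python
import io

import pandas as pd

# Inline stand-in for the .csv file; the rows are hypothetical.
csv = io.StringIO(
"""ID,Amount
0042,23.99
3013,
7846,NA
""")

# Map blank ('') and the string NA to missing, but only for the
# Amount column; dtype= keeps the leading zeros in ID.
miss = {'Amount': ['', 'NA']}
df2 = pd.read_csv(csv, na_values=miss, dtype={'ID': str})
```

Both flagged values surface as NaN in the Amount column, while the ID strings are read unchanged.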
In contrast to Listing 6-1, the value for the Amount column on row 3 in the df1 DataFrame is blank
>>> df1[3:4]
ID Date Amount Quantity Status Unnamed: 5
3 3013 12-Feb-17 15.0 Pending Harrison
whereas in Listing 6-2, the value for the Amount column on row 3 in the df2 DataFrame is NaN. These values are displayed with the row-slicing operations
>>> df2[3:4]
ID Date Amount Quantity Status Unnamed: 5
3 3013 12-Feb-17 NaN 15.0 Pending Harrison
The read_csv() reader has two parameters to explicitly set column types: the dtype= and the converters= arguments. Both arguments accept Dictionary key:value pairs with the keys identifying the target column(s). For dtype=, the values are the types to apply; for converters=, the values are functions converting the read values into their corresponding column types. The dtype= argument allows you to specify how to treat incoming values, for example, either as strings or numeric types. The converters= argument allows you to call a conversion function which in turn maps the data onto the desired column type based on the function's logic, for example, parsing a string value to be read as a datetime. The converters= argument takes precedence over the dtype= argument in the event both are used together.
Listing 6-3 illustrates using the dtype= argument to map the ID column type to object, in other words a sequence of characters.
In this example, the ID column in the df3 DataFrame has a type of object compared to the ID column type in the df2 DataFrame which is int64.
The converters= argument permits a function to convert the incoming data with the function results determining the column’s type. In this example, regardless of whether the incoming values for the ID column are numeric or character types, they are stored in the target DataFrame as strings.
To further illustrate this point, consider Listing 6-4. Here we define the strip_sign() function to remove the dollar sign ($) from the incoming values for the Amount column. The converters= argument contains the Dictionary key:value pair with the key identifying the Amount column and the corresponding value naming the converter function, in this case, strip_sign.
The strip_sign() function returns a float or a NaN when the incoming value is missing. Because the strip_sign() function returns numeric values, the resulting type for the Amount column in the df4 DataFrame is float64.
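A sketch of the converter pattern described above, with a hypothetical strip_sign() implementation and inline data standing in for the chapter's .csv file:

```python
import io

import pandas as pd

def strip_sign(x):
    """Remove a leading dollar sign; return a float, or NaN if blank."""
    x = x.strip()
    if not x:
        return float('nan')
    return float(x.lstrip('$'))

# Inline stand-in for the .csv file; the rows are hypothetical.
csv = io.StringIO(
"""ID,Amount
0042,$23.99
3013,
7846,$129.00
""")

df4 = pd.read_csv(csv, converters={'Amount': strip_sign})
```

Because strip_sign() returns floats (or NaN), the Amount column arrives with a float64 type.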
Listing 6-5 illustrates mapping the ID column as a string and setting this column as the DataFrame index at the end of the read/parse operation. Chapter 4, “Indexing and GroupBy,” contains details for working effectively with DataFrame indexes.
With the ID column as the index, rows can be sliced with their labeled values:
df5.loc['0042']
Date 16-Oct-17
Amount $23.99
Quantity 123
Status Closed
Unnamed: 5 Jansen
Name: 0042, dtype: object
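The pattern behind this output can be sketched as follows, with hypothetical inline data standing in for the chapter's .csv file:

```python
import io

import pandas as pd

raw = io.StringIO(
"""ID,Amount
0042,$23.99
3013,$15.00
""")

# Read ID as a string so leading zeros survive, then promote it
# to the DataFrame index for label-based slicing with .loc[].
df5 = pd.read_csv(raw, dtype={'ID': str})
df5.set_index('ID', inplace=True)
```

df5.loc['0042'] then returns the row labeled '0042'.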
Date Handling in .csv Files
A common requirement for reading data input is preserving date and datetime values rather than treating the values as strings. This can be done via the converters= argument, but the read_csv() reader also provides specialized parameters for datetime handling. In most cases, the default datetime parser simply needs to know which column or list of columns composes the input date or datetime values. In cases where date or datetime values are non-standard, the parse_dates= argument accepts a defined function to handle custom date and datetime formatting instructions.
Listing 6-6 illustrates the parse_dates= argument for the read_csv() method.
The parse_dates= parameter accepts a Python list of column names, in this case, the Date column. In Listing 6-5, the Date column in the df5 DataFrame has the default object type as shown:
>>> print(df5['Date'].dtype)
object
In contrast, with this example, the Date column is now mapped as a datetime64 column type as a result of using the parse_dates= argument.
>>> print(df6['Date'].dtype)
datetime64[ns]
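A runnable sketch of this parse_dates= behavior, with hypothetical inline data:

```python
import io

import pandas as pd

raw = io.StringIO(
"""ID,Date
0042,16-Oct-17
3013,12-Feb-17
""")

# Listing the Date column in parse_dates= maps it to datetime64
# instead of leaving the values as object strings.
df6 = pd.read_csv(raw, parse_dates=['Date'])
```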
Chapter 7, “Date and Time,” covers the details on date and datetime types and how they are manipulated.
Listing 6-7 illustrates the ability to use custom labels for column headings.
The names= argument accepts the cols object to label the DataFrame columns. You must set the header= argument to None to inform the reader not to use the column labels found in the .csv file. Next, the usecols= argument specifies the input columns to be read. Remember, Python uses a 0-based index for start positions, so column A in the .csv file maps to column 0 in the DataFrame, column B in the .csv file maps to column 1 in the DataFrame, and so on.
Finally, notice that since we are not reading the column labels found in the .csv file, the skiprows= argument is set to 3 for this example which is where we find the first line of data values (line 4 in the .csv file). Since Python uses a 0-based index as the start position, row 1 in the .csv file maps to row 0 in the DataFrame, row 2 in the .csv file maps to row 1 in the DataFrame, and so on.
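These arguments can be combined as in the following sketch; the file contents and labels here are hypothetical:

```python
import io

import pandas as pd

# Two preamble lines and a header row occupy rows 1-3 of the file,
# so skiprows=3 starts reading at the first data row.
raw = io.StringIO(
"""First Quarter Sales
As of 01-May-2017
ID,Date,Amount
0042,16-Oct-17,$23.99
3013,12-Feb-17,
""")

cols = ['Sale_ID', 'Sale_Date', 'Sale_Amount']
df7 = pd.read_csv(raw, names=cols, header=None,
                  usecols=[0, 1, 2], skiprows=3)
```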
Read .xls Files
As you might expect, pandas provides the read_excel() reader for reading Microsoft Excel files (.xls and .xlsx). The parameters and arguments for the read_excel() reader are similar to those of the read_csv() reader.
and copy it into a location where you have read/write access on the local filesystem. In this example, we chose “C:\data\messy_input.xlsx”.
In this example, we use the parse_dates= parameter to define the Date column with the syntax
parse_dates={'Date' :['Month', 'Day', 'Year']}
The parse_dates= argument accepts a Dictionary key:value pair where the key is the name of the output datetime DataFrame column being defined, in this case, Date. The Dictionary value names the input columns that compose the Date column values. In this case, the Year, Month, and Day columns are parsed together to define the Date column.
We also illustrate keeping the Year, Month, and Day columns with the syntax
keep_date_col=True
The keep_date_col= argument determines the keep disposition for the columns used with the parse_dates= argument. In this case, Year, Month, and Day columns are kept on the df8 DataFrame.
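The same column combination can also be expressed after the read with pd.to_datetime(), which accepts a DataFrame of year/month/day components. A sketch with hypothetical values (read_csv() stands in for read_excel() so the example is self-contained):

```python
import io

import pandas as pd

raw = io.StringIO(
"""ID,Year,Month,Day
0042,2017,10,16
3013,2017,2,12
""")

df8 = pd.read_csv(raw, dtype={'ID': str})

# Assemble Year, Month, and Day into a single datetime column while
# keeping the original three columns, as keep_date_col=True does.
df8['Date'] = pd.to_datetime(df8[['Year', 'Month', 'Day']])
```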
It is not uncommon to read multiple Excel files together to form a single DataFrame. Listing 6-9 illustrates this use case.
Consider sales transactions for each month stored in separate Excel files, each with an identical layout. In this example, we introduce the glob module from the Python Standard Library. The glob module finds all the filenames matching a specified pattern based on the rules used by the Unix shell. This is useful when you need to search for filenames using string expressions and wildcard characters.
The three .xlsx files for this example are located at
... final = final.append(df, ignore_index=False, sort=False)
...
>>> print(final)
Amount Quantity
ID
1042 99.89 21
3311 59.99 12
9846 12.99 25
2222 19.19 115
8931 79.99 2
0044 199.89 10
8731 49.99 2
7846 129.00 45
1111 89.19 15
2231 99.99 1
0002 79.89 43
2811 19.99 19
8468 112.99 25
3333 129.99 11
9318 69.99 12
Listing 6-9
Appending Multiple .xls Files into a DataFrame
The input object is defined by calling the glob module to search the filenames using the pattern
'C:\data\*_2*.xlsx'
This call returns a Python list of filenames matching our pattern subsequently displayed by calling the first print() method. The files to append together are
In order to read these .xls files in a single operation, an empty DataFrame called final is defined with the syntax
final = pd.DataFrame()
Next, a for loop iterates over the Python list of our three input files returned by the glob module, calling the read_excel() reader to read the current .xls file into the df DataFrame. Finally, the pandas append() method is called on the final DataFrame, appending each df DataFrame to the end of the final DataFrame. Also notice that inside the for loop, the set_index() method is used to map the ID column as the DataFrame index.
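In pandas 2.0 and later the DataFrame.append() method has been removed; pd.concat() is its replacement. A sketch of the same loop, with three hypothetical monthly files written to a temporary directory (.csv stand-ins avoid the Excel engine dependency):

```python
import glob
import os
import tempfile

import pandas as pd

# Simulate the three monthly input files with hypothetical rows.
tmp = tempfile.mkdtemp()
for month in ('January', 'February', 'March'):
    pd.DataFrame({'ID': ['0042'], 'Amount': [23.99]}).to_csv(
        os.path.join(tmp, month + '_2018.csv'), index=False)

# Read each file, set ID as the index, and stack the results.
frames = []
for f in sorted(glob.glob(os.path.join(tmp, '*_2018.csv'))):
    df = pd.read_csv(f, dtype={'ID': str}).set_index('ID')
    frames.append(df)
final = pd.concat(frames)
```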
An alternative to retrieving a Python list of fully qualified pathnames for this type of iterative processing is retrieving input filenames relative to the current working directory executing your Python session. This involves importing the os module from the Python Standard Library to first determine the location for the Python’s current working directory followed by a call to change the current working directory to a new location. This operation is illustrated in Listing 6-10.
creates the wd object holding the value for the location of the current working directory executing the Python session, in this case
C:\Users\randy\Desktop\Py_Source
The syntax
os.chdir('C:\data')
calls the os.chdir() function and changes the working directory for the current Python session to
C:\data
The call to the glob module uses *.csv as an argument to define the csv_files object holding a Python list of all the filenames in the current directory having the filename extension .csv.
5 rows created in WORK.TEMP1 from c:\data\February_2018.csv.
5 rows created in WORK.TEMP2 from c:\data\January_2018.csv.
5 rows created in WORK.TEMP3 from c:\data\March_2018.csv.
94 data final;
95 set &imported;
96 format amount dollar8.2;
97 run;
NOTE: There were 5 observations read from the dataset WORK.TEMP1.
NOTE: There were 5 observations read from the dataset WORK.TEMP2.
NOTE: There were 5 observations read from the dataset WORK.TEMP3.
NOTE: The dataset WORK.FINAL has 15 observations and 3 variables.
Listing 6-11
Appending Multiple .csv Files to a Dataset
The _null_ Data Step reads the list of filenames from the FILENAME PIPE statement and assigns these fully qualified filename values to the file_name variable. The CALL SYMPUT statement creates the Macro variable &imported; on each iteration of the Data Step, it concatenates the values “temp1”, “temp2”, and so on. This Macro variable value becomes the list of input datasets used on line 95 to create the output dataset final.
The syntax
call execute('proc import datafile="'
|| strip(file_name)
|| '" out=temp'
|| strip(put(_n_,best.))
|| ' dbms=csv replace; run;'
);
builds the PROC IMPORT statement and corresponding options as a parameter to the CALL EXECUTE statement. The CALL EXECUTE statement executes the call to PROC IMPORT three times, one for each record read from the FILENAME PIPE statement.
Figure 6-3 uses PROC PRINT to display the contents of the final dataset.
Write .csv Files
There are cases when you need to export the contents of a DataFrame to a .csv file or an .xls file. In those cases, use the to_csv() writer and the to_excel() writer. The to_csv() writer is illustrated in Listing 6-12.
The writer does not return any information to the console to indicate the operation’s success. Figure 6-4 displays the first five rows of the output file, in this case, C:\data\final.csv.
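A minimal to_csv() sketch with hypothetical data, writing to an in-memory buffer rather than a file on disk:

```python
import io

import pandas as pd

final = pd.DataFrame({'ID': ['0042', '3013'],
                      'Amount': [23.99, 129.00]}).set_index('ID')

# to_csv() accepts a path or any file-like object; nothing is echoed
# to the console on success.
buf = io.StringIO()
final.to_csv(buf)
```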
Use PROC EXPORT to write the contents of a SAS dataset to a .csv file. This capability is illustrated in Listing 6-13.
4 filename out_csv "c:\data\final_ds.csv";
5 proc export data = final
6 outfile = out_csv
7 dbms = csv;
8 run;
NOTE: The file OUT_CSV is:
Filename=c:\data\final_ds.csv,
RECFM=V,LRECL=32767,File Size (bytes)=0,
Last Modified=12Nov2018:10:13:47,
Create Time=12Nov2018:10:13:47
NOTE: 16 records were written to the file OUT_CSV.
The minimum record length was 11.
The maximum record length was 18.
NOTE: There were 15 observations read from the dataset WORK.FINAL.
15 records created in OUT_CSV from FINAL.
NOTE: "OUT_CSV" file was successfully created.
Listing 6-13
Output SAS Dataset to .csv
The SAS log indicates the location and physical file characteristics of the output .csv along with the number of records successfully written.
Write .xls Files
You can write a DataFrame to an Excel sheet, or multiple DataFrames to a book of sheets, using the to_excel() writer. The arguments are largely the same as those of the to_csv() writer. This capability is illustrated in Listing 6-14.
The results from the to_excel() operation are displayed in Figure 6-5.
There are multiple ways to output a SAS dataset to .xls files. PROC EXPORT provides a convenient approach; however, it limits the format control you can exert on the appearance of the output .xls file. Alternatively, if you want much finer control over the output format in Excel, the ODS tagsets.ExcelXP destination can be used. Listing 6-15 illustrates calling tagsets.ExcelXP to write the contents of a SAS dataset to an Excel file.
4 ods tagsets.ExcelXP
5 file="c:\data\final_ds.xls"
6 style=statistical
7 options(frozen_headers='1'
8 embedded_titles='yes'
9 default_column_width='18');
NOTE: Writing TAGSETS.EXCELXP Body file: c:\data\final_ds.xls
NOTE: This is the Excel XP tagset (Compatible with SAS 9.1.3 and above, v1.131, 04/23/2015). Add
options(doc='help') to the ods statement for more information.
10
11 proc print data=final;
12 run;
NOTE: There were 15 observations read from the dataset WORK.FINAL.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.08 seconds
cpu time 0.01 seconds
13 ods tagsets.excelxp close;
14 run;
Listing 6-15
Output Final Dataset to .xls
Note that if you are using SAS Display Manager to generate this example, you may need to disable the “View results as they are generated” feature. In Display Manager, go to the Tools menu, select Options ➤ Preferences, and then select the Results tab. Uncheck the box labeled “View results as they are generated” as shown in Figure 6-6.
Read JSON
JSON stands for JavaScript Object Notation and is a well-defined structure for exchanging data among different applications. JSON is designed to be read by humans and easily parsed by programs. It is relied upon to transmit data through RESTful web services and APIs.
The flexibility of calling a REST API to acquire data is illustrated in Listing 6-16. This example creates the jobs DataFrame by calling the GitHub Jobs API over HTTPS using the read_json() reader to return a set of posted positions along with a detailed job description.
4 Norwegian Block Exchange AS Oslo, Norway or Remote
Listing 6-16
pandas Read JSON API
In this example, the read_json() reader makes a call over HTTPS to the GitHub Jobs API returning records where the position field contains the string ‘Python’. At the time this script was executed, there were 41 job positions posted. This is shown by the call to the shape attribute showing 41 rows and 11 columns in the jobs DataFrame.
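The GitHub Jobs API has since been retired, so this sketch parses an inline JSON document of a similar (hypothetical) shape; read_json() accepts URLs and file-like objects alike:

```python
import io

import pandas as pd

# Hypothetical records shaped loosely like the retired jobs API.
records = io.StringIO(
    '[{"id": 1, "title": "Python Developer", "location": "Oslo, Norway"},'
    ' {"id": 2, "title": "Data Engineer", "location": "Remote"}]')

jobs = pd.read_json(records)
```

The shape attribute then reports the number of rows and columns returned, as in the chapter's example.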
Similarly, SAS provides several different methods for reading JSON files. By far, the simplest method is to use the JSON LIBNAME access method, illustrated in Listing 6-17.
Along with a GET to return the records to the SAS session. The OUT=response statement stores incoming records in a temporary file associated with the response FILEREF.
3.
Copy the incoming JSON.ROOT and JSON.ALLDATA datasets to the WORK directory. Obviously, you can copy these datasets into a permanent SAS data library. The copy operation is needed since the JSON access method reads the incoming records only once.
4.
Continue the processing logic to manipulate the ROOT dataset created by the JSON LIBNAME engine. In our case we simply call PROC PRINT to display the first five observations from the dataset.
In order to make this particular example work, the SAS session must be executed with an encoding of UTF-8. By default, SAS sessions executing under Windows use WLATIN1 encoding which can lead to transcoding errors when calling the JSON LIBNAME engine to read UTF-8 formatted records. Use the following PROC OPTIONS statement to determine the encoding method used by the SAS session:
proc options option=encoding; run;
SAS (r) Proprietary Software Release 9.4 TS1M5
ENCODING=UTF-8 Specifies the default character-set encoding for the SAS session.
Note that the encoding option can only be changed at SAS initialization time. This is generally controlled by the sasv9.cfg configuration file. The default location on Windows is
The results of writing the df8 DataFrame to a JSON-formatted file is displayed in Figure 6-8.
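A to_json() writer sketch with hypothetical data; orient='records' emits one JSON object per row:

```python
import pandas as pd

df8 = pd.DataFrame({'ID': ['0042', '3013'],
                    'Amount': [23.99, 129.00]})

# Serialize the DataFrame to a JSON string; passing a path instead
# writes the same text to a file.
out = df8.to_json(orient='records')
```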
In order to write a SAS dataset to a Series of JSON records, use PROC JSON as illustrated in Listing 6-19. PROC JSON was introduced with Base SAS release 9.4.
4 proc json out='c:\data\sas_final.json' pretty
5 nosastags;
6 export final;
7 run;
Listing 6-19
Output a SAS Dataset to JSON
The first two JSON records are displayed in Figure 6-9.
Read RDBMS Tables
Common input sources of data for analysis are relational database tables. The pandas.io.sql module provides a set of query wrappers to enable data retrieval while minimizing dependencies on RDBMS-specific APIs. Put another way, the clever folks who brought you pandas also figured out they could avoid reinventing the wheel, so they utilize a library called SQLAlchemy as an abstraction layer over the various databases needing to be read. This approach reduces the amount of database-dependent code pandas needs internally to read and write data using ODBC-compliant engines, which is to say, nearly all databases.
By using the SQLAlchemy library to read RDBMS tables (and queries), you pass SQLAlchemy Expression language constructs which are database-agnostic to the target database. This is analogous to PROC SQL’s behavior of using general SAS SQL constructs which in turn are translated on your behalf for a specific database without you having to know the RDBMS SQL dialect.
In order to execute these Read RDBMS Table examples, you need three components to be set up. They are
1.
The SQLAlchemy library
2.
The appropriate RDBMS driver library, in our case, the pyodbc library
3.
The appropriately configured Windows ODBC client connecting to a target database
The following examples use Microsoft SQL Server 2017 for Windows as the target RDBMS for reading and writing. With a little bit of effort, you can convert these examples to execute on a different RDBMS running either locally or remotely.
The SQLAlchemy library is a part of the Anaconda distribution described in Chapter 1, “Why Python?”
To confirm a version of the SQLAlchemy library is available, start a Python session and enter
>>> import sqlalchemy
>>> sqlalchemy.__version__
'1.2.7'
Notice the two consecutive underscores before and after the string version.
The second part of the setup is the client-side library, which is the pyodbc library in our case. To confirm a version of the pyodbc library is available, start a Python session and enter
>>> import pyodbc
>>> pyodbc.version
'4.0.23'
Our setup has both Python 3.6 and SQL Server 2017 executing on the same Windows 10 machine. If the SQL Server instance is running remotely, you will need to make the appropriate adjustments.
The third and final step is to configure an ODBC DSN to connect to the target RDBMS. In order to set up the appropriate ODBC client interface on Windows 10, launch the ODBC Data Source Administrator by going to the Windows Start Menu to click
Start ➤ Windows Administrative Tools ➤ ODBC Data Sources (64-bit) to launch the ODBC Data Source Administrator shown in Figure 6-10.
If you are using Windows 7, navigate to Control Panel ➤ System and Security ➤ Administrative Tools and select Data Sources (ODBC). These examples illustrate the ODBC setup for Windows 10.
Make sure that the System DSN tab is selected and then click the Add… button to select the driver for the SQL Server Database. See Figure 6-11.
Press the Finish button which displays the Create a New Data Source to SQL Server dialog box presented in Figure 6-12.
For Name:, we choose AdventureWorksDW to indicate we are using the AdventureWorksDW2017 database, which SQL Server ships as an illustration for its Analysis Services. For Server:, supply the instance name of the SQL Server Database you want to connect to. In our case, the SQL Server instance is a local instance named RANDY-PC. Notice we use AdventureWorksDW as the ODBC data source name, which in turn connects to the SQL Server AdventureWorksDW2017 database.
For more information about the AdventureWorks2017 sample databases, see
Press the Next> dialog box and you are presented with the authentication dialog box shown in Figure 6-13.
In our case, we select Windows authentication using the network login ID. Press the Next> dialog box. This action presents the Change Default Database Dialog menu shown in Figure 6-14.
If your dialog box indicates “master” as the default database, then check the box labeled “Change the default database to:” and select AdventureWorksDW2017, assuming it is available and your account has been granted read access.
Check the Change the default database tick box and supply the database name AdventureWorksDW2017. This is illustrated in Figure 6-15.
Press Finish> to be presented with the “Change the language of SQL Server messages to:” tick box. In our case, we chose to log the ODBC driver statistics. This is illustrated in Figure 6-16.
Press Finish> to be presented with the ODBC Data Source Configuration panel shown in Figure 6-17.
Press Test Data Source... box to test the ODBC Connection. You should see the Test Results dialog box shown in Figure 6-18.
In order to connect to the SQL Server AdventureWorksDW2017 database with SQLAlchemy, we use the create_engine function to create an engine object from the database URI. Database-specific information for the create_engine function and URI formatting is at
You only need to create the engine once per database instance you are connecting to. The create_engine function containing the SQL Server Database URI is illustrated in Listing 6-20.
>>> import pyodbc
>>> from sqlalchemy import create_engine, MetaData, Table, select
where USER:PW is the userid and password pair and SERVER is the remote hostname or IP address running the SQL Server instance. Also notice the ODBC driver name could be different, depending on which driver you use. You may need to contact your database administrator for additional information.
Once the engine object is correctly defined, use the read_sql_table() method to read all or subsets of database artifacts such as tables and views. Consider Listing 6-21, which needs just two arguments: the target database table, in this case, the DimCustomer table, and the engine object defining the connection string to the target database instance.
The info attribute for the t0 DataFrame displays the names of all the columns read from the database table, the number of rows with non-null values, and the columns’ corresponding types. For example, the BirthDate column is mapped to a datetime64 type. This indicates the BirthDate column can be used in datetime arithmetic expressions without needing to cast it to datetime64 using the to_datetime() function.
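A runnable read_sql_table() sketch; since a SQL Server instance is not assumed available here, an in-memory SQLite engine stands in and the rows are hypothetical. The mssql+pyodbc URI form appears in the comment with placeholder values.

```python
import pandas as pd
from sqlalchemy import create_engine

# For SQL Server the URI would look like (placeholders, not real):
#   create_engine('mssql+pyodbc://USER:PW@SERVER/AdventureWorksDW2017'
#                 '?driver=ODBC+Driver+17+for+SQL+Server')
engine = create_engine('sqlite://')

# Stage a tiny stand-in for the DimCustomer table.
pd.DataFrame({'CustomerKey': [11000, 11001],
              'FirstName': ['Jon', 'Eugene']}).to_sql(
    'DimCustomer', engine, index=False)

t0 = pd.read_sql_table('DimCustomer', engine)
```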
If needed, the read_sql_table() reader accepts the parse_dates= argument to coerce date and datetime columns into a datetime64 type with
parse_dates={'BirthDate': {'format': '%Y-%m-%d'}}
using a nested Dictionary key:value pairs where the key is the name of the column followed by another Dictionary where the key is 'format' and the value is the date parsing instructions, known as format directives.
The to_datetime() function along with format directives is discussed in detail in Chapter 7, “Date and Time.”
To return a subset of columns from a database table, use the columns= argument illustrated in Listing 6-22.
The call to the read_sql_table() method contains three arguments. The first argument is the target table, DimCustomer; the second argument is the engine object containing the connection information needed to access the SQL Server Database instance; and the third argument is columns= which forms the SELECT list that is ultimately executed as a T-SQL query on the SQL Server instance.
Use the index_col= argument to map an input column as the DataFrame index to create row labels. This feature is shown in Listing 6-23. This example extends Listing 6-22 by adding a fourth parameter, index_col=, for the call to read_sql_table reader().
Observe how the CustomerKey column read from the database is now the row label for the t2 DataFrame.
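The columns= and index_col= arguments can be combined as in this sketch (SQLite stands in for SQL Server; the rows are hypothetical):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite://')
pd.DataFrame({'CustomerKey': [11000, 11001],
              'FirstName': ['Jon', 'Eugene']}).to_sql(
    'DimCustomer', engine, index=False)

# Return only FirstName, with CustomerKey promoted to the row labels.
t2 = pd.read_sql_table('DimCustomer', engine,
                       columns=['FirstName'],
                       index_col='CustomerKey')
```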
Query RDBMS Tables
As part of an analysis effort, we often need to construct SQL queries against the target dataset to return rows and columns to construct a DataFrame. Use the read_sql_query() reader to send an SQL query to the database and form a DataFrame from the returned result set. This feature is illustrated in Listing 6-24.
In this example, the first argument to the read_sql_query() is a valid T-SQL query for SQL Server. This is followed by the engine object which holds the connection information to the database created previously. This call creates the q1 DataFrame. Notice how each line of the SQL query uses single quotes and a space before the close quote.
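A read_sql_query() sketch in the same style (SQLite stands in for SQL Server; the table contents are hypothetical). Note the space before each close quote so the concatenated query text stays valid:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite://')
pd.DataFrame({'FirstName': ['Jon', 'Eugene'],
              'YearlyIncome': [90000, 60000]}).to_sql(
    'DimCustomer', engine, index=False)

# Each line of the query ends with a space before the close quote.
q1 = pd.read_sql_query('SELECT FirstName '
                       'FROM DimCustomer '
                       'WHERE YearlyIncome > 70000', engine)
```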
Listing 6-25 illustrates the SAS analog to Listing 6-22, with the same SELECT list. Use PROC PWENCODE to encode your password string to ensure it is not stored in clear text. Alternatively, you can assign your encoded password string to a SAS Macro variable with
proc pwencode in="YOUR_PASSWORD_HERE" out=pwtemp;
run;
This defines the Macro variable &pwtemp holding the encoded password string.
4 proc pwencode in=XXXXXXXXXXX;
5 run;
{SAS002}XXXXXXXXXXXXXXXXXXXXXXXXXXXXA2522
6 libname sqlsrvr odbc
7 uid=randy
8 pwd={SAS002}XXXXXXXXXXXXXXXXXXXXXXXXXXXXA2522
9 datasrc=AdventureWorksDW
10 bulkload=yes;
NOTE: Libref SQLSRVR was successfully assigned as follows:
Engine: ODBC
Physical Name: AdventureWorksDW
11 title1 "Default Informats and Formats";
12 proc sql;
13 create table customers as
14 select FirstName
15 ,LastName
16 ,BirthDate
17 ,Gender
18 ,YearlyIncome
19 ,CustomerKey
20 from sqlsrvr.DimCustomer;
NOTE: Table WORK.CUSTOMERS created, with 18484 rows and 6 columns.
21 select name
22 ,informat
23 ,format
24 from DICTIONARY.COLUMNS
25 where libname = 'WORK' &
26 memname = 'CUSTOMERS';
27 quit;
Listing 6-25
SAS ODBC Access to SQL Server
The second set of PROC SQL statements, lines 21–26, query the SAS-maintained table, DICTIONARY.COLUMNS, to return format and informat assignments to the WORK.CUSTOMERS columns. SAS formats and informats are analogous to column types, in that the informats direct how values are read on input and formats direct how values are written on output.
In this case, the incoming SQL Server table values for the BirthDate column are read using the SAS $10. informat treating the values as a 10-byte long character string. The output from PROC SQL is displayed in Figure 6-19.
In order to utilize the incoming BirthDate variable in SAS datetime expressions, a subsequent Data Step is needed to copy the BirthDate variable values into a different variable formatted for datetime handling. Listing 6-26 illustrates copying the customers dataset and using the INPUT function to load the original character values of the BirthDate variable into a numeric variable assigned a permanent format of YYMMDD10.
Unlike Python, existing SAS variables cannot be recast. To output the customers dataset with the BirthDate variable with a datetime format requires copying the dataset, renaming the incoming BirthDate variable to, in this case, dob, and assigning dob’s values to the output BirthDate variable with the INPUT function.
4 data customers(drop = dob);
5 set customers(rename=(BirthDate = dob));
6 length BirthDate 8;
7
8 BirthDate = input(dob,yymmdd10.);
9 format BirthDate yymmdd10.;
10 run;
NOTE: There were 18484 observations read from the dataset WORK.CUSTOMERS.
NOTE: The dataset WORK.CUSTOMERS has 18484 observations and 6 variables.
11 title1 "Modified Informats and Formats";
12 proc sql;
13 select name
14 ,informat
15 ,format
16 from DICTIONARY.COLUMNS
17 where libname = 'WORK' &
18 memname = 'CUSTOMERS';
19 quit;
Listing 6-26
Mapping BirthDate Variable to Datetime Variable
Figure 6-20 uses PROC SQL to query the SAS DICTIONARY tables in order to display the informats and formats assigned to the modified customers dataset.
Not all SQL queries return a result set. Listing 6-27 illustrates using the sql.execute() function. This is useful for queries that do not return a result set such as CREATE TABLE, DROP, INSERT statements, and so on. The SQL statements are specific to the target RDBMS.
<sqlalchemy.engine.result.ResultProxy object at 0x0000020172E0DE80>
>>> sql.execute('DROP TABLE CustomerPy', engine)
<sqlalchemy.engine.result.ResultProxy object at 0x0000020172E0DDA0>
>>> sql.execute("CREATE TABLE CustomerPy (ID int,
... Name nvarchar(255),
... StartDate date);", engine)
<sqlalchemy.engine.result.ResultProxy object at 0x0000020172E0D898>
Listing 6-27
sql.execute() Statement
In this case, the T-SQL syntax for SQL Server DROPs the CustomerPy table with
sql.execute('DROP TABLE CustomerPy', engine)
followed by a CREATE TABLE statement to define the CustomerPy table
sql.execute("CREATE TABLE CustomerPy (ID int,
Name nvarchar(255),
StartDate date);", engine)
Figure 6-21 displays the created SQL Server table, CustomerPy physical attributes created with the sql.execute() statement.
Listing 6-28 illustrates the use of the SQL pass-thru facility as a wrapper to pass T-SQL statements directly to SQL Server through the ODBC API. This example is the analog to Listing 6-27, passing SQL statements to the database that do not return a result set.
4 proc sql;
5 connect to odbc as sqlsrvr
6 (dsn=AdventureWorksDW
7 uid=randy
8 password=
8! {SAS002}XXXXXXXXXXXXXXXXXXXXXXXXXXXXA2522);
9
10 exec
11 (CREATE TABLE CustomerPy (ID int,
12 Name nvarchar(255),
13 StartDate date)) by sqlsrvr;
14
15 %put Note: Return Code for SQL Server is: &sqlxrc;
Note: Return Code for SQL Server is: 0
16 %put Note: Return Message for SQL Server is: &sqlxmsg;
Note: Return Message for SQL Server is:
17
18 select * from connection to sqlsrvr
19 (ODBC::SQLTables (,,"CustomerPy",));
20 disconnect from sqlsrvr;
21
22 %put Note: Return Code for SQL Server is: &sqlxrc;
Note: Return Code for SQL Server is: 0
23 %put Note: Return Message for SQL Server is: &sqlxmsg;
Note: Return Message for SQL Server is:
24 quit;
Listing 6-28
SAS Pass-Thru to SQL Server
With SAS SQL Pass-Thru, any statement inside a parenthesized expression is passed directly to the database library API, in this case, ODBC.
(CREATE TABLE CustomerPy (ID int,
Name nvarchar(255),
StartDate date))
SAS/ACCESS to ODBC supports calls to the ODBC API's catalog functions, such as ODBC::SQLTables. This interface acts as an alternative method for querying the RDBMS catalog tables. In this example
select * from connection to sqlsrvr
(ODBC::SQLTables (,,"CustomerPy",));
returns information about the created CustomerPy table. Figure 6-22 displays the query results.
Read SAS Datasets
Organizations often have large numbers of permanent SAS datasets stored on the filesystem. pandas provides the read_sas() reader for creating DataFrames by reading permanent SAS datasets. Permanent SAS datasets are often referred to as .sas7bdat files (after the extension SAS uses to name dataset files on Windows and Unix filesystems). Listing 6-29 uses a Data Step to create the permanent out_data.to_df SAS dataset, where out_data maps to the filesystem location C:\data and to_df is the name of the dataset, which is recognized by Windows as the file C:\data\to_df.sas7bdat.
4 libname out_data 'c:\data';
NOTE: Libref OUT_DATA was successfully assigned as follows:
Engine: V9
Physical Name: c:\data
5 data out_data.to_df;
6 length even_odd $ 4;
7 call streaminit(987650);
8 do datetime = '01Dec2018:00:00'dt to '02Dec2018:00:00'dt by 60;
9 amount = rand("Uniform", 50, 1200);
10 quantity = int(rand("Uniform", 1000, 5000));
11 if int(mod(quantity,2)) = 0 then even_odd = 'Even';
12 else even_odd = 'Odd';
13 output;
14 end;
15 format datetime datetime16.
16 amount dollar12.2
17 quantity comma10.;
18 run;
NOTE: The dataset OUT_DATA.TO_DF has 1441 observations and 4 variables.
19 proc contents data=out_data.to_df;
20 run;
21 proc print data=out_data.to_df(obs=5);
22 run;
NOTE: There were 5 observations read from the dataset OUT_DATA.TO_DF.
Listing 6-29
Create SAS Dataset for read_sas()
The SAS datetime variable is displayed with the DATETIME16. format; the amount variable is formatted to precede the value with a dollar sign ($), and the quantity variable embeds a comma (,) in its displayed value. Figure 6-23 uses PROC CONTENTS to display the attribute information for the dataset. Notice that since we are using Windows, the SAS dataset written to the filesystem uses wlatin1 encoding.
Figure 6-24 uses PROC PRINT to display the first five observations of the dataset.
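For orientation, the Data Step above can be approximated directly in pandas. This is a sketch, not the chapter's code: the seed value echoes the streaminit() call, and the column logic mirrors the SAS statements.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(987650)  # seed echoes streaminit(987650), for illustration only

# One row per 60 seconds from 01Dec2018 00:00 through 02Dec2018 00:00, inclusive
df = pd.DataFrame({"datetime": pd.date_range("2018-12-01", "2018-12-02", freq="60s")})
df["amount"] = rng.uniform(50, 1200, size=len(df))          # like rand("Uniform", 50, 1200)
df["quantity"] = rng.integers(1000, 5000, size=len(df))     # like int(rand("Uniform", ...))
df["even_odd"] = np.where(df["quantity"] % 2 == 0, "Even", "Odd")

print(len(df))  # 1441 rows, matching the SAS log note
```

Unlike SAS, pandas has no attached formats; the dollar sign and comma displays would be applied at output time, for example with the Styler or string formatting.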
With the OUT_DATA.TO_DF SAS dataset written to the filesystem, it can be used as input to the read_sas() reader to create a DataFrame. Listing 6-30 illustrates creating the from_sas DataFrame.
The call to the read_sas() reader has as its first argument the full Windows pathname to the OUT_DATA.TO_DF dataset (as it is known to SAS). The value for the second argument, format=, is SAS7BDAT; if the SAS dataset is in transport format, this value is set to XPORT instead. The third argument, encoding=, is set to latin-1 to match the encoding of the to_df.sas7bdat dataset (as it is known to Windows).
The read_sas() reader issues a file lock for the target input in order to read the SAS dataset. This can cause file contention issues if you attempt to open the SAS dataset for output after executing the read_sas() reader. The SAS log will issue the following error:
ERROR: A lock is not available for OUT_DATA.TO_DF.DATA.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The dataset OUT_DATA.TO_DF was only partially opened and will not be saved.
If this situation is encountered, ending the Python session used to call the read_sas() reader releases the file lock.
The pandas library does not provide a write_sas writer. In Chapter 8, “SASPy Module,” we cover details for executing bi-directional interchange between pandas DataFrames and SAS datasets using SAS Institute’s open source SASPy module.
Write RDBMS Tables
The pandas library provides the ability to write DataFrames as RDBMS tables with the to_sql() writer. Consider Listing 6-31, which uses the from_sas DataFrame created in Listing 6-30. In this example, the from_sas DataFrame calls the to_sql() method attempting to create the SQLTableFromDF SQL Server table. This call returns an ODBC error indicating the to_sql() syntax is correct, but the underlying SQLAlchemy call to the ODBC API failed.
pyodbc.ProgrammingError: ('42000', '[42000] [Microsoft][ODBC SQL Server Driver][SQL Server]The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request. (8003)')
Listing 6-31
Write DataFrame to Table
If you are using an ODBC driver other than the SQL Server one used here, you may encounter a different error message indicating the SQL inserts were rejected. Fortunately, this error is easily eliminated by using the chunksize= argument to the to_sql() writer, illustrated in Listing 6-32.
The call to the to_sql() writer in Listing 6-32 uses five arguments:
1.
The RDBMS target table to write, in this case, SQLTableFromDF.
2.
The engine object containing the RDBMS connection string defined in Listing 6-20.
3.
if_exists='replace' eliminates the need to explicitly DROP the target table if it already exists.
4.
chunksize=100 which enables writes in increments of 100 rows to avoid size limitations when attempting to load larger DataFrames to an RDBMS table.
5.
index=False does not define a primary key column when loading the DataFrame values into the RDBMS table.
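A minimal sketch of this call, run here against an in-memory SQLite database rather than SQL Server; the toy DataFrame and the sqlite:///:memory: connection string are assumptions for illustration.

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///:memory:")  # stand-in for the SQL Server engine

# A small stand-in for the from_sas DataFrame
df = pd.DataFrame({"ID": range(1, 251),
                   "amount": [i * 1.5 for i in range(250)]})

# Write in 100-row chunks; replace the table if it exists; no index column
df.to_sql("SQLTableFromDF", engine, if_exists="replace",
          chunksize=100, index=False)

# to_sql() returns no status message, so confirm the load with a query
confirm = pd.read_sql_query("SELECT COUNT(*) AS n FROM SQLTableFromDF", engine)
print(int(confirm["n"].iloc[0]))  # 250
```

With chunksize=100, the writer issues one INSERT per 100 rows, keeping each statement's parameter count under driver limits such as SQL Server's 2,100-parameter maximum.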
Unlike SAS, when the to_sql() writer writes to the RDBMS table, it does not return messages or provide a return code to indicate success. Of course, we can send queries to the RDBMS to validate the existence of the SQLTableFromDF SQL Server table, which is the purpose of Listing 6-33.
>>> confirm1 = pd.read_sql_query("SELECT column_name as 'COL_NAME', "
Notice the quoting. Each physical line in the call to the read_sql_query() reader requires double quotes. Single quotes are used to label column headings following the T-SQL AS keyword, along with the single quotes used in the T-SQL WHERE clause. The single quotes are passed to the RDBMS engine, since they are a required element of a valid T-SQL query.
Figure 6-25 displays the output from SQL Server Management Studio execution of this query.
In the case of Listing 6-32, there are no missing values in the from_sas DataFrame. Missing values between data formats can be challenging at times since each data format uses different sentinel values to indicate “missing”.
In the case of a DataFrame, missing values can be NaN for float64s and NaT (for Not a Time) for datetime64 types. Consider Listing 6-34. The goal of this example is to load the resulting to_mssql DataFrame into an SQL Server table to ensure subsequent SQL queries properly handle these missing values.
This example uses the loc() indexer to find rows meeting the logical condition that the quantity column value is less than 2000. For those rows, the amount and datetime column values are set to missing. For more details on using the loc() indexer to conditionally update values, see the section “Conditional Updates” in Chapter 5, “Data Management.”
In this example, notice the missing value for the datetime column, whose type is datetime64, is NaT. In both cases, the chained .isnull().sum() calls indicate there are rows with missing values in the to_mssql DataFrame.
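The conditional update pattern can be sketched on a small hypothetical DataFrame (toy data, not the chapter's copy of from_sas):

```python
import numpy as np
import pandas as pd

to_mssql = pd.DataFrame({
    "datetime": pd.to_datetime(["2018-12-01 00:00",
                                "2018-12-01 00:01",
                                "2018-12-01 00:02"]),
    "amount": [100.0, 200.0, 300.0],
    "quantity": [1500, 2500, 3500]})

# Where quantity < 2000, set amount and datetime to missing
mask = to_mssql["quantity"] < 2000
to_mssql.loc[mask, "amount"] = np.nan    # float64 missing value -> NaN
to_mssql.loc[mask, "datetime"] = pd.NaT  # datetime64 missing value -> NaT

print(to_mssql.isnull().sum())
```

The printed counts show one missing value each in the amount and datetime columns and none in quantity.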
Consider Listing 6-35. In this example, we copy the from_sas DataFrame to the to_mssql DataFrame. This is followed by a call to the to_sql() writer to load the to_mssql DataFrame to the SQLTableFromDF2 SQL Server table.
The call to the to_sql() writer creates the SQLTableFromDF2 SQL Server table. The syntax
>>> confirm2 = pd.read_sql_query("SELECT TOP 5 * "
... "FROM SQLTableFromDF2 "
... "WHERE amount is NULL ", engine)
executes a T-SQL query returning the first five rows from the table using the T-SQL WHERE clause to create the confirm2 DataFrame
WHERE amount is NULL
The call to the print() function for the confirm2 DataFrame shows that the datetime and amount values made the round trip from DataFrame to SQL Server table and back to a DataFrame, maintaining the integrity of missing values between both data formats.
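The round trip can be sketched with an in-memory SQLite database in place of SQL Server. SQLite has no TOP keyword, so the confirming query is simplified, and the toy two-row DataFrame is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///:memory:")  # stand-in for the SQL Server engine

df = pd.DataFrame({"datetime": [pd.Timestamp("2018-12-01"), pd.NaT],
                   "amount": [99.5, np.nan]})

# NaN and NaT are written to the table as SQL NULLs
df.to_sql("SQLTableFromDF2", engine, if_exists="replace", index=False)

# Read back only the rows where amount is NULL
confirm2 = pd.read_sql_query("SELECT * "
                             "FROM SQLTableFromDF2 "
                             "WHERE amount IS NULL", engine)
print(confirm2)
```

The returned row shows the NULLs surfacing again as missing values in the new DataFrame.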
Summary
Data analysis tasks often require building a data processing pipeline to acquire and format data among a number of different data formats. This chapter provides a solid grounding for acquiring data and loading it into DataFrames using the various pandas readers, and illustrates writing DataFrame values to .csv files, Excel files, and RDBMS tables. Adding this know-how to one’s skill set enables the user to build more elaborate scripts to tackle a range of data analysis tasks.