In this chapter we discuss methods for reading data from a range of input data sources such as comma-separated values (.csv) files, database tables, JSON, and other sources of input to create DataFrames. The pandas readers are a collection of input/output methods for writing and loading values into DataFrames. These input/output methods are analogous to the family of SAS/Access Software used by SAS to read data values into SAS datasets and write SAS datasets into target output formats.
The main pandas facilities for input and output to DataFrames are the readers, such as read_csv(), read_json(), and read_sql_table(), among others, which we cover in detail. These readers use similar syntax: the first argument is the path or location of the target, which can be a fully qualified filename, a relative filename (relative to the current working directory executing Python), a database table, or a URL (including HTTPS, SFTP, and S3 access methods), and so on. In many cases, the readers' argument defaults are sufficient for read/write operations.
The reader methods have arguments to specify:
Input and output locations
Column and index location and names
Parsing rules for handling incoming data
Missing data rules
Datetime handling
Quoting rules
Compression and file formats
Error handling
We start by examining a series of use cases for reading .csv files, beginning with simple examples followed by additional examples combining various arguments to enable a more complex set of parsing rules.
In this example, the call to the read_csv() reader uses two arguments. The URL object holds the string value for the input location to be read followed by the skiprows= argument. As you can see from Figure 6-1, the column headers are on row 3.
Also observe how the ID values for column “A” are left aligned which is the default for character strings in Excel. Column “F” is missing its header label and there are several missing values.
Listing 6-1 displays the column types. These types are inferred from the input data read from the .csv file. The read_csv() reader in this example maps the ID column to an int64 type despite the fact that these values are stored as strings in the .csv file. In subsequent examples we will see how to explicitly control type mappings when reading values into DataFrames or unloading values from DataFrames.
Listing 6-1 also displays the type for the Date column as object, indicating these values are stored in the df1 DataFrame as strings.
The read_csv() reader provisions the na_values= argument to define how missing values are handled on input. The na_values= argument allows the definition of any value to represent missing for both character and numeric column types. This feature is illustrated in Listing 6-2.
creates the miss object as a Dictionary key:value pair where the key Amount is the DataFrame column name and the values blank ('') and NA designate both as missing values. The na_values= parameter uses this miss Dictionary to map values to missing when they are encountered in the Amount column.
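A minimal sketch of this pattern, using an inline io.StringIO buffer as a stand-in for the chapter's .csv file (the sample values here are hypothetical):

```python
import io

import pandas as pd

# Inline stand-in for the .csv file; the rows are hypothetical.
csv = io.StringIO(
"""ID,Amount
0042,23.99
3013,
7846,NA
""")

# Map blank ('') and the string NA to missing, but only for the
# Amount column; dtype= keeps the leading zeros in ID.
miss = {'Amount': ['', 'NA']}
df2 = pd.read_csv(csv, na_values=miss, dtype={'ID': str})
```

Both flagged values surface as NaN in the Amount column, while the ID strings are read unchanged.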
In contrast to Listing 6-1, the value for the Amount column on row 3 in the df1 DataFrame is blank
>>> df1[3:4]
ID Date Amount Quantity Status Unnamed: 5
3 3013 12-Feb-17 15.0 Pending Harrison
whereas in Listing 6-2, the value for the Amount column on row 3 in the df2 DataFrame is NaN. These values are displayed with the row-slicing operations
>>> df2[3:4]
ID Date Amount Quantity Status Unnamed: 5
3 3013 12-Feb-17 NaN 15.0 Pending Harrison
The read_csv() reader has two parameters to explicitly set column types: the dtype= and the converters= arguments. Both arguments accept Dictionary key:value pairs with the keys identifying the target column(s). For dtype=, the values are the types to apply; for converters=, the values are functions converting the read values into their corresponding column types. The dtype= argument allows you to specify how to treat incoming values, for example, either as strings or numeric types. The converters= argument allows you to call a conversion function which in turn maps the data onto the desired column type based on the function's logic, for example, parsing a string value to be read as a datetime. The converters= argument takes precedence over the dtype= argument in the event both are used together.
Listing 6-3 illustrates using the dtype= argument to map the ID column type to object, in other words a sequence of characters.
In this example, the ID column in the df3 DataFrame has a type of object compared to the ID column type in the df2 DataFrame which is int64.
The converters= argument permits a function to convert the incoming data with the function results determining the column’s type. In this example, regardless of whether the incoming values for the ID column are numeric or character types, they are stored in the target DataFrame as strings.
To further illustrate this point, consider Listing 6-4. Here we define the strip_sign() function to remove the dollar sign ($) from the incoming values for the Amount column. The converters= argument contains the Dictionary key:value pair with the key identifying the Amount column and the corresponding value naming the converter function, in this case, strip_sign.
The strip_sign() function returns a float or a NaN when the incoming value is missing. Because the strip_sign() function returns numeric values, the resulting type for the Amount column in the df4 DataFrame is float64.
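A sketch of the converter pattern described above, with a hypothetical strip_sign() implementation and inline data standing in for the chapter's .csv file:

```python
import io

import pandas as pd

def strip_sign(x):
    """Remove a leading dollar sign; return a float, or NaN if blank."""
    x = x.strip()
    if not x:
        return float('nan')
    return float(x.lstrip('$'))

# Inline stand-in for the .csv file; the rows are hypothetical.
csv = io.StringIO(
"""ID,Amount
0042,$23.99
3013,
7846,$129.00
""")

df4 = pd.read_csv(csv, converters={'Amount': strip_sign})
```

Because strip_sign() returns floats (or NaN), the Amount column arrives with a float64 type.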
Listing 6-5 illustrates mapping the ID column as a string and setting this column as the DataFrame index at the end of the read/parse operation. Chapter 4, “Indexing and GroupBy,” contains details for working effectively with DataFrame indexes.
With the ID column as the index, rows can be sliced with their labeled values:
df5.loc['0042']
Date 16-Oct-17
Amount $23.99
Quantity 123
Status Closed
Unnamed: 5 Jansen
Name: 0042, dtype: object
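The pattern behind this output can be sketched as follows, with hypothetical inline data standing in for the chapter's .csv file:

```python
import io

import pandas as pd

raw = io.StringIO(
"""ID,Amount
0042,$23.99
3013,$15.00
""")

# Read ID as a string so leading zeros survive, then promote it
# to the DataFrame index for label-based slicing with .loc[].
df5 = pd.read_csv(raw, dtype={'ID': str})
df5.set_index('ID', inplace=True)
```

df5.loc['0042'] then returns the row labeled '0042'.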
Date Handling in .csv Files
A common requirement for reading data input is preserving date and datetime values rather than treating the values as strings. This can be done via the converters= argument, but the read_csv() reader also provides specialized parameters for datetime handling. In most cases, the default datetime parser simply needs to know which column or list of columns composes the input date or datetime values. In cases where date or datetime values are non-standard, the parse_dates= argument accepts a defined function to handle custom date and datetime formatting instructions.
Listing 6-6 illustrates the parse_dates= argument for the read_csv() method.
The parse_dates= parameter accepts a Python list of column names, in this case, the Date column. In Listing 6-5, the Date column in the df5 DataFrame has the default object type as shown:
>>> print(df5['Date'].dtype)
object
In contrast, with this example, the Date column is now mapped as a datetime64 column type as a result of using the parse_dates= argument.
>>> print(df6['Date'].dtype)
datetime64[ns]
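A runnable sketch of this parse_dates= behavior, with hypothetical inline data:

```python
import io

import pandas as pd

raw = io.StringIO(
"""ID,Date
0042,16-Oct-17
3013,12-Feb-17
""")

# Listing the Date column in parse_dates= maps it to datetime64
# instead of leaving the values as object strings.
df6 = pd.read_csv(raw, parse_dates=['Date'])
```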
Chapter 7, “Date and Time,” covers the details on date and datetime types and how they are manipulated.
Listing 6-7 illustrates the ability to use custom labels for column headings.
The names= argument accepts the cols object to label the DataFrame columns. You must set the header= argument to None to inform the reader not to use the column labels found in the .csv file. Next, the usecols= argument specifies the input columns to be read. Remember, Python uses a 0-based index for start positions, so column A in the .csv file maps to column 0 in the DataFrame, column B in the .csv file maps to column 1 in the DataFrame, and so on.
Finally, notice that since we are not reading the column labels found in the .csv file, the skiprows= argument is set to 3 for this example which is where we find the first line of data values (line 4 in the .csv file). Since Python uses a 0-based index as the start position, row 1 in the .csv file maps to row 0 in the DataFrame, row 2 in the .csv file maps to row 1 in the DataFrame, and so on.
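These arguments can be combined as in the following sketch; the file contents and labels here are hypothetical:

```python
import io

import pandas as pd

# Two preamble lines and a header row occupy rows 1-3 of the file,
# so skiprows=3 starts reading at the first data row.
raw = io.StringIO(
"""First Quarter Sales
As of 01-May-2017
ID,Date,Amount
0042,16-Oct-17,$23.99
3013,12-Feb-17,
""")

cols = ['Sale_ID', 'Sale_Date', 'Sale_Amount']
df7 = pd.read_csv(raw, names=cols, header=None,
                  usecols=[0, 1, 2], skiprows=3)
```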
Read .xls Files
As you might expect, pandas provides the read_excel() reader for reading Microsoft Excel files (.xls and .xlsx). The parameters and arguments for the read_excel() reader are similar to those of the read_csv() reader.
and copy it into a location where you have read/write access on the local filesystem. In this example, we chose “C:\data\messy_input.xlsx”.
In this example, we use the parse_dates= parameter to define the Date column with the syntax
parse_dates={'Date' :['Month', 'Day', 'Year']}
The parse_dates= argument accepts a Dictionary key:value pair where the key is the name of the output datetime DataFrame column being defined, in this case, Date. The Dictionary value names the input columns that compose the Date column values. In this case, the Year, Month, and Day columns are parsed together to define the Date column.
We also illustrate keeping the Year, Month, and Day columns with the syntax
keep_date_col=True
The keep_date_col= argument determines the keep disposition for the columns used with the parse_dates= argument. In this case, Year, Month, and Day columns are kept on the df8 DataFrame.
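The same column combination can also be expressed after the read with pd.to_datetime(), which accepts a DataFrame of year/month/day components. A sketch with hypothetical values (read_csv() stands in for read_excel() so the example is self-contained):

```python
import io

import pandas as pd

raw = io.StringIO(
"""ID,Year,Month,Day
0042,2017,10,16
3013,2017,2,12
""")

df8 = pd.read_csv(raw, dtype={'ID': str})

# Assemble Year, Month, and Day into a single datetime column while
# keeping the original three columns, as keep_date_col=True does.
df8['Date'] = pd.to_datetime(df8[['Year', 'Month', 'Day']])
```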
It is not uncommon to read multiple Excel files together to form a single DataFrame. Listing 6-9 illustrates this use case.
Consider sales transactions for each month stored in separate Excel files, each with an identical layout. In this example, we introduce the glob module from the Python Standard Library. The glob module finds all the filenames matching a specified pattern based on the rules used by the Unix shell. This is useful when you need to search for filenames using string expressions and wildcard characters.
The three .xlsx files for this example are located at
... final = final.append(df, ignore_index=False, sort=False)
...
>>> print(final)
Amount Quantity
ID
1042 99.89 21
3311 59.99 12
9846 12.99 25
2222 19.19 115
8931 79.99 2
0044 199.89 10
8731 49.99 2
7846 129.00 45
1111 89.19 15
2231 99.99 1
0002 79.89 43
2811 19.99 19
8468 112.99 25
3333 129.99 11
9318 69.99 12
Listing 6-9
Appending Multiple .xls Files into a DataFrame
The input object is defined by calling the glob module to search the filenames using the pattern
'C:\data\*_2*.xlsx'
This call returns a Python list of filenames matching our pattern subsequently displayed by calling the first print() method. The files to append together are
In order to read these .xls files in a single operation, an empty DataFrame called final is defined with the syntax
final = pd.DataFrame()
Next, a for loop iterates over the Python list of our three input files returned by the glob module, calling the read_excel() reader to read the current .xls file into the df DataFrame. Finally, the pandas append() method is called on the final DataFrame, appending each df DataFrame to the end of the final DataFrame. Also notice that inside the for loop, the set_index() method is used to map the ID column as the DataFrame index.
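In pandas 2.0 and later the DataFrame.append() method has been removed; pd.concat() is its replacement. A sketch of the same loop, with three hypothetical monthly files written to a temporary directory (.csv stand-ins avoid the Excel engine dependency):

```python
import glob
import os
import tempfile

import pandas as pd

# Simulate the three monthly input files with hypothetical rows.
tmp = tempfile.mkdtemp()
for month in ('January', 'February', 'March'):
    pd.DataFrame({'ID': ['0042'], 'Amount': [23.99]}).to_csv(
        os.path.join(tmp, month + '_2018.csv'), index=False)

# Read each file, set ID as the index, and stack the results.
frames = []
for f in sorted(glob.glob(os.path.join(tmp, '*_2018.csv'))):
    df = pd.read_csv(f, dtype={'ID': str}).set_index('ID')
    frames.append(df)
final = pd.concat(frames)
```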
An alternative to retrieving a Python list of fully qualified pathnames for this type of iterative processing is retrieving input filenames relative to the current working directory executing your Python session. This involves importing the os module from the Python Standard Library to first determine the location for the Python’s current working directory followed by a call to change the current working directory to a new location. This operation is illustrated in Listing 6-10.
creates the wd object holding the value for the location of the current working directory executing the Python session, in this case
C:\Users\randy\Desktop\Py_Source
The syntax
os.chdir('C:\data')
calls the os.chdir() function and changes the working directory for the current Python session to
C:\data
The call to the glob module uses *.csv as an argument to define the csv_files object holding a Python list of all the filenames in the current directory having the filename extension .csv.
5 rows created in WORK.TEMP1 from c:\data\February_2018.csv.
5 rows created in WORK.TEMP2 from c:\data\January_2018.csv.
5 rows created in WORK.TEMP3 from c:\data\March_2018.csv.
94 data final;
95 set &imported;
96 format amount dollar8.2;
97 run;
NOTE: There were 5 observations read from the dataset WORK.TEMP1.
NOTE: There were 5 observations read from the dataset WORK.TEMP2.
NOTE: There were 5 observations read from the dataset WORK.TEMP3.
NOTE: The dataset WORK.FINAL has 15 observations and 3 variables.
Listing 6-11
Appending Multiple .csv Files to a Dataset
The _null_ Data Step reads the list of filenames from the FILENAME PIPE statement and assigns these fully qualified filename values to the file_name variable. The CALL SYMPUT statement creates the Macro variable &imported; on each iteration of the Data Step, it concatenates the values “temp1”, “temp2”, and so on. This Macro variable value becomes the list of input datasets used on line 95 to create the output dataset final.
The syntax
call execute('proc import datafile="'
|| strip(file_name)
|| '" out=temp'
|| strip(put(_n_,best.))
|| ' dbms=csv replace; run;'
);
builds the PROC IMPORT statement and corresponding options as a parameter to the CALL EXECUTE statement. The CALL EXECUTE statement executes the call to PROC IMPORT three times, one for each record read from the FILENAME PIPE statement.
Figure 6-3 uses PROC PRINT to display the contents of the final dataset.
Write .csv Files
There are cases when you need to export the contents of a DataFrame to a .csv file or an .xls file. In those cases, use the to_csv() writer and the to_excel() writer. The to_csv() writer is illustrated in Listing 6-12.
The writer does not return any information to the console to indicate the operation’s success. Figure 6-4 displays the first five rows of the output file, in this case, C:\data\final.csv.
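A minimal to_csv() sketch with hypothetical data, writing to an in-memory buffer rather than a file on disk:

```python
import io

import pandas as pd

final = pd.DataFrame({'ID': ['0042', '3013'],
                      'Amount': [23.99, 129.00]}).set_index('ID')

# to_csv() accepts a path or any file-like object; nothing is echoed
# to the console on success.
buf = io.StringIO()
final.to_csv(buf)
```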
Use PROC EXPORT to write the contents of a SAS dataset to a .csv file. This capability is illustrated in Listing 6-13.
4 filename out_csv "c:\data\final_ds.csv";
5 proc export data = final
6 outfile = out_csv
7 dbms = csv;
8 run;
NOTE: The file OUT_CSV is:
Filename=c:\data\final_ds.csv,
RECFM=V,LRECL=32767,File Size (bytes)=0,
Last Modified=12Nov2018:10:13:47,
Create Time=12Nov2018:10:13:47
NOTE: 16 records were written to the file OUT_CSV.
The minimum record length was 11.
The maximum record length was 18.
NOTE: There were 15 observations read from the dataset WORK.FINAL.
15 records created in OUT_CSV from FINAL.
NOTE: "OUT_CSV" file was successfully created.
Listing 6-13
Output SAS Dataset to .csv
The SAS log indicates the location and physical file characteristics of the output .csv along with the number of records successfully written.
Write .xls Files
You can write a DataFrame to an Excel sheet, or multiple DataFrames to a book of sheets, using the to_excel() writer. The arguments are largely the same as those of the to_csv() writer. This capability is illustrated in Listing 6-14.
The results from the to_excel() operation are displayed in Figure 6-5.
There are multiple ways to output a SAS dataset to .xls files. PROC EXPORT provides a convenient approach; however, it limits the format control you can exert on the appearance of the output .xls file. Alternatively, if you want much finer control over the output format in Excel, the ODS tagsets.ExcelXP destination can be used. Listing 6-15 illustrates calling tagsets.ExcelXP to write the contents of a SAS dataset to an Excel file.
4 ods tagsets.ExcelXP
5 file="c:\data\final_ds.xls"
6 style=statistical
7 options(frozen_headers='1'
8 embedded_titles='yes'
9 default_column_width='18');
NOTE: Writing TAGSETS.EXCELXP Body file: c:\data\final_ds.xls
NOTE: This is the Excel XP tagset (Compatible with SAS 9.1.3 and above, v1.131, 04/23/2015). Add
options(doc='help') to the ods statement for more information.
10
11 proc print data=final;
12 run;
NOTE: There were 15 observations read from the dataset WORK.FINAL.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.08 seconds
cpu time 0.01 seconds
13 ods tagsets.excelxp close;
14 run;
Listing 6-15
Output Final Dataset to .xls
Note that if you are using SAS Display Manager to generate this example, you may need to disable the “View results as they are generated” feature. In Display Manager, go to the Tools menu, select Options ➤ Preferences, and then select the Results tab. Uncheck the box labeled “View results as they are generated” as shown in Figure 6-6.
Read JSON
JSON stands for JavaScript Object Notation and is a well-defined structure for exchanging data among different applications. JSON is designed to be read by humans and easily parsed by programs. It is relied upon to transmit data through RESTful web services and APIs.
The flexibility of calling a REST API to acquire data is illustrated in Listing 6-16. This example creates the jobs DataFrame by calling the GitHub Jobs API over HTTPS using the read_json() reader to return a set of posted positions along with a detailed job description.
4 Norwegian Block Exchange AS Oslo, Norway or Remote
Listing 6-16
pandas Read JSON API
In this example, the read_json() reader makes a call over HTTPS to the GitHub Jobs API returning records where the position field contains the string ‘Python’. At the time this script was executed, there were 41 job positions posted. This is shown by the call to the shape attribute showing 41 rows and 11 columns in the jobs DataFrame.
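The GitHub Jobs API has since been retired, so this sketch parses an inline JSON document of a similar (hypothetical) shape; read_json() accepts URLs and file-like objects alike:

```python
import io

import pandas as pd

# Hypothetical records shaped loosely like the retired jobs API.
records = io.StringIO(
    '[{"id": 1, "title": "Python Developer", "location": "Oslo, Norway"},'
    ' {"id": 2, "title": "Data Engineer", "location": "Remote"}]')

jobs = pd.read_json(records)
```

The shape attribute then reports the number of rows and columns returned, as in the chapter's example.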
Similarly, SAS provides several different methods for reading JSON files. By far, the simplest method is to use the JSON LIBNAME access method, illustrated in Listing 6-17.
Along with a GET to return the records to the SAS session. The OUT=response statement stores incoming records in a temporary file associated with the response FILEREF.
3.
Copy the incoming JSON.ROOT and JSON.ALLDATA datasets to the WORK directory. Obviously, you can copy these datasets into a permanent SAS data library. The copy operation is needed since the JSON access method reads the incoming records only once.
4.
Continue the processing logic to manipulate the ROOT dataset created by the JSON LIBNAME engine. In our case we simply call PROC PRINT to display the first five observations from the dataset.
In order to make this particular example work, the SAS session must be executed with an encoding of UTF-8. By default, SAS sessions executing under Windows use WLATIN1 encoding which can lead to transcoding errors when calling the JSON LIBNAME engine to read UTF-8 formatted records. Use the following PROC OPTIONS statement to determine the encoding method used by the SAS session:
proc options option=encoding; run;
SAS (r) Proprietary Software Release 9.4 TS1M5
ENCODING=UTF-8 Specifies the default character-set encoding for the SAS session.
Note that the encoding option can only be changed at SAS initialization time. This is generally controlled by the sasv9.cfg configuration file. The default location on Windows is
The results of writing the df8 DataFrame to a JSON-formatted file is displayed in Figure 6-8.
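A to_json() writer sketch with hypothetical data; orient='records' emits one JSON object per row:

```python
import pandas as pd

df8 = pd.DataFrame({'ID': ['0042', '3013'],
                    'Amount': [23.99, 129.00]})

# Serialize the DataFrame to a JSON string; passing a path instead
# writes the same text to a file.
out = df8.to_json(orient='records')
```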
In order to write a SAS dataset to a Series of JSON records, use PROC JSON as illustrated in Listing 6-19. PROC JSON was introduced with Base SAS release 9.4.
4 proc json out='c:\data\sas_final.json' pretty
5 nosastags;
6 export final;
7 run;
Listing 6-19
Output a SAS Dataset to JSON
The first two JSON records are displayed in Figure 6-9.
Read RDBMS Tables
Common input sources of data for analysis are relational database tables. The pandas.io.sql module provides a set of query wrappers to enable data retrieval while minimizing dependencies on RDBMS-specific APIs. Put another way, the clever folks who brought you pandas also figured out they could avoid reinventing the wheel, so they utilize a library called SQLAlchemy as an abstraction layer over the various databases needing to be read. This approach reduces the amount of database-dependent code pandas needs internally to read and write data using ODBC-compliant engines, which is to say, nearly all databases.
By using the SQLAlchemy library to read RDBMS tables (and queries), you pass SQLAlchemy Expression language constructs which are database-agnostic to the target database. This is analogous to PROC SQL’s behavior of using general SAS SQL constructs which in turn are translated on your behalf for a specific database without you having to know the RDBMS SQL dialect.
In order to execute these Read RDBMS Table examples, you need three components to be set up. They are
1.
The SQLAlchemy library
2.
The appropriate RDBMS driver library, in our case, the pyodbc library
3.
The appropriately configured Windows ODBC client connecting to a target database
The following examples use Microsoft SQL Server 2017 for Windows as the target RDBMS for reading and writing. With a little bit of effort, you can convert these examples to execute on a different RDBMS running either locally or remotely.
The SQLAlchemy library is a part of the Anaconda distribution described in Chapter 1, “Why Python?”
To confirm a version of the SQLAlchemy library is available, start a Python session and enter
>>> import sqlalchemy
>>> sqlalchemy.__version__
'1.2.7'
Notice the two consecutive underscores before and after the string version.
The second part of the setup is the client-side library, which is the pyodbc library in our case. To confirm a version of the pyodbc library is available, start a Python session and enter
>>> import pyodbc
>>> pyodbc.version
'4.0.23'
Our setup has both Python 3.6 and SQL Server 2017 executing on the same Windows 10 machine. If the SQL Server instance is running remotely, you will need to make the appropriate adjustments.
The third and final step is to configure an ODBC DSN to connect to the target RDBMS. In order to set up the appropriate ODBC client interface on Windows 10, launch the ODBC Data Source Administrator by going to the Windows Start Menu to click
Start ➤ Windows Administrative Tools ➤ ODBC Data Sources (64-bit) to launch the ODBC Data Source Administrator shown in Figure 6-10.
If you are using Windows 7, navigate to Control Panel ➤ System and Security ➤ Administrative Tools and select Data Sources (ODBC). These examples illustrate the ODBC setup for Windows 10.
Make sure that the System DSN tab is selected and then click the Add… button to select the driver for the SQL Server Database. See Figure 6-11.
Press the Finish button which displays the Create a New Data Source to SQL Server dialog box presented in Figure 6-12.
For Name:, we choose AdventureWorksDW to indicate we are using the AdventureWorksDW2017 database, which SQL Server ships as an illustration for its Analysis Services. For Server:, supply the instance name of the SQL Server Database you want to connect to. In our case, the SQL Server instance is a local instance named RANDY-PC. Notice we use AdventureWorksDW as the ODBC data source name, which in turn connects to the SQL Server AdventureWorksDW2017 database.
For more information about the AdventureWorks2017 sample databases, see
Press the Next> dialog box and you are presented with the authentication dialog box shown in Figure 6-13.
In our case, we select Windows authentication using the network login ID. Press the Next> dialog box. This action presents the Change Default Database Dialog menu shown in Figure 6-14.
If your dialog box indicates “master” as the default database, then check the box labeled “Change the default database to:” and select AdventureWorksDW2017, assuming it is available and your account has been granted read access.
Check the Change the default database tick box and supply the database name AdventureWorksDW2017. This is illustrated in Figure 6-15.
Press Finish> to be presented with the “Change the language of SQL Server messages to:” tick box. In our case, we chose to log the ODBC driver statistics. This is illustrated in Figure 6-16.
Press Finish> to be presented with the ODBC Data Source Configuration panel shown in Figure 6-17.
Press Test Data Source... box to test the ODBC Connection. You should see the Test Results dialog box shown in Figure 6-18.
In order to connect to the SQL Server AdventureWorksDW2017 database with SQLAlchemy, we use the create_engine function to create an engine object from the database URI. Database-specific information for the create_engine function and URI formatting is at
You only need to create the engine once per database instance you are connecting to. The create_engine function containing the SQL Server Database URI is illustrated in Listing 6-20.
>>> import pyodbc
>>> from sqlalchemy import create_engine, MetaData, Table, select
where USER:PW is the userid and password pair and SERVER is the remote hostname or IP address running the SQL Server instance. Also notice the ODBC driver name could be different, depending on which driver you use. You may need to contact your database administrator for additional information.
Once the engine object is correctly defined, use the read_sql_table() method to read all or subsets of database artifacts such as tables and views. Consider Listing 6-21, which needs just two arguments: the target database table, in this case, the DimCustomer table, and the engine object defining the connection string to the target database instance.
The info attribute for the t0 DataFrame displays the names of all the columns read from the database table, the number of rows with non-null values, and the columns’ corresponding types. For example, the BirthDate column is mapped to a datetime64 type. This indicates the BirthDate column can be used in datetime arithmetic expressions without needing to cast it to datetime64 using the to_datetime() function.
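A runnable read_sql_table() sketch; since a SQL Server instance is not assumed available here, an in-memory SQLite engine stands in and the rows are hypothetical. The mssql+pyodbc URI form appears in the comment with placeholder values.

```python
import pandas as pd
from sqlalchemy import create_engine

# For SQL Server the URI would look like (placeholders, not real):
#   create_engine('mssql+pyodbc://USER:PW@SERVER/AdventureWorksDW2017'
#                 '?driver=ODBC+Driver+17+for+SQL+Server')
engine = create_engine('sqlite://')

# Stage a tiny stand-in for the DimCustomer table.
pd.DataFrame({'CustomerKey': [11000, 11001],
              'FirstName': ['Jon', 'Eugene']}).to_sql(
    'DimCustomer', engine, index=False)

t0 = pd.read_sql_table('DimCustomer', engine)
```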
If needed, the read_sql_table() reader accepts the parse_dates= argument to coerce date and datetime columns into a datetime64 type with
parse_dates={'BirthDate': {'format': '%Y-%m-%d'}}
using a nested Dictionary key:value pairs where the key is the name of the column followed by another Dictionary where the key is 'format' and the value is the date parsing instructions, known as format directives.
The to_datetime() function along with format directives is discussed in detail in Chapter 7, “Date and Time.”
To return a subset of columns from a database table, use the columns= argument illustrated in Listing 6-22.
The call to the read_sql_table() method contains three arguments. The first argument is the target table, DimCustomer; the second argument is the engine object containing the connection information needed to access the SQL Server Database instance; and the third argument is columns= which forms the SELECT list that is ultimately executed as a T-SQL query on the SQL Server instance.
Use the index_col= argument to map an input column as the DataFrame index to create row labels. This feature is shown in Listing 6-23. This example extends Listing 6-22 by adding a fourth parameter, index_col=, for the call to read_sql_table reader().
Observe how the CustomerKey column read from the database is now the row label for the t2 DataFrame.
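The columns= and index_col= arguments can be combined as in this sketch (SQLite stands in for SQL Server; the rows are hypothetical):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite://')
pd.DataFrame({'CustomerKey': [11000, 11001],
              'FirstName': ['Jon', 'Eugene']}).to_sql(
    'DimCustomer', engine, index=False)

# Return only FirstName, with CustomerKey promoted to the row labels.
t2 = pd.read_sql_table('DimCustomer', engine,
                       columns=['FirstName'],
                       index_col='CustomerKey')
```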
Query RDBMS Tables
As part of an analysis effort, we often need to construct SQL queries against the target dataset to return rows and columns to construct a DataFrame. Use the read_sql_query() reader to send an SQL query to the database and form a DataFrame from the returned result set. This feature is illustrated in Listing 6-24.
In this example, the first argument to the read_sql_query() is a valid T-SQL query for SQL Server. This is followed by the engine object which holds the connection information to the database created previously. This call creates the q1 DataFrame. Notice how each line of the SQL query uses single quotes and a space before the close quote.
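A read_sql_query() sketch in the same style (SQLite stands in for SQL Server; the table contents are hypothetical). Note the space before each close quote so the concatenated query text stays valid:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite://')
pd.DataFrame({'FirstName': ['Jon', 'Eugene'],
              'YearlyIncome': [90000, 60000]}).to_sql(
    'DimCustomer', engine, index=False)

# Each line of the query ends with a space before the close quote.
q1 = pd.read_sql_query('SELECT FirstName '
                       'FROM DimCustomer '
                       'WHERE YearlyIncome > 70000', engine)
```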
Listing 6-25 illustrates the SAS analog to Listing 6-22, with the same SELECT list. Use PROC PWENCODE to encode your password string to ensure it is not stored in clear text. Alternatively, you can assign your encoded password string to a SAS Macro variable with
proc pwencode in="YOUR_PASSWORD_HERE" out=pwtemp;
run;
This defines the Macro variable &pwtemp holding the encoded password string.
4 proc pwencode in=XXXXXXXXXXX;
5 run;
{SAS002}XXXXXXXXXXXXXXXXXXXXXXXXXXXXA2522
6 libname sqlsrvr odbc
7 uid=randy
8 pwd={SAS002}XXXXXXXXXXXXXXXXXXXXXXXXXXXXA2522
9 datasrc=AdventureWorksDW
10 bulkload=yes;
NOTE: Libref SQLSRVR was successfully assigned as follows:
Engine: ODBC
Physical Name: AdventureWorksDW
11 title1 "Default Informats and Formats";
12 proc sql;
13 create table customers as
14 select FirstName
15 ,LastName
16 ,BirthDate
17 ,Gender
18 ,YearlyIncome
19 ,CustomerKey
20 from sqlsrvr.DimCustomer;
NOTE: Table WORK.CUSTOMERS created, with 18484 rows and 6 columns.
21 select name
22 ,informat
23 ,format
24 from DICTIONARY.COLUMNS
25 where libname = 'WORK' &
26 memname = 'CUSTOMERS';
27 quit;
Listing 6-25
SAS ODBC Access to SQL Server
The second set of PROC SQL statements, lines 21–26, query the SAS-maintained table, DICTIONARY.COLUMNS, to return format and informat assignments to the WORK.CUSTOMERS columns. SAS formats and informats are analogous to column types, in that the informats direct how values are read on input and formats direct how values are written on output.
In this case, the incoming SQL Server table values for the BirthDate column are read using the SAS $10. informat treating the values as a 10-byte long character string. The output from PROC SQL is displayed in Figure 6-19.
In order to utilize the incoming BirthDate variable in SAS datetime expressions, a subsequent Data Step is needed to copy the BirthDate variable values into a different variable formatted for datetime handling. Listing 6-26 illustrates copying the customers dataset and using the INPUT function to load the original character values of the BirthDate variable into a numeric variable assigned a permanent format of YYMMDD10.
Unlike Python, existing SAS variables cannot be recast. To output the customers dataset with the BirthDate variable with a datetime format requires copying the dataset, renaming the incoming BirthDate variable to, in this case, dob, and assigning dob’s values to the output BirthDate variable with the INPUT function.
4 data customers(drop = dob);
5 set customers(rename=(BirthDate = dob));
6 length BirthDate 8;
7
8 BirthDate = input(dob,yymmdd10.);
9 format BirthDate yymmdd10.;
10 run;
NOTE: There were 18484 observations read from the dataset WORK.CUSTOMERS.
NOTE: The dataset WORK.CUSTOMERS has 18484 observations and 6 variables.
11 title1 "Modified Informats and Formats";
12 proc sql;
13 select name
14 ,informat
15 ,format
16 from DICTIONARY.COLUMNS
17 where libname = 'WORK' &
18 memname = 'CUSTOMERS';
19 quit;
Listing 6-26
Mapping BirthDate Variable to Datetime Variable
Figure 6-20 uses PROC SQL to query the SAS DICTIONARY tables in order to display the informats and formats assigned to the modified customers dataset.
Not all SQL queries return a result set. Listing 6-27 illustrates using the sql.execute() function. This is useful for queries that do not return a result set such as CREATE TABLE, DROP, INSERT statements, and so on. The SQL statements are specific to the target RDBMS.
<sqlalchemy.engine.result.ResultProxy object at 0x0000020172E0DE80>
>>> sql.execute('DROP TABLE CustomerPy', engine)
<sqlalchemy.engine.result.ResultProxy object at 0x0000020172E0DDA0>
>>> sql.execute("CREATE TABLE CustomerPy (ID int,
... Name nvarchar(255),
... StartDate date);", engine)
<sqlalchemy.engine.result.ResultProxy object at 0x0000020172E0D898>
Listing 6-27
sql.execute() Statement
In this case, the T-SQL syntax for SQL Server DROPs the CustomerPy table with
sql.execute('DROP TABLE CustomerPy', engine)
followed by a CREATE TABLE statement to define the CustomerPy table
sql.execute("CREATE TABLE CustomerPy (ID int,
Name nvarchar(255),
StartDate date);", engine)
Figure 6-21 displays the created SQL Server table, CustomerPy physical attributes created with the sql.execute() statement.
Listing 6-28 illustrates the use of the SQL pass-thru facility as a wrapper to pass T-SQL statements directly to SQL Server through the ODBC API. This example is the analog to Listing 6-27, passing SQL statements to the database that do not return a result set.
4 proc sql;
5 connect to odbc as sqlsrvr
6 (dsn=AdventureWorksDW
7 uid=randy
8 password=
8! {SAS002}XXXXXXXXXXXXXXXXXXXXXXXXXXXXA2522);
9
10 exec
11 (CREATE TABLE CustomerPy (ID int,
12 Name nvarchar(255),
13 StartDate date)) by sqlsrvr;
14
15 %put Note: Return Code for SQL Server is: &sqlxrc;
Note: Return Code for SQL Server is: 0
16 %put Note: Return Message for SQL Server is: &sqlxmsg;
Note: Return Message for SQL Server is:
17
18 select * from connection to sqlsrvr
19 (ODBC::SQLTables (,,"CustomerPy",));
20 disconnect from sqlsrvr;
21
22 %put Note: Return Code for SQL Server is: &sqlxrc;
Note: Return Code for SQL Server is: 0
23 %put Note: Return Message for SQL Server is: &sqlxmsg;
Note: Return Message for SQL Server is:
24 quit;
Listing 6-28
SAS Pass-Thru to SQL Server
With SAS SQL Pass-Thru, any statement inside a parenthesized expression is passed directly to the database library API, in this case, ODBC.
(CREATE TABLE CustomerPy (ID int,
Name nvarchar(255),
StartDate date))
SAS/ACCESS to ODBC supports calls to the ODBC API's catalog functions, such as ODBC::SQLTables. This interface acts as an alternative method for querying the RDBMS catalog tables. In this example
select * from connection to sqlsrvr
(ODBC::SQLTables (,,"CustomerPy",));
returns information about the created CustomerPy table. Figure 6-22 displays the query results.
Read SAS Datasets
Organizations often have large numbers of permanent SAS datasets stored on the filesystem. pandas provides the read_sas() reader for creating DataFrames by reading permanent SAS datasets. Permanent SAS datasets are often referred to as .sas7bdat files (after the extension SAS uses to name dataset files on Windows and Unix filesystems). Listing 6-29 uses a Data Step to create the permanent out_data.to_df SAS dataset, where out_data maps to the filesystem location C:\data and to_df is the name of the dataset, which is recognized by Windows as the file C:\data\to_df.sas7bdat.
4 libname out_data 'c:\data';
NOTE: Libref OUT_DATA was successfully assigned as follows:
Engine: V9
Physical Name: c:\data
5 data out_data.to_df;
6 length even_odd $ 4;
7 call streaminit(987650);
8 do datetime = '01Dec2018:00:00'dt to '02Dec2018:00:00'dt by 60;
9 amount = rand("Uniform", 50, 1200);
10 quantity = int(rand("Uniform", 1000, 5000));
11 if int(mod(quantity,2)) = 0 then even_odd = 'Even';
12 else even_odd = 'Odd';
13 output;
14 end;
15 format datetime datetime16.
16 amount dollar12.2
17 quantity comma10.;
18 run;
NOTE: The dataset OUT_DATA.TO_DF has 1441 observations and 4 variables.
19 proc contents data=out_data.to_df;
20 run;
21 proc print data=out_data.to_df(obs=5);
22 run;
NOTE: There were 5 observations read from the dataset OUT_DATA.TO_DF.
Listing 6-29
Create SAS Dataset for read_sas()
The SAS datetime variable is displayed with the DATETIME16. format; the amount variable is formatted to precede the value with a dollar sign ($), and the quantity variable embeds a comma (,) in its displayed value. Figure 6-23 uses PROC CONTENTS to display the attribute information for the dataset. Notice that since we are using Windows, the SAS dataset written to the filesystem uses wlatin1 encoding.
Figure 6-24 uses PROC PRINT to display the first five observations of the dataset.
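For orientation, the Data Step above can be approximated directly in pandas. This is a sketch, not the chapter's code: the seed value echoes the streaminit() call, and the column logic mirrors the SAS statements.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(987650)  # seed echoes streaminit(987650), for illustration only

# One row per 60 seconds from 01Dec2018 00:00 through 02Dec2018 00:00, inclusive
df = pd.DataFrame({"datetime": pd.date_range("2018-12-01", "2018-12-02", freq="60s")})
df["amount"] = rng.uniform(50, 1200, size=len(df))          # like rand("Uniform", 50, 1200)
df["quantity"] = rng.integers(1000, 5000, size=len(df))     # like int(rand("Uniform", ...))
df["even_odd"] = np.where(df["quantity"] % 2 == 0, "Even", "Odd")

print(len(df))  # 1441 rows, matching the SAS log note
```

Unlike SAS, pandas has no attached formats; the dollar sign and comma displays would be applied at output time, for example with the Styler or string formatting.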
With the OUT_DATA.TO_DF SAS dataset written to the filesystem, it can be used as input to the read_sas() reader to create a DataFrame. Listing 6-30 illustrates creating the from_sas DataFrame.
The call to the read_sas() reader has as its first argument the full Windows pathname to the OUT_DATA.TO_DF dataset (as it is known to SAS). The value for the second argument, format=, is SAS7BDAT; if the SAS dataset is in transport format, this value is set to XPORT instead. The third argument, encoding=, is set to latin-1 to match the encoding of the to_df.sas7bdat dataset (as it is known to Windows).
The read_sas() reader issues a file lock for the target input in order to read the SAS dataset. This can cause file contention issues if you attempt to open the SAS dataset for output after executing the read_sas() reader. The SAS log will issue the following error:
ERROR: A lock is not available for OUT_DATA.TO_DF.DATA.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The dataset OUT_DATA.TO_DF was only partially opened and will not be saved.
If this situation is encountered, ending the Python session used to call the read_sas() reader releases the file lock.
The pandas library does not provide a write_sas writer. In Chapter 8, “SASPy Module,” we cover details for executing bi-directional interchange between pandas DataFrames and SAS datasets using SAS Institute’s open source SASPy module.
Write RDBMS Tables
The pandas library provides the ability to write DataFrames as RDBMS tables with the to_sql() writer. Consider Listing 6-31, which uses the from_sas DataFrame created in Listing 6-30. In this example, the from_sas DataFrame calls the to_sql() method attempting to create the SQLTableFromDF SQL Server table. This call returns an ODBC error indicating the to_sql() syntax is correct, but the underlying SQLAlchemy call to the ODBC API failed.
pyodbc.ProgrammingError: ('42000', '[42000] [Microsoft][ODBC SQL Server Driver][SQL Server]The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request. (8003)')
Listing 6-31
Write DataFrame to Table
If you are using an ODBC driver other than the SQL Server one used here, you may encounter a different error message indicating the SQL inserts were rejected. Fortunately, this error is easily eliminated by using the chunksize= argument to the to_sql() writer, illustrated in Listing 6-32.
The call to the to_sql() writer in Listing 6-32 uses five arguments:
1.
The RDBMS target table to write, in this case, SQLTableFromDF.
2.
The engine object containing the RDBMS connection string defined in Listing 6-20.
3.
if_exists='replace' eliminates the need to explicitly DROP the target table if it already exists.
4.
chunksize=100 which enables writes in increments of 100 rows to avoid size limitations when attempting to load larger DataFrames to an RDBMS table.
5.
index=False does not define a primary key column when loading the DataFrame values into the RDBMS table.
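A minimal sketch of this call, run here against an in-memory SQLite database rather than SQL Server; the toy DataFrame and the sqlite:///:memory: connection string are assumptions for illustration.

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///:memory:")  # stand-in for the SQL Server engine

# A small stand-in for the from_sas DataFrame
df = pd.DataFrame({"ID": range(1, 251),
                   "amount": [i * 1.5 for i in range(250)]})

# Write in 100-row chunks; replace the table if it exists; no index column
df.to_sql("SQLTableFromDF", engine, if_exists="replace",
          chunksize=100, index=False)

# to_sql() returns no status message, so confirm the load with a query
confirm = pd.read_sql_query("SELECT COUNT(*) AS n FROM SQLTableFromDF", engine)
print(int(confirm["n"].iloc[0]))  # 250
```

With chunksize=100, the writer issues one INSERT per 100 rows, keeping each statement's parameter count under driver limits such as SQL Server's 2,100-parameter maximum.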
Unlike SAS, when the to_sql() writer writes to the RDBMS table, it does not return messages or provide a return code to indicate success. Of course, we can send queries to the RDBMS to validate the existence of the SQLTableFromDF SQL Server table, which is the purpose of Listing 6-33.
>>> confirm1 = pd.read_sql_query("SELECT column_name as 'COL_NAME', "
Notice the quoting. Each physical line in the call to the read_sql_query() reader requires double quotes. Single quotes are used to label column headings following the T-SQL AS keyword, along with the single quotes used in the T-SQL WHERE clause. The single quotes are passed to the RDBMS engine, since they are a required element of a valid T-SQL query.
Figure 6-25 displays the output from SQL Server Management Studio execution of this query.
In the case of Listing 6-32, there are no missing values in the from_sas DataFrame. Missing values between data formats can be challenging at times since each data format uses different sentinel values to indicate “missing”.
In the case of a DataFrame, missing values can be NaN for float64s and NaT (for Not a Time) for datetime64 types. Consider Listing 6-34. The goal of this example is to load the resulting to_mssql DataFrame into an SQL Server table to ensure subsequent SQL queries properly handle these missing values.
This example uses the loc() indexer to find rows meeting the logical condition that the quantity column value is less than 2000. For those rows, the amount and datetime column values are set to missing. For more details on using the loc() indexer to conditionally update values, see the section “Conditional Updates” in Chapter 5, “Data Management.”
In this example, notice the missing value for the datetime column, whose type is datetime64, is NaT. In both cases, the chained .isnull().sum() calls indicate there are rows with missing values in the to_mssql DataFrame.
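The conditional update pattern can be sketched on a small hypothetical DataFrame (toy data, not the chapter's copy of from_sas):

```python
import numpy as np
import pandas as pd

to_mssql = pd.DataFrame({
    "datetime": pd.to_datetime(["2018-12-01 00:00",
                                "2018-12-01 00:01",
                                "2018-12-01 00:02"]),
    "amount": [100.0, 200.0, 300.0],
    "quantity": [1500, 2500, 3500]})

# Where quantity < 2000, set amount and datetime to missing
mask = to_mssql["quantity"] < 2000
to_mssql.loc[mask, "amount"] = np.nan    # float64 missing value -> NaN
to_mssql.loc[mask, "datetime"] = pd.NaT  # datetime64 missing value -> NaT

print(to_mssql.isnull().sum())
```

The printed counts show one missing value each in the amount and datetime columns and none in quantity.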
Consider Listing 6-35. In this example, we copy the from_sas DataFrame to the to_mssql DataFrame. This is followed by a call to the to_sql() writer to load the to_mssql DataFrame to the SQLTableFromDF2 SQL Server table.
The call to the to_sql() writer creates the SQLTableFromDF2 SQL Server table. The syntax
>>> confirm2 = pd.read_sql_query("SELECT TOP 5 * "
... "FROM SQLTableFromDF2 "
... "WHERE amount is NULL ", engine)
executes a T-SQL query returning the first five rows from the table using the T-SQL WHERE clause to create the confirm2 DataFrame
WHERE amount is NULL
The call to the print() function for the confirm2 DataFrame shows that the datetime and amount values made the round trip from DataFrame to SQL Server table and back to a DataFrame, maintaining the integrity of missing values between both data formats.
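The round trip can be sketched with an in-memory SQLite database in place of SQL Server. SQLite has no TOP keyword, so the confirming query is simplified, and the toy two-row DataFrame is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///:memory:")  # stand-in for the SQL Server engine

df = pd.DataFrame({"datetime": [pd.Timestamp("2018-12-01"), pd.NaT],
                   "amount": [99.5, np.nan]})

# NaN and NaT are written to the table as SQL NULLs
df.to_sql("SQLTableFromDF2", engine, if_exists="replace", index=False)

# Read back only the rows where amount is NULL
confirm2 = pd.read_sql_query("SELECT * "
                             "FROM SQLTableFromDF2 "
                             "WHERE amount IS NULL", engine)
print(confirm2)
```

The returned row shows the NULLs surfacing again as missing values in the new DataFrame.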
Summary
Data analysis tasks often require building a data processing pipeline to acquire and format data among a number of different data formats. This chapter provides a solid grounding for acquiring data and loading it into DataFrames using the various pandas readers, and illustrates writing DataFrame values to .csv files, Excel files, and RDBMS tables. Adding this know-how to one’s skill set enables the user to build more elaborate scripts to tackle a range of data analysis tasks.