Back Matter


Appendix A: Generating the Tickets DataFrame

The following Python script is associated with Listing 4-27 in Chapter 4, “Indexing and GroupBy.”

The script generates the tickets DataFrame by calling the numpy randn random number generator to create the data values, defining the rows and columns, and assigning both the row labels and the column labels from MultiIndex objects. The syntax

index = pd.MultiIndex.from_product([[2015, 2016, 2017, 2018], [1, 2, 3]], names = ['Year', 'Month'])

creates the Python object index by calling the MultiIndex.from_product() constructor to form the MultiIndex structure as the Cartesian product of the Year values and the Month values. (The script at the end of this appendix names this object idx.) This structure provides the row labels for the tickets DataFrame shown in Listing 4-27. The MultiIndex object enables the “nesting” of the Month levels inside the Year levels for the rows.

Similarly, the syntax

columns = pd.MultiIndex.from_product([['City', 'Suburbs', 'Rural'], ['Day', 'Night']], names = ['Area', 'When'])

creates the Python object columns by calling the MultiIndex.from_product() constructor to form the MultiIndex structure as the Cartesian product of the Area levels and the When levels. This structure provides the column labels, nesting the When levels (‘Day’, ‘Night’) inside the Area levels (‘City’, ‘Rural’, ‘Suburbs’).
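
As a quick check of the Cartesian-product behavior described above, the following minimal sketch rebuilds both MultiIndex objects and inspects their sizes and first labels. It assumes only that pandas is installed; the variable names follow the script later in this appendix.

```python
import pandas as pd

# Row labels: 4 years x 3 months = 12 (Year, Month) pairs
idx = pd.MultiIndex.from_product([[2015, 2016, 2017, 2018], [1, 2, 3]],
                                 names=["Year", "Month"])

# Column labels: 3 areas x 2 periods = 6 (Area, When) pairs
columns = pd.MultiIndex.from_product([["City", "Suburbs", "Rural"],
                                      ["Day", "Night"]],
                                     names=["Area", "When"])

print(len(idx), len(columns))  # 12 6

# The last-listed level varies fastest, so Month is nested inside Year
# and When is nested inside Area.
print(list(idx[:4]))
print(list(columns[:3]))
```

The first four row labels cover the three months of 2015 followed by the first month of 2016, which is the nesting the DataFrame display relies on.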

Data values are generated with the syntax

data = np.round(np.random.randn(12, 6),2)

data is a 12-row by 6-column array of draws from the standard normal distribution, created by calling the numpy random number generator randn. The resulting values are rounded to two decimal places.

The second assignment to data

data = abs(np.floor_divide(data[:] * 100, 5))

replaces the values in the array. The slice data[:] selects every element; each element is multiplied by 100, and np.floor_divide() divides this product by 5 and rounds the quotient down to the nearest whole number. The abs() function then returns the absolute values, so every value is non-negative.
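
The effect of floor division followed by abs() is easiest to see on a few hand-picked values. The sketch below is illustrative only: the array scaled is a made-up stand-in for data * 100, not output from the seeded generator.

```python
import numpy as np

# Hypothetical stand-ins for data * 100 (not values from the seeded script)
scaled = np.array([12.0, -12.0, 25.0, -3.0])

# floor_divide rounds the quotient toward negative infinity:
#   12 / 5 =  2.4 ->  2      -12 / 5 = -2.4 -> -3
#   25 / 5 =  5.0 ->  5       -3 / 5 = -0.6 -> -1
counts = abs(np.floor_divide(scaled, 5))
print(counts)  # [2. 3. 5. 1.]
```

Note that a negative input and its positive counterpart can map to different results (12 gives 2, while -12 gives 3) because the rounding happens before the absolute value is taken.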

Finally, the syntax

tickets = pd.DataFrame(data, index=index, columns=columns).sort_index().sort_index(axis = 1)

constructs the tickets DataFrame. The first sort_index() method call sorts the “nested” row labels (created with the index object described previously), and the sort_index(axis = 1) method call sorts the “nested” column labels (created with the columns object described previously). Recall that axis = 0 refers to rows and axis = 1 refers to columns.
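
The following small sketch shows what the two sort_index() calls do to the labels. The DataFrame df is a hypothetical two-row frame of zeros built only for illustration; it is not the tickets data.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product([["City", "Suburbs", "Rural"],
                                      ["Day", "Night"]],
                                     names=["Area", "When"])

# Hypothetical frame: rows deliberately out of order, columns as constructed
df = pd.DataFrame(np.zeros((2, 6)), index=[2, 1], columns=columns)

# Sort the row labels (axis=0, the default), then the column labels (axis=1)
sorted_df = df.sort_index().sort_index(axis=1)

row_order = list(sorted_df.index)
area_order = list(sorted_df.columns.get_level_values("Area").unique())
print(row_order)   # [1, 2]
print(area_order)  # ['City', 'Rural', 'Suburbs']
```

This is why the printed tickets DataFrame lists the Area levels as City, Rural, Suburbs even though the constructor call lists them as City, Suburbs, Rural.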

>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(654321)
>>> idx = pd.MultiIndex.from_product([[2015, 2016, 2017, 2018],
...                                   [1, 2, 3]],
...                                  names = ['Year', 'Month'])
>>> columns = pd.MultiIndex.from_product([['City', 'Suburbs', 'Rural'],
...                                       ['Day', 'Night']],
...                                      names = ['Area', 'When'])
>>>
>>> data = np.round(np.random.randn(12, 6), 2)
>>> data = abs(np.floor_divide(data[:] * 100, 5))
>>>
>>> tickets = pd.DataFrame(data, index=idx, columns=columns).sort_index().sort_index(axis=1)
>>> print(tickets)
Area         City         Rural       Suburbs
When          Day  Night    Day  Night    Day  Night
Year Month
2015 1       15.0   18.0    9.0    3.0    3.0    3.0
     2       11.0   18.0    3.0   30.0   42.0   15.0
     3        5.0   54.0    7.0    6.0   14.0   18.0
2016 1       11.0   17.0    1.0    0.0   11.0   26.0
     2        7.0   23.0    3.0    5.0   19.0    2.0
     3        9.0   17.0   31.0   48.0    2.0   17.0
2017 1       21.0    5.0   22.0   10.0   12.0    2.0
     2        5.0   33.0   19.0    2.0    7.0   10.0
     3       31.0   12.0   19.0   17.0   14.0    2.0
2018 1       25.0   10.0    8.0    4.0   20.0   15.0
     2       35.0   14.0    9.0   14.0   10.0    1.0
     3        3.0   32.0   33.0   21.0   24.0    6.0

Appendix B: Many-to-Many Use Case

In Chapter 5, “Data Management,” we discuss table relationships as being one-to-one, one-to-many, or many-to-many. In a one-to-one relationship, or data model, with respect to the key columns, exactly one row in a table is associated with exactly one row in the other table. Said another way, the key column values in both tables must be unique.

In a one-to-many data model, with respect to the key columns, exactly one row in a table is associated with multiple rows in the other table. All the examples in Chapter 5, “Data Management,” illustrate a one-to-many relationship among the tables.

Finally, there is the case where the key column values are unique in neither table. This is a many-to-many data model.

In this appendix, we illustrate the results of a many-to-many join with the SAS Sort/Merge logic and with PROC SQL. This is followed by the corresponding pandas merge() operation.

With SAS, in cases where the table relationships are one-to-one or one-to-many, a SORT/MERGE (match-merge) and a PROC SQL outer join produce the same result set. In the case where the table relationship is many-to-many, these techniques return different result sets.

This example illustrates the differences between the results of the SAS Sort/Merge operation and a PROC SQL outer join when the table relationship is many-to-many. Observe the note in the log:

NOTE: MERGE statement has more than one dataset with repeats of BY values.

This note indicates that a many-to-many relationship exists between the tables in the match-merge operation. The default SORT/MERGE operation for tables with a many-to-many relationship is illustrated in Listing B-1.

4       data left;
5          infile datalines dlm=',';
6          input id $3.
7                value_l;
8          list;
9          datalines;

RULE:       ----+----1----+----2----+----3----+----4----+
1064        001, 4314
1065        001, 4855
1066        001, 4761
1067        002, 4991
1068        003, 5001
1069        004, 3999
1070        004, 4175
1071        004, 4101
NOTE: The dataset WORK.LEFT has 8 observations and 2 variables.

10      ;;;;

12      data right;
13         infile datalines dlm=',';
14         input id $3.
15               value_r;
16         list;
17         datalines;

RULE:       ----+----1----+----2----+----3----+----4----+
1080        004, 1133
1081        004, 1234
1082        004, 1111
1083        002, 1921
1084        003, 2001
1085        001, 2222
NOTE: The dataset WORK.RIGHT has 6 observations and 2 variables.

18      ;;;;

20      proc sort data=left;
21         by id;
22      run;

NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: The dataset WORK.LEFT has 8 observations and 2 variables.

24      proc sort data=right;
25         by id;
26      run;

NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.RIGHT has 6 observations and 2 variables.

28      data merge_lr;
29         merge left
30               right;
31         by id;
32      run;

NOTE: MERGE statement has more than one dataset with repeats of BY values.
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.MERGE_LR has 8 observations and 3 variables.

34      proc print data=left;
35         id id;
36      run;

NOTE: There were 8 observations read from the dataset WORK.LEFT.

37      proc print data=right;
38         id id;
39      run;

Listing B-1

SORT/MERGE for Tables with Many-to-Many Relationship

Figure B-1 displays the left dataset by calling PROC PRINT.

Figure B-1

Left Dataset

Figure B-2 displays the right dataset.

Figure B-2

Right Dataset

In a SORT/MERGE of tables having a many-to-many relationship, the number of observations returned for each BY group equals the larger of that group's observation counts in the two tables. Figure B-3 displays the merge_lr dataset, created by sorting and merging the input left and right datasets.

Figure B-3

Results of SORT/MERGE with Tables Having a Many-to-Many Relationship
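
The row-count rule for the match-merge can be checked with a few lines of plain Python. This is a sketch under stated assumptions: it uses only the id values from Listing B-1, and for contrast it also computes the row count of a join that pairs every matching row, which is the behavior of the outer joins shown later in this appendix.

```python
from collections import Counter

# id values from the left and right datasets in Listing B-1
left_ids = ["001", "001", "001", "002", "003", "004", "004", "004"]
right_ids = ["004", "004", "004", "002", "003", "001"]

left_n, right_n = Counter(left_ids), Counter(right_ids)
keys = sorted(set(left_n) | set(right_n))

# Match-merge: each BY group contributes max(n_left, n_right) rows
match_merge_rows = sum(max(left_n[k], right_n[k]) for k in keys)

# Full outer join: a key found in both tables contributes n_left * n_right
# rows; a key found in only one table contributes its own row count
join_rows = sum(
    left_n[k] * right_n[k] if left_n[k] and right_n[k]
    else left_n[k] + right_n[k]
    for k in keys
)

print(match_merge_rows, join_rows)  # 8 14
```

The first number agrees with the log note reporting 8 observations in WORK.MERGE_LR; the second agrees with the 14 rows returned by the pandas outer join in Listing B-3.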

To understand the behavior of this SORT/MERGE example, observe the values for id 001 in Figure B-3. The SAS Data Step SORT/MERGE logic does not produce a Cartesian product of the left and right tables; instead, it uses a row-by-row merge operation. The processing logic is as follows:

1. SAS reads the descriptor portion (header) of the left and right datasets and creates a program data vector (PDV) containing all variables from both datasets for the output merge_lr dataset.
   a. ID and value_l are contributed from the left dataset.
   b. value_r is contributed from the right dataset.
2. SAS determines which BY group should appear first. In this case, the first BY group, with the value 001 for ID, is the same for both input datasets.
3. SAS reads and copies the first observation from the left dataset into the PDV.
4. SAS reads and copies the first observation from the right dataset into the PDV.
5. SAS writes this observation to the output dataset, merge_lr.
6. SAS looks for the second observation in the BY group in the left and right datasets. The left dataset has one; the right dataset does not. The MERGE statement reads the second observation in the BY group from the left dataset. Because the right dataset has only one observation in the BY group for 001, the value of value_r from that observation is retained in the PDV for the second observation in the output dataset.
7. SAS writes the observation to the output dataset. When both input datasets contain no further observations for the BY group, SAS sets all values in the PDV to missing and begins processing the next BY group. It continues processing observations until it exhausts all observations from the input datasets.
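
The steps above can be approximated in pandas. The sketch below is an emulation, not a pandas or SAS API: the helper name match_merge and the temporary column _seq are hypothetical, and the data is taken from Listing B-1. It pairs rows by their position within each BY group and then carries the last value forward within the group, which mimics the retained PDV values. One caveat: the forward fill would also overwrite a genuinely missing input value, so this is only a rough analog.

```python
import pandas as pd

left = pd.DataFrame({"ID": ["001", "001", "001", "002", "003", "004", "004", "004"],
                     "value_l": [4314, 4855, 4761, 4991, 5001, 3999, 4175, 4101]})
right = pd.DataFrame({"ID": ["004", "004", "004", "002", "003", "001"],
                      "value_r": [1133, 1234, 1111, 1921, 2001, 2222]})

def match_merge(left, right, by):
    """Rough emulation of a SAS match-merge (hypothetical helper)."""
    # A stable sort keeps the original row order within each BY group
    l = left.sort_values(by, kind="mergesort").copy()
    r = right.sort_values(by, kind="mergesort").copy()
    # Number the rows within each BY group: 0, 1, 2, ...
    l["_seq"] = l.groupby(by).cumcount()
    r["_seq"] = r.groupby(by).cumcount()
    # Pair the n-th row of a group on the left with the n-th row on the right
    out = pd.merge(l, r, on=[by, "_seq"], how="outer").sort_values([by, "_seq"])
    # Carry the last value forward within the group (the "retained" PDV value)
    value_cols = [c for c in out.columns if c not in (by, "_seq")]
    out[value_cols] = out.groupby(by)[value_cols].ffill()
    return out.drop(columns="_seq").reset_index(drop=True)

merged = match_merge(left, right, "ID")
print(merged)
```

For ID 001 the single value_r of 2222 is repeated on all three rows, and for ID 004 the three rows from each table are paired one-to-one, giving 8 rows in total rather than the 14 produced by a join.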

Next, consider Listing B-2. In this case, we call PROC SQL to execute a full outer join on the left and right tables. An outer join with the keywords FULL JOIN and ON returns both matched and unmatched rows from the left and right tables.

proc sql;
   select coalesce(left.id, right.id)
         ,value_l
         ,value_r
   from left
   full join
   right
   on left.id = right.id;
quit;

Listing B-2

PROC SQL Outer Join

The results are displayed in Figure B-4.

Figure B-4

SAS Outer Join for Left and Right Tables

The pandas analog to the PROC SQL outer join is illustrated in Listing B-3.

In this example, the left and right DataFrames are created by calling the DataFrame constructor. Notice that both DataFrames are created with the default RangeIndex.

>>> import pandas as pd
>>> left = pd.DataFrame([['001', 4123],
...                      ['001', 4855],
...                      ['001', 4761],
...                      ['002', 4991],
...                      ['003', 5001],
...                      ['004', 3999],
...                      ['004', 4175],
...                      ['004', 4101]],
...                     columns=['ID', 'Value_l'])
>>> right = pd.DataFrame([['004', 1111],
...                       ['004', 1234],
...                       ['004', 1133],
...                       ['002', 1921],
...                       ['003', 2001],
...                       ['001', 2222]],
...                      columns=['ID', 'Value_r'])
>>> nl = '\n'
>>>
>>> print(nl,
...       left,
...       nl,
...       right)

    ID  Value_l
0  001     4123
1  001     4855
2  001     4761
3  002     4991
4  003     5001
5  004     3999
6  004     4175
7  004     4101

    ID  Value_r
0  004     1111
1  004     1234
2  004     1133
3  002     1921
4  003     2001
5  001     2222
>>> merge_lr = pd.merge(left, right, how="outer", sort=True)
>>> print(merge_lr)
     ID  Value_l  Value_r
0   001     4123     2222
1   001     4855     2222
2   001     4761     2222
3   002     4991     1921
4   003     5001     2001
5   004     3999     1111
6   004     3999     1234
7   004     3999     1133
8   004     4175     1111
9   004     4175     1234
10  004     4175     1133
11  004     4101     1111
12  004     4101     1234
13  004     4101     1133

Listing B-3

pandas merge with Many-to-Many Relationship

The syntax

merge_lr = pd.merge(left, right, how="outer", sort=True)

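creates the merge_lr DataFrame as an outer join of the left and right DataFrames, producing the same table output as the SAS PROC SQL logic from Listing B-2. Because no on= argument is given, merge() joins on the column the two DataFrames have in common, ID.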
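
When a many-to-many result like this one is not intended, the validate= argument of merge() can flag it. The following is a minimal sketch using the data from Listing B-3; the flag name keys_unique is introduced here only for illustration.

```python
import pandas as pd

left = pd.DataFrame({"ID": ["001", "001", "001", "002", "003", "004", "004", "004"],
                     "Value_l": [4123, 4855, 4761, 4991, 5001, 3999, 4175, 4101]})
right = pd.DataFrame({"ID": ["004", "004", "004", "002", "003", "001"],
                      "Value_r": [1111, 1234, 1133, 1921, 2001, 2222]})

# Naming the key explicitly gives the same result as relying on the shared column
merge_lr = pd.merge(left, right, on="ID", how="outer", sort=True)
print(len(merge_lr))  # 14

# validate="one_to_one" raises MergeError if the key repeats in either table
try:
    pd.merge(left, right, on="ID", how="outer", validate="one_to_one")
    keys_unique = True
except pd.errors.MergeError:
    keys_unique = False

print(keys_unique)  # False
```

Other accepted values include "one_to_many" and "many_to_one", which check uniqueness on only one side of the merge.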

Index

A

Anaconda3

adding path

environmental variables

Linux installation

PC control panel

PC properties

system properties

troubles in installation

windows installation

See Windows installation

append() method

asctime() function

astimezone() method

astype() attribute

B

beg_day variable

Boolean data type

and operator

AND/OR/NOT

chained comparisons

comparison operations

See Comparison operators

empty and non-empty sets

FINDC function

IN/NOT IN

numeric inequality

or operation

SAS AND operator

precedence

Python objects

string equality

True and False

Boolean string equality

C

Calendar.day_name attribute

calendar.weekday() function

call streaminit function

Cloud Analytic Services (CAS)

CMISS function

Columns, return

by position

DataFrame constructor method

default RangeIndex

df SAS dataset

Column types

combine() function

Comma-separated values (.csv) files

basic code

convert= argument

date handling in

dtype= argument

index attribute

input

setting na_values

strip_sign function

Comparison operators

chained comparisons

empty and non-empty sets

equivalence test

is comparison

numeric inequality

string equality

Concatenation

append method

concat method

DataFrame method

hierarchical index

IN= dataset

PROC APPEND

PROC SQL UNION ALL set operator

SET statement

concat method

Count method

country_timezones() function

create_engine function

D

DataFrame Inspection

dropping missing values

See Missing values, drop

head Function

missing data

df1 DataFrame

df2 DataFrame

NaN

None object

missing values

See Missing value detection

SAS

tail Method

DataFrames

construction

datasets

describe function

dtype attribute

histogram

ID column

info method

JSON

See JavaScript Object Notation (JSON)

RDBMS

See RDBMS Tables

read_csv Method

reading .csv files

See Comma-separated values (.csv) files

read SAS database

read .xls files

absolute vs. relative filenames

dataset output

glob module

keep_date_col= argument

local filesystem

multiple files

write .csv files

write .xls files

DataFrame to SAS dataset

bar() attribute

loansdf.describe() method

loan status histogram

return column information

sas.df2sd() method

SASPy Returns Log

Teach_me_SAS Attribute

Data management

concatenation

See Concatenation

conditional update

Boolean mask

elif keyword

.loc indexer

SAS

with function

convert types

dropping duplicates

finding duplicates

INNER JOIN

pandas

PROC SQL

joining on index

INNER JOIN

key column

LEFT JOIN

methods

outer join

LEFT JOIN

data step

pandas

PROC SQL

results

unmatched keys

map column values

outer join

pandas

PROC SQL

unmatched keys

rename columns

RIGHT JOIN

COALESCE function

data step

pandas

PROC SQL

results

unmatched keys

SAS sort/merge

datasets

merge method

transpose

update operations

DataFrame

ID column

merge method

SAS

validate keys

Data wrangling

date() and time() methods

date() constructor

Date formatting

calling strftime() and strptime() methods

date object to strings

print() function

Python datetime format directives

Python strings to date object

SAS character variable to date

SAS dates to character variable

strftime() method

string formatting with strftime() method

strptime() method

Date manipulation

Calendar.day_name Attribute

count days until birthday

Python

SAS

date replace method

.isocalendar attribute

isoweekday() function

next_birth_day variable

Python

return week day names

SAS

SAS weekday and week functions

toordinal() method

weekday() function

Date Object

manipulation

See Date manipulation

return today’s date

date attributes

date range

Python today() function

SAS Date functions

SAS TODAY function

shifting dates

datetime arithmetic

datetime() constructor

datetime.datetime() constructor method

Datetime format directives

DATETIME function

Datetime handling

default datetime format, SASPy

export SAS Datetimes to DataFrame

SASPy

SASPy Submit Method

datetime.now() method

Datetime object

combining times and dates

Python

SAS

Datetime Constructor methods

Python datetime object to strings

Python strings to datetime objects

returning datetime components

Python attributes

SAS functions

SAS character variable to datetime

SAS datetime constants

SAS datetime to character variables

datetime.today() constructor method

datetime.utc() constructor

datetime.utcnow() method

df_dates DataFrame

df.describe() method

DO/END block

dropna method

E

Equivalence test

F

fillna method

arithmetic mean

column subset

dictionary

FINDC function

first_biz_day_of_month function

format() method

Formatting

datetime

floats

integers

specifications

strings

fromordinal() function

f-string, formatting

G

glob module

gmtime() function

GroupBy

actions

columns, continuous values

d_grby_sum DataFrame

district variable

filtering

gb object

GroupBy object

iteration

keys and groups

steps

summary statistics

sum over

transform

H

Hierarchical indexing

I

Index

DataFrame constructor method

df.set_index method

.iloc indexer

.loc indexer

print function

set_index method

info method

INTCK function

Integrated Development Environment (IDE)

loop.py

SAS

Spyder

INTNX function

isnull method

Boolean values

mathematical operations

NaN

SAS sum function

SELECT statement

sum method

isoweekday() function

J, K

JavaScript Object Notation (JSON)

read

API

records

SAS

write

Jupyter notebook

Home page

Linux

loop.py Script

path on Windows

L

last_biz_day_of_month() function

LIBNAME statement

Libref to Python

loandf DataFrame

loandf.info() method

loandf initial attributes

loansdf.describe() method

localize() method

Local Mean Time (LMT)

local nxt_mn datetime object

localtime() function

M

Match-merge operation

max and min methods, column

merge method

Missing value detection

count values

IF/THEN logic

isnull

See isnull method

pandas function

PROC FORMAT

PROC FREQ

SAS

Missing values, drop

CMISS function

df.dropna method

dropna method

dropna update

duplicate rows

PROC FREQ

PROC SORT

SAS variables

thresh parameter

Missing values, imputation

arithmetic mean

fillna method

PROC SQL

MultiIndexing

ABS function

advanced indexing with

conditional slicing

slicing rows and columns

cross sections

df DataFrame

INT function

month variable

pandas index

print (tickets.index) statement

Python tuple

RAND function

subsets with

tickets DataFrame

MultiIndex object

N

Naïve and aware datetime objects

NaN (not a number) object

NODUPRECS option

None object

normalize() method

notnull method

Numeric data types

Python operators

SAS

numpy.random library

O

One-to-many data model

P, Q

pandas library

key features

many-to-many operation

SAS

parse_dates= argument

pd.MultiIndex.from_product constructor

pd.read_sql_table method

pd.to_datetime function

pivot_table function

arguments

basics

improvements

read_csv method

sales

SAS

print() function

PROC FORMAT PICTURE formatting directives

PROC IMPORT

PROC MEANS

PROC PRINT

PROC SQL

PROC SQL SELECT statement

PUT function

py_fmt. format

Python date range

Python installation, troubles in Windows

add Anaconda3 to path

environmental variables

PC control panel

PC properties

system properties

Python script

case sensitivity

execution, Windows

IDE

indented block error

line continuation

Linux

loop.py

pytz library

astimezone() method

country_timezones() function

datetime astimezone() conversion function

datetime manipulation logic

handling DST transition

interaction with tzinfo attribute

localize() method

normalize() method

print() function

random pytz Common Time zone

replace() function

tm_end datetime object

UTC conversion, datetime arithmetic

R

RangeIndex object

RDBMS tables

query

customer dataset

customerPy Attributes

PROC SQL statements

SQL pass-thru

SQL query to database

SQL statements

read

change SQL server language

components

create data source

create_engine function

default database, change

ODBC

pd.read_sql_table method

SQL server authentication

test results

write

missing values

SQL server

to_sql syntax

.to_sql writer

read_csv method

replace() function

round() function

Rows and columns, return

by label

conditionals

DataFrame, add index

default RangeIndex object

.loc indexer

print function

set_index method

slices

updation

by position

dataset df

.iloc indexer

noobs option

slice object

Rows, return

by position

DataFrame row slicing

PROC PRINT displays

start position

S

Sampling

capability

from dataset

IF-THEN/DO block

random number generator

sascfg_personal.py configuration file

classpath Variable

.jar files

winlocal object definition

sas_code Docstring

SAS code, execution

sas_code object

SAS.submit() method

HTML

output

SASdata object methods

SAS dataset to DataFrame

cars_df DataFrame

credit risk grades

ds_options object

export

plot.bar() method

SAS Python Pipeline

sas.sasdata2dataframe() method

sas.sd2df() method

WORK.grade_sum

SAS Date functions

sas.df2sd() method

SAS language

assignment and concatenation

data types

datetime

outer join

percent format

plussign. format

round function

row-by-row merge operation

SORT/MERGE

string equality

SUBSTR function

UPCASE function

SAS Macro variable return codes

SASPy module

attributes, character value columns

autoexec processing

build loandf DataFrame

df.describe() method

execute SAS code

installation

.jar files

passing SAS macro variables

prompting

Python and SAS session

sascfg_personal.py configuration file

SASPy assigned_librefs method

scripting

saspy.SASsession() argument

SASPy Session

sas.saslib() method

sas.sd2df() method

SASsession() method

sas.submit() method

sas.symget() method

sas.symput() method

SAS Time zone

conversions

differences

formats

functions

option

setting, option

Scripting SASPy

automating Python scripts

bar() method

loandf DataFrame

non-interactive mode

sas.df2sd() method

sas.saslib() method

set_batch() method

SASPy sd2df method

Sequence indexing

Series

construction

index retrieval

index values

mathematical operations

random values

set_batch() method

set_index method

Shifting dates

Slicing operator

Sorting

ascending and descending

NaN’s

PROC SORT

PROC SQL with ORDER BY

sort_values attribute

SORT/MERGE

many-to-many

row-by-row

strftime() function

Strings

assignment and concatenation

formatting

See Formatting

multiline

quoting

slicing

strip_sign function

strptime() function

str.strip() method

struct_time object

Sum operation

SYSLIBRC method

T

TableVar variable

teach_me_SAS() attribute

tickets DataFrame

timedelta() method

Timedelta object

addition and subtraction operations

approach, dates finding

beg_day and end_day variables

calling functions

f_day_mo variable

fd_mn object

first and last day of month

Python

SAS

first_biz_day_of_month function

INTCK and INTNX functions

last_biz_day_of_month functions

l_day_mo variable

SAS first and last business day of month

timedelta arithmetic

weekday() function

Time formatting

Time object

formatting

gmtime() function

Python strings to time object

Python Time Epoch

Python time object to strings

Python time_struct Object

SAS character variable

SAS time constants to strings

Time of Day

Time of Day, return

Python

SAS

time_struct object

time.time() function

timezone() function

Time zone object

naïve and aware datetimes

pytz library

See pytz library

TIMEZONE options

today() function

toordinal() method

type() method

tzinfo attribute

TZONEDSTOFF function

TZONEDTNAME function

TZONENAME function

TZONEOFF function

TZONESTTOFF function

tzones2u() function

U, V

upper() method

W, X, Y

weekday() function

WEEK() function

Windows installation

advanced options

license agreement

select location

select type

start

troubles

See Python installation, troubles in Windows

WORK.grade_sum dataset

WORK library

Z

Zero-based offset
