Appendix B: Many-to-Many Use Case
In Chapter
, “Data Management,” we discuss table relationships as being one-to-one, one-to-many, or many-to-many. In a one-to-one relationship, or data model, with respect to the key columns, there is exactly one row in a table that is associated with exactly one row in the other table. Said another way, the key column values in both tables must be unique.
In a
one-to-many data model
, with respect to the key columns, there is exactly one row that is associated with multiple rows in the other table. All the examples in Chapter
, “Data Management,” illustrate a one-to-many relationship among the tables.
And of course, there is the case where neither key column values among tables are unique, in this case, a many-to-many data model.
In this appendix, we illustrate the results of a many-to-many join with the SAS Sort/Merge logic and with
PROC SQL
. This is followed by the pandas
merge()
and corresponding
join()
methods.
With SAS, in cases where the table relationships are one-to-one or one-to-many, a
SORT/MERGE (match-merge) and a PROC SQL outer join produce the same result set. In the case where the table relationship is many-to-many, these techniques return different result sets.
This example illustrates differences between results created by the SAS Sort/Merge operation and a
PROC SQL
outer join in those cases where the table relationship is many-to-many. Observe the note in the log
NOTE: MERGE statement has more than one dataset with repeats of BY values.
This indicates a many-to-many relationship exists between the tables in the
match-merge operation
. The default SORT/MERGE operation for tables with a many-to-many relationship is illustrated in Listing
B-1
.
4 data left;
5 infile datalines dlm=',';
6 input id $3.
7 value_l;
8 list;
9 datalines;
RULE: + 1 + 2 + 3 + 4 +
1064 001, 4314
1065 001, 4855
1066 001, 4761
1067 002, 4991
1068 003, 5001
1069 004, 3999
1070 004, 4175
1071 004, 4101
NOTE: The dataset WORK.LEFT has 8 observations and 2 variables.
10 ;;;;
11
12 data right;
13 infile datalines dlm=',';
14 input id $3.
15 value_r;
16 list;
17 datalines;
RULE: + 1 + 2 + 3 + 4 +
1080 004, 1133
1081 004, 1234
1082 004, 1111
1083 002, 1921
1084 003, 2001
1085 001, 2222
NOTE: The dataset WORK.RIGHT has 6 observations and 2 variables.
18 ;;;;
19
20 proc sort data=left;
21 by id;
22 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: The dataset WORK.LEFT has 8 observations and 2 variables.
23
24 proc sort data=right;
25 by id;
26 run;
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.RIGHT has 6 observations and 2 variables.
27
28 data merge_lr;
29 merge left
30 right;
31 by id;
32 run;
NOTE: MERGE statement has more than one dataset with repeats of BY values.
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.MERGE_LR has 8 observations and 3 variables.
33
34 proc print data=left;
35 id id;
36 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
37 proc print data=right;
38 id id;
39 run;
Listing B-1
SORT/MERGE for Tables with
Many-to-Many
Relationship
Figure
B-1
displays the
left
dataset by calling
PROC PRINT
.
Figure
B-2
displays the
right
dataset.
In a SORT/MERGE with tables having a many-to-many relationship, the number of observations with duplicate values returned is equal to the maximum of duplicates from both tables. Figure
B-3
displays the
merge_lr dataset
, created by sorting and merging the input
left
and
right
dataset.
To understand the behavior for this SORT/MERGE example carefully, observe the values for
id
001 in Figure
B-3
. To begin, the SAS Data Step SORT/MERGE logic does not produce a Cartesian product for the
left
and
right
tables and instead uses a
row-by-row merge operation
. This processing logic is
- 1.
SAS reads the descriptor portion (header) of the
left
and
right
dataset and creates a program data vector (PDV) containing all variables from both datasets for the output
merge_lr
dataset.
- a.
ID
and
value_l
are contributed from the
left
dataset.
- b.
value_r
is contributed from the
right
dataset.
- 2.
SAS determines which BY group should appear first. In this case, observation with the value 001 for
ID
is the same for both input datasets.
- 3.
SAS reads and copies the first observation from the
left
dataset into the PDV.
- 4.
SAS reads and copies the first observation from the
right
dataset into the PDV.
- 5.
SAS writes this observation to the output dataset,
merge_lr
.
- 6.
SAS looks for the second observation in the BY group in the
left
and
right
dataset. The
left
dataset has one; the
right
dataset does not. The
MERGE
statement reads the second observation in the BY group from the
left
dataset. And since the
right
dataset has only one observation in the BY group, the value 001 is retained in the PDV for the second observation in the output dataset.
- 7.
SAS writes the observation to the output dataset. When both input datasets contain no further observations for the BY group, SAS sets all values in PDV to missing and begins processing the next BY group. It continues processing observations until it exhausts all observations from the input
datasets
.
Next consider Listing
B-2
. In this case, we call
PROC SQL
to execute a full outer join on the
left
and
right
tables. In this example, an outer join with the keywords
FULL JOIN
and
ON
returns both matched and unmatched rows from the
left
and
right
tables.
5 proc sql;
6 select coalesce(left.id, right.id)
7 ,value_l
8 ,value_r
9 from left
10 full join
11 right
12 on left.id = right.id;
13 quit;
Listing B-2PROC SQL Outer Join
The results are displayed in Figure
B-4
.
The analog to the outer join with PROC SQL is illustrated in Listing
B-3
.
In this example, the
left
and
right
DataFrames are created by calling the
DataFrame
create method. Also notice how both DataFrames are created with the default RangeIndex.
>>> left = pd.DataFrame([['001', 4123],
... ['001', 4855],
... ['001', 4761],
... ['002', 4991],
... ['003', 5001],
... ['004', 3999],
... ['004', 4175],
... ['004', 4101]],
... columns=['ID', 'Value_l'])
>>> right = pd.DataFrame([['004', 1111],
... ['004', 1234],
... ['004', 1133],
... ['002', 1921],
... ['003', 2001],
... ['001', 2222]],
... columns=['ID', 'Value_r'])
>>> nl = '
'
>>>
>>> print(nl ,
... left ,
... nl ,
... right)
ID Value_l
0 001 4123
1 001 4855
2 001 4761
3 002 4991
4 003 5001
5 004 3999
6 004 4175
7 004 4101
ID Value_r
0 004 1111
1 004 1234
2 004 1133
3 002 1921
4 003 2001
5 001 2222
>>> merge_lr = pd.merge(left, right, how="outer", sort=True)
>>> print(merge_lr)
ID Value_l Value_r
0 001 4123 2222
1 001 4855 2222
2 001 4761 2222
3 002 4991 1921
4 003 5001 2001
5 004 3999 1111
6 004 3999 1234
7 004 3999 1133
8 004 4175 1111
9 004 4175 1234
10 004 4175 1133
11 004 4101 1111
12 004 4101 1234
13 004 4101 1133
Listing B-3
pandas
merge with Many-to-Many Relationship
The syntax
merge_lr = pd.merge(left, right, how="outer", sort=True)
creates the
merge_lr
DataFrame as an outer join for the
left
and
right
DataFrames, creating the same table output as the SAS
PROC SQL
logic from Listing
B-2
.