Slowly Changing Dimensions (SCD), as the name suggests, allows you to maintain changes in the Dimension
table in Datawarehouse
. Before you read this chapter, make sure that you have a complete understanding of data warehousing concepts, especially SCD. Also, make sure you know the SCD1, SCD2, and SCD3 types. For your reference, we have described each SCD in detail in this chapter. For more details, refer to The Data Warehouse Toolkit, Ralph Kimball, Wiley India Private Limited. Before we move ahead with the implementation of the SCD in Informatica Power Center, let's discuss the different types of SCDs.
Note that we are talking about the general SCDs in our discussion, that is, SCD1, SCD2, and SCD3. Apart from these, there will always be Hybrid SCDs, which you will come across as well. Hybrid SCDs are nothing but a combination of multiple SCDs that serve your complex business requirements.
The various types of SCDs are described as follows:
PM_VERSION_NUMBER
) by maintaining the version number in the table to track the changes. We use a new PM_PRIMARYKEY
column to maintain the history.PM_CURRENT_FLAG
) by maintaining the flag in the table to track the changes. We use a new PRIMARY_KEY
column to maintain the history.PM_BEGIN_DATE
and PM_END_DATE
) by maintaining the date range in the table to track the changes. We use a new PRIMARY_KEY
column to maintain the history.PM_PREV_COLUMN_NAME
column; that is, we do not maintain full history.Let's take an example to understand the different SCDs.
Consider that there is a LOCATION
column in the EMPLOYEE
table and you wish to track the changes in the location of the employees. Consider a record for the 1001
employee ID that is present in your EMPLOYEE
dimension table. STEVE
was initially working in India and then was shifted to USA. We want to maintain the history in the LOCATION
field.
EMPLOYEE_ID |
NAME |
LOCATION |
---|---|---|
1001 |
STEVE |
INDIA |
Your datawarehouse
table should reflect the current status of STEVE
. To implement this, we have different types of SCDs.
Take a look at the following table of type SCD1:
PM_PRIMARY_KEY |
EMPLOYEE_ID |
NAME |
LOCATION |
---|---|---|---|
100 |
1001 |
STEVE |
USA |
As you can see, INDIA
will be replaced with USA
, so we end up having only current data, and we lose historical data. Now, if STEVE
is again shifted to JAPAN
, the LOCATION
data will be replaced from USA
to JAPAN
, as follows:
PM_PRIMARY_KEY |
EMPLOYEE_ID |
NAME |
LOCATION |
---|---|---|---|
100 |
1001 |
STEVE |
JAPAN |
The advantage of SCD1 is that we do not consume a lot of space to maintain the data; the disadvantage is we don't have the historical data.
Take a look at the following table of type SCD2, where we have added the version number:
PM_PRIMARYKEY |
EMPLOYEE_ID |
NAME |
LOCATION |
PM_VERSION_NUMBER |
---|---|---|---|---|
100 |
1001 |
STEVE |
INDIA |
0 |
101 |
1001 |
STEVE |
USA |
1 |
102 |
1001 |
STEVE |
JAPAN |
2 |
200 |
1002 |
MIKE |
UK |
0 |
As you can see, we are maintaining the full history by adding new records to maintain the history of the previous records. We add two new columns in the table, that is, PM_PRIMARYKEY
to handle the issues of duplicate records in the primary key in the EMPLOYEE_ID
(supposed to be the primary key) column, and PM_VERSION_NUMBER
to understand the current and historical records.
The following SCD2 table has the flag column added to it:
PM_PRIMARYKEY |
EMPLOYEE_ID |
NAME |
LOCATION |
PM_CURRENT_FLAG |
---|---|---|---|---|
100 |
1001 |
STEVE |
INDIA |
0 |
101 |
1001 |
STEVE |
USA |
1 |
As you can see, we are maintaining the full history by adding new records to maintain the history of the previous records. We add two new columns in the table, that is, PM_PRIMARYKEY
to handle the issues of duplicate records in the primary key in the EMPLOYEE_ID
column, and PM_CURRENT_FLAG
to understand the current and history record.
Again, if STEVE
is shifted, the data would look like this:
PM_PRIMARYKEY |
EMPLOYEE_ID |
NAME |
LOCATION |
PM_CURRENT_FLAG |
---|---|---|---|---|
100 |
1001 |
STEVE |
INDIA |
0 |
101 |
1001 |
STEVE |
USA |
0 |
102 |
1001 |
STEVE |
JAPAN |
1 |
The following table of type SCD2 shows you the data range added to it:
PM_PRIMARYKEY |
EMPLOYEE_ID |
NAME |
LOCATION |
PM_BEGIN_DATE |
PM_END_DATE |
---|---|---|---|---|---|
100 |
1001 |
STEVE |
INDIA |
01-01-14 |
31-05-14 |
101 |
1001 |
STEVE |
USA |
01-06-14 |
99-99-9999 |
As you can see, we are maintaining the full history by adding new records to maintain the history of the previous records. We add three new columns in the table, that is, PM_PRIMARYKEY
to handle the issues of duplicate records in the primary key in the EMPLOYEE_ID
column and PM_BEGIN_DATE
and PM_END_DATE
to understand the versions in the data.
The advantage of SCD2 is that you have the complete history of the data, which is a must for data warehouses, whereas the disadvantage of SCD2 is that it consumes a lot of space.
Take a look at the following SCD3 table:
PM_PRIMARYKEY |
EMPLOYEE_ID |
NAME |
LOCATION |
PM_PREV_LOCATION |
---|---|---|---|---|
100 |
1001 |
STEVE |
USA |
INDIA |
As you can see, we are maintaining the history by adding a new column to maintain the history. An optional PM_PRIMARYKEY
column can be added to maintain the primary key constraints. We add a new PM_PREV_LOCATION
column in the table to store the changes in the data. As you can see, we added a new column to store data as against SCD2, where we added rows to maintain the history.
If STEVE
is now shifted to JAPAN
, the data changes to:
PM_PRIMARYKEY |
EMPLOYEE_ID |
NAME |
LOCATION |
PM_PREV_LOCATION |
---|---|---|---|---|
100 |
1001 |
STEVE |
JAPAN |
USA |
As you can see, we lost INDIA
from the data warehouse, and that is why we say we are maintaining partial history.
SCD3 is best when you are not interested in maintaining the complete history but are interested in maintaining only partial history. The drawback of SCD3 is that it doesn't store the full history.
At this point, you should be very clear with the different types of SCD. We need to practically implement these concepts in Informatica PowerCenter. Informatica PowerCenter provides a utility called wizard to implement the SCD. Using this wizard, you can easily implement any SCD. In the upcoming chapters, we will learn how to use the wizard to implement SCD1, SCD2, and SCD3.
Before you proceed to the next section, make sure you have a proper understanding of the transformations in Informatica PowerCenter. You should be clear about the source qualifier, expression, filter, router, lookup, update strategy, and sequence generator transformations. The wizard creates a mapping using all these transformations to implement the SCD functionality.
When we implement SCD, there will be some new records that will need to be loaded into the target table, and there will be some existing records for which we need to maintain the history.
The record that appears for the first time in the table will be referred to as the NEW
record, and the record for which we need to maintain the history will be referred to as the CHANGED
record. Based on the comparison of the source data with the target data, we will decide which one is the NEW
record and which one is the CHANGED
record.
To start, we will use a sample file as our source and an Oracle table as our target in order to implement SCDs. Before we implement an SCD, let's talk about the logic that will serve our purpose, and then we will fine-tune the logic for each type of SCD:
NEW
and CHANGED
records.NEW
and CHANGED
flags.NEW
record into the table and update the existing record, if needed.Based on the specific SCD, the preceding logic will be modified to a certain extent.
To implement SCD1 using wizard, perform the following steps:
m_SCD1
) of the new SCD mapping. Select Type 1 Dimension - keep most recent values in the target, as we are implementing SCD1. Click on Next as follows:EMP_FILE.txt
as the source file for our reference. Also, specify the name of the target you wish to create. We will name the target as EMPLOYEE_SCD1
in this book for our reference. Click on Next.Key
column of the source. Also, add LOCATION under Fields to compare the changes. This specifies the column for which you wish to maintain the history. Click on Finish.Before we proceed further, we need to make some points clear:
Oracle
table as a reference. You can do this in the Target Designer. Drag the target (EMPLOYEE_SCD1
) created by the wizard in the Target Designer, double-click to open the properties, and change the database type of Oracle
. This will change the type of target from the file to Oracle. Once you modify the target table to the Oracle database, the mapping will look like the following figure:NEW
and CHANGED
flows, respectively. Understand clearly that these two structures refer to the same Oracle table EMPLOYEE_SCD1
. Even though the name (EMPLOYEE_SCD1
and EMPLOYEE_SCD11
) is different in the view, when you double-click on the target instances in the Table tab, you can see Table Name as EMPLOYEE_SCD1
.As we are done with the mapping, it's time to analyze it. It is very important to understand each component of the mapping.
The Informatica PowerCenter SCD1 mapping uses a lookup transformation to look up the data in the target table and uses expression transformation to compare the target data with the source data. Based on the comparison, the expression transformation marks a record as a NEW
flag or a CHANGED
flag. The mapping is divided into two flows:
FIL_InsertNewRecord
filter transformation allows only the NEW
record to pass further and filters the records marked as CHANGED
from the first flow. It passes new records to UPD_ForceInserts
, which inserts these records into the target. The sequence generator generates the primary key for each NEW
record.FIL_UpdateChangedRecord
filter transformation allows only the CHANGED
record to pass further and filters the records marked as NEW
from the second flow. It passes the changed records to UPD_ChangedUpdate
, which replaces existing rows in the target to reflect the latest changes.Let's understand each transformation that is used in the SCD1 mapping:
SQ_EMP_FILE
): This extracts the data from the file or table you used as the source in the mapping. It passes data to the downstream transformations, that is, lookup, expression, and filter transformation.LKP_GetData
): This is used to look up the target table. It caches the existing data from the EMPLOYEE_SCD1
table.The EMPLOYEE_ID=IN_EMPLOYEE_ID
condition in the Condition tab will compare the data with the source table and the target table. Based on the comparison, it passes the required data to the expression transformation.
EXP_DetectChanges
): This receives the data from the upstream transformation, and based on the comparison, it creates two flags, which are NewFlag
and ChangedFlag
. In our case, we are using the LOCATION
field for comparison.For every record that comes from a source, if there is no matching record in target, we can flag that record as NewFlag
; that is, the EMPLOYEE_ID != EMPLOYEE_ID
condition signifies NewFlag
. If no matching record is present for EMPLOYEE_ID
in the target, it signifies that PM_PRIMARYKEY
will not be available. So, the lookup transformation will return NULL
for the PM_PRIMARYKEY
column.
For every record that comes from a source, if there is a matching record in the target and if the location from source does not match the location for a particular EMPLOYEE_ID
from the target, we can flag that record as ChangedFlag
, that is, EMPLOYEE_ID = EMPLOYEE_ID AND LOCATION != PM_PREV_LOCATION
:
NewFlag
as IIF(ISNULL(PM_PRIMARYKEY), TRUE, FALSE)
ChangedFlag
is IIF(NOT ISNULL(PM_PRIMARYKEY) AND (DECODE(LOCATION,PM_PREV_LOCATION,1,0)=0), TRUE, FALSE)
Based on the condition, it passes the data to downstream filter transformations.
FIL_InsertNewRecord
): This filters the records that come from an upstream expression transformation and are marked as ChangedFlag
; it only allows records with NewFlag
to get passed to the UPD_ForceInserts
update strategy.FIL_UpdateChangedRecord
): This filters the records that come from an upstream expression transformation and are marked as NewFlag
; it only allows records with ChangedFlag
to get passed to the UPD_ChangedUpdate
update strategy.UPD_ForceInserts
): This uses the DD_INSERT
condition to insert data into the target, which is EMPLOYEE_SCD1
.UPD_ChangedUpdate
): This uses the DD_UPDATE
condition to overwrite the existing LOCATION
field into the EMPLOYEE_SCD11
target instance.SEQ_GenerateKeys
): This generates a sequence of values for each row marked as NewFlag
, which is then incrementally loaded into the target by 1. It populates the value into PM_PRIMARYKEY
in the EMPLOYEE_SCD1
target instance.EMPLOYEE_SCD1
): This is the target table instance that accepts the NewFlag
records into the target table.EMPLOYEE_SCD11
): This is the target table instance that accepts the ChangedFlag
records into the target table.