Chapter 3. Implementing SCD – Using Designer Screen Wizards

Slowly Changing Dimensions (SCD), as the name suggests, allows you to maintain changes in the Dimension table in Datawarehouse. Before you read this chapter, make sure that you have a complete understanding of data warehousing concepts, especially SCD. Also, make sure you know the SCD1, SCD2, and SCD3 types. For your reference, we have described each SCD in detail in this chapter. For more details, refer to The Data Warehouse Toolkit, Ralph Kimball, Wiley India Private Limited. Before we move ahead with the implementation of the SCD in Informatica Power Center, let's discuss the different types of SCDs.

Note that we are talking about the general SCDs in our discussion, that is, SCD1, SCD2, and SCD3. Apart from these, there will always be Hybrid SCDs, which you will come across as well. Hybrid SCDs are nothing but a combination of multiple SCDs that serve your complex business requirements.

The various types of SCDs are described as follows:

  • Type 1 Dimension mapping (SCD1): This keeps only the current data and does not maintain historical data.

    Note

    Use SCD1 mapping when you do not want to keep the history of the previous data.

  • Type 2 Dimension/Version Number mapping (SCD2): This keeps current as well as historical data in the table. SCD2 allows you to insert new records and changed records using a new column (PM_VERSION_NUMBER) by maintaining the version number in the table to track the changes. We use a new PM_PRIMARYKEY column to maintain the history.

    Note

    Use SCD2 mapping when you want to keep the full history of the dimension data and track the progression of changes using a version number.

  • Type 2 Dimension/Flag mapping: This keeps the current as well as historical data in the table. SCD2 allows you to insert new records and changed records using a new column (PM_CURRENT_FLAG) by maintaining the flag in the table to track the changes. We use a new PRIMARY_KEY column to maintain the history.

    Note

    Use SCD2 mapping when you want to keep the full history of dimension data and track the progression of changes using a flag.

  • Type 2 Dimension/Effective date range mapping: This keeps current as well as historical data in the table. SCD2 allows you to insert new records and changed records using two new columns (PM_BEGIN_DATE and PM_END_DATE) by maintaining the date range in the table to track the changes. We use a new PRIMARY_KEY column to maintain the history.

    Note

    Use SCD2 mapping when you want to keep the full history of dimension data and track the progression of changes using Start Date and End Date.

  • Type 3 Dimension mapping: This keeps the current as well as historical data in the table. We maintain only partial history by adding a new PM_PREV_COLUMN_NAME column; that is, we do not maintain full history.

    Note

    Use SCD3 mapping when you wish to maintain only partial history.

Let's take an example to understand the different SCDs.

Consider that there is a LOCATION column in the EMPLOYEE table and you wish to track the changes in the location of the employees. Consider a record for the 1001 employee ID that is present in your EMPLOYEE dimension table. STEVE was initially working in India and then was shifted to USA. We want to maintain the history in the LOCATION field.

EMPLOYEE_ID

NAME

LOCATION

1001

STEVE

INDIA

Your datawarehouse table should reflect the current status of STEVE. To implement this, we have different types of SCDs.

Take a look at the following table of type SCD1:

PM_PRIMARY_KEY

EMPLOYEE_ID

NAME

LOCATION

100

1001

STEVE

USA

As you can see, INDIA will be replaced with USA, so we end up having only current data, and we lose historical data. Now, if STEVE is again shifted to JAPAN, the LOCATION data will be replaced from USA to JAPAN, as follows:

PM_PRIMARY_KEY

EMPLOYEE_ID

NAME

LOCATION

100

1001

STEVE

JAPAN

The advantage of SCD1 is that we do not consume a lot of space to maintain the data; the disadvantage is we don't have the historical data.

Take a look at the following table of type SCD2, where we have added the version number:

PM_PRIMARYKEY

EMPLOYEE_ID

NAME

LOCATION

PM_VERSION_NUMBER

100

1001

STEVE

INDIA

0

101

1001

STEVE

USA

1

102

1001

STEVE

JAPAN

2

200

1002

MIKE

UK

0

As you can see, we are maintaining the full history by adding new records to maintain the history of the previous records. We add two new columns in the table, that is, PM_PRIMARYKEY to handle the issues of duplicate records in the primary key in the EMPLOYEE_ID (supposed to be the primary key) column, and PM_VERSION_NUMBER to understand the current and historical records.

The following SCD2 table has the flag column added to it:

PM_PRIMARYKEY

EMPLOYEE_ID

NAME

LOCATION

PM_CURRENT_FLAG

100

1001

STEVE

INDIA

0

101

1001

STEVE

USA

1

As you can see, we are maintaining the full history by adding new records to maintain the history of the previous records. We add two new columns in the table, that is, PM_PRIMARYKEY to handle the issues of duplicate records in the primary key in the EMPLOYEE_ID column, and PM_CURRENT_FLAG to understand the current and history record.

Again, if STEVE is shifted, the data would look like this:

PM_PRIMARYKEY

EMPLOYEE_ID

NAME

LOCATION

PM_CURRENT_FLAG

100

1001

STEVE

INDIA

0

101

1001

STEVE

USA

0

102

1001

STEVE

JAPAN

1

The following table of type SCD2 shows you the data range added to it:

PM_PRIMARYKEY

EMPLOYEE_ID

NAME

LOCATION

PM_BEGIN_DATE

PM_END_DATE

100

1001

STEVE

INDIA

01-01-14

31-05-14

101

1001

STEVE

USA

01-06-14

99-99-9999

As you can see, we are maintaining the full history by adding new records to maintain the history of the previous records. We add three new columns in the table, that is, PM_PRIMARYKEY to handle the issues of duplicate records in the primary key in the EMPLOYEE_ID column and PM_BEGIN_DATE and PM_END_DATE to understand the versions in the data.

The advantage of SCD2 is that you have the complete history of the data, which is a must for data warehouses, whereas the disadvantage of SCD2 is that it consumes a lot of space.

Take a look at the following SCD3 table:

PM_PRIMARYKEY

EMPLOYEE_ID

NAME

LOCATION

PM_PREV_LOCATION

100

1001

STEVE

USA

INDIA

As you can see, we are maintaining the history by adding a new column to maintain the history. An optional PM_PRIMARYKEY column can be added to maintain the primary key constraints. We add a new PM_PREV_LOCATION column in the table to store the changes in the data. As you can see, we added a new column to store data as against SCD2, where we added rows to maintain the history.

If STEVE is now shifted to JAPAN, the data changes to:

PM_PRIMARYKEY

EMPLOYEE_ID

NAME

LOCATION

PM_PREV_LOCATION

100

1001

STEVE

JAPAN

USA

As you can see, we lost INDIA from the data warehouse, and that is why we say we are maintaining partial history.

Note

To implement SCD3, decide how many versions of a particular column you wish to maintain. Based on this, the columns will be added in the table.

SCD3 is best when you are not interested in maintaining the complete history but are interested in maintaining only partial history. The drawback of SCD3 is that it doesn't store the full history.

At this point, you should be very clear with the different types of SCD. We need to practically implement these concepts in Informatica PowerCenter. Informatica PowerCenter provides a utility called wizard to implement the SCD. Using this wizard, you can easily implement any SCD. In the upcoming chapters, we will learn how to use the wizard to implement SCD1, SCD2, and SCD3.

Before you proceed to the next section, make sure you have a proper understanding of the transformations in Informatica PowerCenter. You should be clear about the source qualifier, expression, filter, router, lookup, update strategy, and sequence generator transformations. The wizard creates a mapping using all these transformations to implement the SCD functionality.

When we implement SCD, there will be some new records that will need to be loaded into the target table, and there will be some existing records for which we need to maintain the history.

Note

The record that appears for the first time in the table will be referred to as the NEW record, and the record for which we need to maintain the history will be referred to as the CHANGED record. Based on the comparison of the source data with the target data, we will decide which one is the NEW record and which one is the CHANGED record.

To start, we will use a sample file as our source and an Oracle table as our target in order to implement SCDs. Before we implement an SCD, let's talk about the logic that will serve our purpose, and then we will fine-tune the logic for each type of SCD:

  1. Extract all records from the source.
  2. Look up at the target table and cache all the data.
  3. Compare the source data with the target data to flag the NEW and CHANGED records.
  4. Filter the data based on the NEW and CHANGED flags.
  5. Generate the primary key for every new row inserted into the table.
  6. Load the NEW record into the table and update the existing record, if needed.

Based on the specific SCD, the preceding logic will be modified to a certain extent.

SCD1 – I hate history!

To implement SCD1 using wizard, perform the following steps:

  1. In the designer, navigate to Tools | Mapping Designer | Mapping | Wizard | Slowly Changing Dimensions, as shown in the following screenshot:
    SCD1 – I hate history!
  2. A new window will pop up, asking you the name (m_SCD1) of the new SCD mapping. Select Type 1 Dimension - keep most recent values in the target, as we are implementing SCD1. Click on Next as follows:
    SCD1 – I hate history!
  3. The next screen will ask you to select the source. Select a source from the dropdown. All the sources present in your repository will be listed in this drop-down list. We are using EMP_FILE.txt as the source file for our reference. Also, specify the name of the target you wish to create. We will name the target as EMPLOYEE_SCD1 in this book for our reference. Click on Next.
    SCD1 – I hate history!
  4. In the next window, select EMPLOYEE_ID as Logical Key Field. This specifies which column will be used to check for the existence of data in the target. Make sure that the column you use is the Key column of the source. Also, add LOCATION under Fields to compare the changes. This specifies the column for which you wish to maintain the history. Click on Finish.
    SCD1 – I hate history!
  5. The wizard creates a complete mapping in your Mapping Designer Workspace. Make necessary changes to the mapping if required. An example of what your mapping could look like is as follows:
    SCD1 – I hate history!

Before we proceed further, we need to make some points clear:

  • As we have used flat file as a source, the Informatica PowerCenter wizard generates the target as a file as well. We cannot maintain SCD on files, so make sure you change the target type to the database. We will be changing this to the Oracle table as a reference. You can do this in the Target Designer. Drag the target (EMPLOYEE_SCD1) created by the wizard in the Target Designer, double-click to open the properties, and change the database type of Oracle. This will change the type of target from the file to Oracle. Once you modify the target table to the Oracle database, the mapping will look like the following figure:
    SCD1 – I hate history!
  • The wizard creates two instances of the same Oracle target table in the mapping. Load the data from the NEW and CHANGED flows, respectively. Understand clearly that these two structures refer to the same Oracle table EMPLOYEE_SCD1. Even though the name (EMPLOYEE_SCD1 and EMPLOYEE_SCD11) is different in the view, when you double-click on the target instances in the Table tab, you can see Table Name as EMPLOYEE_SCD1.

As we are done with the mapping, it's time to analyze it. It is very important to understand each component of the mapping.

The Informatica PowerCenter SCD1 mapping uses a lookup transformation to look up the data in the target table and uses expression transformation to compare the target data with the source data. Based on the comparison, the expression transformation marks a record as a NEW flag or a CHANGED flag. The mapping is divided into two flows:

  • The FIL_InsertNewRecord filter transformation allows only the NEW record to pass further and filters the records marked as CHANGED from the first flow. It passes new records to UPD_ForceInserts, which inserts these records into the target. The sequence generator generates the primary key for each NEW record.
  • The FIL_UpdateChangedRecord filter transformation allows only the CHANGED record to pass further and filters the records marked as NEW from the second flow. It passes the changed records to UPD_ChangedUpdate, which replaces existing rows in the target to reflect the latest changes.

Let's understand each transformation that is used in the SCD1 mapping:

  • The source qualifier (SQ_EMP_FILE): This extracts the data from the file or table you used as the source in the mapping. It passes data to the downstream transformations, that is, lookup, expression, and filter transformation.
  • Lookup (LKP_GetData): This is used to look up the target table. It caches the existing data from the EMPLOYEE_SCD1 table.

    The EMPLOYEE_ID=IN_EMPLOYEE_ID condition in the Condition tab will compare the data with the source table and the target table. Based on the comparison, it passes the required data to the expression transformation.

  • Expression (EXP_DetectChanges): This receives the data from the upstream transformation, and based on the comparison, it creates two flags, which are NewFlag and ChangedFlag. In our case, we are using the LOCATION field for comparison.

    For every record that comes from a source, if there is no matching record in target, we can flag that record as NewFlag; that is, the EMPLOYEE_ID != EMPLOYEE_ID condition signifies NewFlag. If no matching record is present for EMPLOYEE_ID in the target, it signifies that PM_PRIMARYKEY will not be available. So, the lookup transformation will return NULL for the PM_PRIMARYKEY column.

    For every record that comes from a source, if there is a matching record in the target and if the location from source does not match the location for a particular EMPLOYEE_ID from the target, we can flag that record as ChangedFlag, that is, EMPLOYEE_ID = EMPLOYEE_ID AND LOCATION != PM_PREV_LOCATION:

    • The wizard created the condition for NewFlag as IIF(ISNULL(PM_PRIMARYKEY), TRUE, FALSE)
    • The condition for ChangedFlag is IIF(NOT ISNULL(PM_PRIMARYKEY) AND (DECODE(LOCATION,PM_PREV_LOCATION,1,0)=0), TRUE, FALSE)

    Based on the condition, it passes the data to downstream filter transformations.

  • Filter (FIL_InsertNewRecord): This filters the records that come from an upstream expression transformation and are marked as ChangedFlag; it only allows records with NewFlag to get passed to the UPD_ForceInserts update strategy.
  • Filter (FIL_UpdateChangedRecord): This filters the records that come from an upstream expression transformation and are marked as NewFlag; it only allows records with ChangedFlag to get passed to the UPD_ChangedUpdate update strategy.
  • Update strategy (UPD_ForceInserts): This uses the DD_INSERT condition to insert data into the target, which is EMPLOYEE_SCD1.
  • Update strategy (UPD_ChangedUpdate): This uses the DD_UPDATE condition to overwrite the existing LOCATION field into the EMPLOYEE_SCD11 target instance.
  • Sequence generator (SEQ_GenerateKeys): This generates a sequence of values for each row marked as NewFlag, which is then incrementally loaded into the target by 1. It populates the value into PM_PRIMARYKEY in the EMPLOYEE_SCD1 target instance.
  • Target (EMPLOYEE_SCD1): This is the target table instance that accepts the NewFlag records into the target table.
  • Target (EMPLOYEE_SCD11): This is the target table instance that accepts the ChangedFlag records into the target table.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset