9 Data Warehousing and Business Intelligence Management

Data warehouse and business intelligence management is the seventh Data Management Function in the data management framework shown in Figures 1.3 and 1.4. It is the sixth data management function that interacts with and is influenced by the Data Governance function. Chapter 9 defines the data warehousing and business intelligence management function and explains the concepts and activities involved in data warehousing and business intelligence management.

9.1 Introduction

A Data Warehouse (DW) is a combination of two primary components. The first is an integrated decision support database. The second is the related software programs used to collect, cleanse, transform, and store data from a variety of operational and external sources. Both of these parts combine to support historical, analytical, and business intelligence (BI) requirements. A data warehouse may also include dependent data marts, which are subset copies of a data warehouse database. In its broadest context, a data warehouse includes any data stores or extracts used to support the delivery of data for BI purposes.

An Enterprise Data Warehouse (EDW) is a centralized data warehouse designed to service the business intelligence needs of the entire organization. An EDW adheres to an enterprise data model to ensure consistency of decision support activities across the enterprise.

Data Warehousing is the term used to describe the operational extract, cleansing, transformation, and load processes—and associated control processes—that maintain the data contained within a data warehouse. The data warehousing process focuses on enabling an integrated and historical business context on operational data by enforcing business rules and maintaining appropriate business data relationships. Data Warehousing also includes processes that interact with meta-data repositories.

Data warehousing is a technology solution supporting Business Intelligence (BI). Business Intelligence is a set of business capabilities. BI means many things, including:

  1. Query, analysis, and reporting activity by knowledge workers to monitor and understand the financial and operational health of, and make business decisions about, the enterprise.
  2. Query, analysis, and reporting processes and procedures.
  3. A synonym for the business intelligence environment.
  4. The market segment for business intelligence software tools.
  5. Strategic and operational analytics and reporting on corporate operational data to support business decisions, risk management, and compliance.
  6. A synonym for Decision Support Systems (DSS).

Data Warehousing and Business Intelligence Management (DW-BIM) is the collection, integration, and presentation of data to knowledge workers for the purpose of business analysis and decision-making. DW-BIM is composed of activities supporting all phases of the decision support life cycle: providing context, moving and transforming data from sources to a common target data store, and then providing knowledge workers various means of access, manipulation, and reporting of the integrated target data.

Figure 9.1 outlines the context of Data Warehousing and Business Intelligence Management.

Figure 9.1 DW and BI Management Context Diagram

Objectives for DW-BIM include:

  • Providing integrated storage of required current and historical data, organized by subject areas.
  • Ensuring credible, quality data for all appropriate access capabilities.
  • Ensuring a stable, high-performance, reliable environment for data acquisition, data management, and data access.
  • Providing an easy-to-use, flexible, and comprehensive data access environment.
  • Delivering both content and access to the content in increments appropriate to the organization’s objectives.
  • Leveraging, rather than duplicating, relevant data management component functions such as Reference and Master Data Management, Data Governance (DG), Data Quality (DQ), and Meta-data (MD).
  • Providing an enterprise focal point for data delivery in support of the decisions, policies, procedures, definitions, and standards that arise from DG.
  • Defining, building, and supporting all data stores, data processes, data infrastructure, and data tools that contain integrated, post-transactional, and refined data used for information viewing, analysis, or data request fulfillment.
  • Integrating newly discovered data as a result of BI processes into the DW for further analytics and BI use.

9.2 Concepts and Activities

The purpose of this section is to provide some foundational DW-BIM concepts and definitions, before diving into the details of the specific DW-BIM activities. It presents a quick tour of the history of DW-BIM and an overview of typical DW-BIM components. An explanation of some general DW-BIM terminology follows, and a brief introduction and overview of dimensional modeling and its terminology leads into the activities identified in Figure 9.1.

9.2.1 Data Warehousing—A Brief Retrospective and Historical Tour

In a discussion of any length about data warehousing, two names will invariably come up—Bill Inmon and Ralph Kimball. Each has made significant contributions that have both advanced and shaped the practice of data warehousing. Here is a brief introduction to their major contributions along with some comparisons and contrasts of their approaches.

9.2.1.1 Classic Characteristics of a Data Warehouse—Inmon Version

In the early 1990s, Bill Inmon defined the data warehouse as “a subject oriented, integrated, time variant, and non-volatile collection of summary and detailed historical data used to support the strategic decision-making processes for the corporation.”

These key characteristics give a clear distinction of the nature of the data warehouse data compared to typical operational systems, and still hold largely true as distinctive characteristics of data warehouses.

  • Subject Oriented: Subject orientation of the data warehouse refers to the organization of data along the lines of the major entities of the corporation. The data warehouse is neither functional nor application oriented. Design the data warehouse to meet the data needs of the corporation and not the specific analytical requirements of a particular department.
  • Integrated: Integration refers to the unification and cohesiveness of the data stored in the data warehouse, and covers many aspects, including key structures, encoding and decoding of structures, definitions of the data, naming conventions, and so on. Implicit in this integration is establishing system(s) of record for all data to be included in the scope of the DW. Building the data warehouse is not merely copying data from the operational environment to the data warehouse. Simply consolidating data from multiple sources into a single data source results in a warehouse of data, and not a data warehouse.
  • Time Variant: Time variance of the data warehouse refers to how every record in the data warehouse is accurate relative to a moment in time, and often shows up as an element of time in the key structure. As such, think of the data warehouse as a historical record of snapshots of data, where each snapshot has one moment in time when the record is accurate.
  • Non-Volatile: Non-volatility of the data warehouse refers to the fact that updates to records during normal processing do not occur, and if updates occur at all, they occur on an exception basis. The blending of current operational data with deep, detailed historical data in the data warehouse challenges the non-volatile nature of the data warehouse. Blending is necessary to support both the tactical as well as the strategic decision-making processes. The historical trend and its impacts are covered in section 9.2.4.1, Active Data Warehousing.
  • Summarized and Detail Data: The data in the data warehouse must contain detailed data, representing the atomic level transactions of the enterprise, as well as summarized data. Note: In early versions of the approach, cost and space considerations drove the need for summarized data. Today, performance considerations almost exclusively drive data summarization.
  • Historical: Where operational systems rightfully focus on current-valued data, a hallmark of data warehouses is that they contain a vast amount of historical data (typically 5 to 10 years worth of data). The bulk of this data is typically at a summarized level. The older the data is, the more summarized it usually is.

9.2.1.2 Classic Characteristics of a Data Warehouse—Kimball Version

Ralph Kimball took a different approach, defining a data warehouse simply as “a copy of transaction data specifically structured for query and analysis.” The copy, to differentiate it from the operational system, has a different structure (the dimensional data model) to enable business users to understand and use the data more successfully, and to address DW query performance. Data warehouses always contain more than just transactional data—reference data is necessary to give context to the transactions. However, transactional data is the vast majority of the data in a data warehouse.

Dimensional data models are relational data models. They just do not consistently comply with normalization rules. Dimensional data models reflect business processes more simply than normalized models.

9.2.2 DW / BI Architecture and Components

This section introduces the major components found in most DW / BI environments by providing an overview of the big picture views presented by both Inmon and Kimball. First is the Corporate Information Factory, from Inmon. Second is Kimball’s approach, which he refers to as the “DW Chess Pieces”. Both views and their components are described and contrasted.

9.2.2.1 Inmon’s Corporate Information Factory

Inmon, along with Claudia Imhoff and Ryan Sousa, identified and wrote about components of a corporate data architecture for DW-BIM and called this the “Corporate Information Factory” (CIF). These components appear in tables following Figure 9.2.

Figure 9.2 The Corporate Information Factory

Table 9.1 lists and describes the basic components of the Corporate Information Factory view of DW / BI architecture.

  • Raw Detailed Data: Operational / transactional application data of the enterprise. The raw detailed data provides the source data to be integrated into the Operational Data Store (ODS) and DW components. It may reside in databases or in other storage or file formats.
  • Integration and Transformation: The layer of the architecture where the un-integrated data from the various application source stores is combined / integrated and transformed into the corporate representation in the DW.
  • Reference Data: Reference data was a precursor to what is currently referred to as Master Data Management. The purpose was to allow common storage of, and access to, important and frequently used common data. Establishing focus and shared understanding of this data upstream of the data warehouse simplifies the integration task in the DW.
  • Historical Reference Data: When current-valued reference data is necessary for transactional applications, and at the same time accurate integration and presentation of historical data is critical, it is necessary to capture the reference data that was in place at any point in time. For more discussion of reference data, see Chapter 8, Reference and Master Data Management.
  • Operational Data Store (ODS): The focus of data integration in the ODS is meeting operational and classic operational-reporting needs that require data from multiple operational systems. The main data characteristics distinguishing an ODS from a DW are current-valued vs. DW historical data, and volatile vs. DW non-volatile data. Note: The ODS is an optional portion of the overall CIF architecture, dependent upon specific operational needs, and acknowledged as a component that many businesses omit.
  • Operational Data Mart (Oper-Mart): A data mart focused on tactical decision support. Distinguishing characteristics include current-valued vs. DW historical data, tactical vs. DW strategic analysis, and sourcing of data from an ODS rather than just the DW. The Oper-Mart was a later addition to the CIF architecture.
  • Data Warehouse (DW): A large, comprehensive corporate resource whose primary purpose is to provide a single integration point for corporate data in order to serve management decision-making, strategic analysis, and planning. Data flows into the DW from the application systems and the ODS, and flows out to the data marts, usually in one direction only. Data that needs correction is rejected, corrected at its source, and re-fed through the system.
  • Data Marts (DM): Provide for DSS / information processing and access that is customized and tailored to the needs of a particular department or common analytic need.

Table 9.1 Corporate Information Factory Component Descriptions

Table 9.2 provides context for the reporting scope and purpose of each of the Corporate Information Factory components and some explanatory notes.

  • Applications (Isolated Operational Reports): Limited to data within one application instance.
  • ODS (Integrated Operational Reports): Reports requiring data from multiple source systems. Typically more operational than analytical in orientation, with little historical data.
  • DW (Exploratory Analysis): The complete set of corporate data allows for discovery of new relationships and information. Many BI data mining tools work with flat-file extracts from the DW, which can also offload the processing burden from the DW.
  • Oper-Mart (Tactical Analytics): Analytic reporting based on current values with a tactical focus, employing dimensional data modeling techniques.
  • Data Mart (Analytics, classic management decision support, and Strategic Analytics): Inmon's early focus was on "departmental analysis", which reflected real-world organizational issues such as political and funding expediency. Later work expanded the concept to common analytic needs crossing departmental boundaries.

Table 9.2 Corporate Information Factory Component Reporting Scope and Purpose

Table 9.3 compares and contrasts, from a business and application perspective, the four major components of the Corporate Information Factory: the Applications, ODS, DW, and Data Marts.

Note the following general observations contrasting the DW and Data Marts (the right-hand side of the table) with the applications (the left-hand side):

  • The purpose shifts from execution to analysis.
  • End users are typically decision makers rather than doers (front-line workers).
  • System usage is more ad hoc than the fixed operations of the transactional systems.
  • Response time requirements are relaxed, because strategic decisions allow more time than daily operations.
  • Much more data is involved in each operation, query, or process.

| Characteristic | Application Data | ODS | DW | Data Mart |
|---|---|---|---|---|
| Business Purpose | Specific business function | Corporate integrated operational needs | Central data repository; integration and reuse | Analysis: departmental (Inmon), business process (Kimball), business measures (Wells) |
| System Orientation | Operations (execution) | Operations (reports) | Infrastructure | Informational, analytic (DSS) |
| Target Users | End users: clerical (daily operations) | Line managers: tactical decision makers | Systems: data marts, data mining | Executives: performance / enterprise metrics; senior managers: organization metrics; middle managers: process metrics; knowledge workers: activity |
| How System is Used | Fixed operations | Operational reporting | Stage, store, feed | Ad hoc |
| System Availability | High | Medium | Varies | Relaxed |
| Typical Response Time | Seconds | Seconds to minutes | Longer (batch) | Seconds to hours |
| # Records in an Operation | Limited | Small to medium | Large | Large |
| Amount of Data Per Process | Small | Medium | Large | Large |
| System Development Life Cycle (SDLC) | Classic | Classic | Classic | Modified |

Table 9.3 Corporate Information Factory Components—Business / Application View

Table 9.4 compares and contrasts, from a data perspective, the four major components of the Corporate Information Factory: the applications, ODS, DW, and Data Marts.

Table 9.4, especially the rows for Amount of History and Latency, represents the classic framework, in which the majority of DW processes tolerate higher latency and often run as overnight batch processing. The combination of continued business pressure for more data delivered faster and improvements in the underlying technology is blurring these lines and requiring advances in architectural design and approach. These topics are covered briefly in Section 9.2.4.1, Active Data Warehousing; considered an advanced topic, active data warehousing is not presented here as a separate architectural alternative.

| Characteristic | Application | ODS | DW | Data Mart |
|---|---|---|---|---|
| Orientation | Functional | Subject | Subject | Limited subject |
| View | Application | Corporate (operational) | Corporate (historical) | Focused analysis |
| Integration | Not integrated; application specific | Integrated corporate data | Integrated corporate data | Integrated subset |
| Volatility | High: Create / Read / Update / Destroy (CRUD) | Volatile | Non-volatile | Non-volatile |
| Time | Current value | Current value | Time variant | Time variant |
| Detail Level | Detail only | Detail only | Detail + summary | Detail + summary |
| Amount of History | 30 to 180 days | 30 to 180 days | 5-10 years | 1-5 years |
| Latency | Real time to near real time (NRT) | NRT | > 24 hours | 1 day to 1 month |
| Normalized? | Yes | Yes | Yes | No |
| Modeling | Relational | Relational | Relational | Dimensional |

Table 9.4 Corporate Information Factory Components—Data View

Note the following general observations contrasting the data perspective of the DW and Data Marts (the right-hand side of the table) with that of the applications (the left-hand side):

  • Data has a subject rather than a functional orientation.
  • Data is integrated rather than stove-piped or siloed.
  • Data is time-variant rather than current-valued only.
  • Data has higher latency.
  • Significantly more history is available.

9.2.2.2 Kimball’s Business Development Lifecycle and DW Chess Pieces

Ralph Kimball calls his approach the Business Dimensional Lifecycle, although it is still commonly referred to as the Kimball Approach. From his Design Tip #49: “We chose the Business Dimensional Lifecycle label instead, because it reinforced our core tenets about successful data warehousing based on our collective experiences since the mid-1980s.”

The basis of the Business Dimensional Lifecycle is three tenets:

  • Business Focus: Both immediate business requirements and more long-term broad data integration and consistency.
  • Atomic Dimensional Data Models: Both for ease of business user understanding and query performance.
  • Iterative Evolution Management: Manage changes and enhancements to the data warehouse as individual, finite projects, even though there never is an end to the number of these projects.

The Business Dimensional Lifecycle advocates using conformed dimensions and facts design. The conformation process enforces an enterprise taxonomy and consistent business rules so that the parts of the data warehouse become re-usable components that are already integrated.

Figure 9.3 is a representation of what Kimball refers to as Data Warehouse Chess Pieces (Adapted from figures in The Data Warehouse Toolkit, 2nd Edition, Ralph Kimball and Margy Ross, John Wiley & Sons, 2002). Note that Kimball’s use of the term “Data Warehouse” has been more inclusive and expansive than that of Inmon. In the diagram below, Kimball uses the term Data Warehouse to encompass everything in both the data staging and data presentation areas.

Figure 9.3 Kimball’s Data Warehouse Chess Pieces

Table 9.5 describes the basic components of the Kimball’s Data Warehouse Chess Pieces view of DW / BI architecture and notes how these components map to CIF components.

  • Operational Source Systems: The operational / transactional applications of the enterprise. These provide the source data to be integrated into the ODS and DW components. Equivalent to the application systems in the CIF diagram.
  • Data Staging Area: Kimball artfully uses the analogy of a "kitchen" to describe this area as the place where data is prepared behind the scenes for presentation. He refers to it as the comprehensive set of all storage and ETL processes that stand between the source systems and the data presentation area. The key architectural difference is that Kimball's focus has always been on the efficient end-delivery of the analytical data, a scope smaller than Inmon's corporate management of data; with that scope, the data staging area becomes a potentially eclectic set of processes needed to integrate and transform data for presentation. It is similar to a combination of two CIF components, Integration and Transformation plus the DW. Note: In recent years, Kimball has acknowledged that an enterprise DW can fit into the architecture inside his Data Staging Area.
  • Data Presentation Area: Similar to the Data Marts in the CIF picture, with the key architectural difference being the integrating paradigm of a "DW Bus", that is, shared or conformed dimensions unifying the multiple data marts.
  • Data Access Tools: Focus on the needs and requirements of the end customers / consumers of the data has been a hallmark of Kimball's approach. These needs translate into criteria for selecting, from a broad range of data access tools, the right tools for the right tasks. In the CIF model, the access tools are outside of the DW architecture.
Table 9.5 Kimball’s DW Chess Pieces—Component Descriptions

9.2.3 Tactical, Strategic and Operational BI

Tactical BI is the application of BI tools to analyze business trends, either by comparing a metric to the same metric from a previous month or year, or by analyzing historical data to discover trends that need attention. Use Tactical BI to support short-term business decisions.

Strategic BI has classically involved providing metrics to executives, often in conjunction with some formal method of business performance management, to help them determine if the corporation is on target for meeting its goals. Use Strategic BI to support long-term corporate goals and objectives.

Operational BI provides BI to the front lines of the business, where analytical capabilities guide operational decisions. Use Operational BI to manage and optimize business operations. Operational BI was the last of these three approaches to evolve in the industry. Operational BI entails the coupling of BI applications with operational functions and processes, with very low tolerance for latency (near real-time data capture and data delivery). Therefore, additional architectural approaches, such as service-oriented architecture (SOA), become necessary to support operational BI fully. Some of these approaches are discussed in Section 9.2.4.1, Active Data Warehousing.

9.2.4 Types of Data Warehousing

The three major types of data warehousing are described in the following sections.

9.2.4.1 Active Data Warehousing

Data Warehouses serving tactical and strategic BI have existed for many years, often with a daily loading frequency serviced by a nightly batch window. These architectures were very dependent upon one of Inmon's original hallmarks of data in the data warehouse: non-volatility.

With the onset of Operational BI (and other general business requirements) pushing for lower latency and more integration of real-time or near-real-time data into the data warehouse, new architectural approaches are emerging to deal with the inclusion of volatile data. A common application of operational BI is automated banking machine (ABM) data provisioning: when a customer makes a banking transaction, historical balances and the new balances resulting from immediate banking actions must be presented to that customer in real time. Full treatment of the new approaches for this kind of data provisioning is beyond the scope of this introduction, but it will suffice to introduce two of the key design concepts required: isolation of change, and alternatives to batch ETL.

The impact of the changes from new volatile data must be isolated from the bulk of the historical, non-volatile DW data. Typical architectural approaches for isolation include a combination of building partitions and using union queries for the different partitions, when necessary.
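
To make the isolation idea concrete, here is a minimal sketch using SQLite from Python (the account-balance tables, columns, and values are purely illustrative, not a prescribed design): volatile, trickle-fed rows land in their own table, and a union view presents the combined picture without touching the non-volatile history.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Non-volatile historical partition, loaded by the nightly batch process.
cur.execute("CREATE TABLE account_balance_hist (account_id INTEGER, balance REAL, as_of TEXT)")
# Volatile partition holding trickle-fed, near-real-time rows.
cur.execute("CREATE TABLE account_balance_today (account_id INTEGER, balance REAL, as_of TEXT)")

# A union view isolates change: consumers query one object, while the
# volatile rows never update the historical partition directly.
cur.execute("""
    CREATE VIEW account_balance AS
    SELECT account_id, balance, as_of FROM account_balance_hist
    UNION ALL
    SELECT account_id, balance, as_of FROM account_balance_today
""")

cur.execute("INSERT INTO account_balance_hist VALUES (1, 500.0, '2023-01-31')")
cur.execute("INSERT INTO account_balance_today VALUES (1, 450.0, '2023-02-01')")

for row in cur.execute("SELECT * FROM account_balance ORDER BY as_of"):
    print(row)
```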

Many alternatives to batch ETL handle the ever-shorter latency requirements for data availability in the DW; some of these include trickle feeds, pipelining, and service-oriented architecture (SOA), in which data services are designed and maintained.

9.2.4.2 Multi-dimensional Analysis – OLAP

Online Analytical Processing (OLAP) refers to an approach to providing fast performance for multi-dimensional analytic queries. The term OLAP originated, in part, to make a clear distinction from OLTP, Online Transactional Processing. The typical output of OLAP queries is in a matrix format: the dimensions form the rows and columns of the matrix, and the factors, or measures, are the values inside the matrix. Conceptually, this is illustrated as a cube. Multi-dimensional analysis with cubes is particularly useful where there are well-known ways analysts want to look at summaries of data.

A common application is financial analysis, where analysts want to repeatedly traverse known hierarchies to analyze data; for example, date (such as Year, Quarter, Month, Week, Day), organization (such as Region, Country, Business Unit, Department), and product hierarchy (such as Product Category, Product Line, Product).
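
As a simple illustration of this matrix output, the following sketch (plain Python with made-up sales facts) aggregates a measure at the Year-by-Region intersection, which is the kind of cross-tab an OLAP cube serves along pre-defined hierarchies.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, sales_amount). Year and region are
# the dimensions forming rows and columns; sales_amount is the measure.
facts = [
    (2022, "North", 120.0), (2022, "South", 80.0),
    (2023, "North", 150.0), (2023, "South", 95.0),
    (2023, "North", 30.0),
]

# Aggregate the measure at the (year, region) grain.
cube = defaultdict(float)
for year, region, amount in facts:
    cube[(year, region)] += amount

years = sorted({y for y, _ in cube})
regions = sorted({r for _, r in cube})

# Print the matrix: dimensions on the axes, aggregated measure in the cells.
print("Year  " + "  ".join(f"{r:>8}" for r in regions))
for y in years:
    print(f"{y}  " + "  ".join(f"{cube.get((y, r), 0.0):8.1f}" for r in regions))
```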

9.2.4.3 ROLAP, MOLAP, HOLAP and DOLAP

Four implementation approaches support Online Analytical Processing. Their names relate to the respective underlying database implementation approach: relational, multi-dimensional, hybrid, and database.

  • Relational Online Analytical Processing (ROLAP): ROLAP supports OLAP by using techniques that implement multi-dimensionality in the two-dimensional tables of relational database management systems (RDBMS). Star schema joins are a common database design technique used in ROLAP environments.
  • Multi-dimensional Online Analytical Processing (MOLAP): MOLAP supports OLAP by using proprietary and specialized multi-dimensional database technology.
  • Hybrid Online Analytical Processing (HOLAP): This is simply a combination of ROLAP and MOLAP. HOLAP implementations allow part of the data to be stored in MOLAP form and another part of the data to be stored in ROLAP. Implementations vary on the control a designer has to vary the mix of partitioning.
  • Database Online Analytical Processing (DOLAP): A virtual OLAP cube is available as a special proprietary function of a classic relational database.

9.2.5 Dimensional Data Modeling Concepts and Terminology

Dimensional data modeling is the preferred modeling technique for designing data marts. Dr. Ralph Kimball pioneered many of the terms and techniques of dimensional data modeling. The purpose of this section is to introduce the concepts and terms.

Kimball’s focus has been on end-user presentation of the data, and dimensional data modeling, in general, focuses on making it simple for the end user to understand and access the data. Inherent in the design technique is a conscious trade-off: structures that are easy for end users to understand and use are preferred, at the cost of more implementation work for the developers. This is one reason the majority of data mart development work ends up being in ETL processing.

Table 9.6 contrasts the typical differences in the characteristics of systems built from relational modeling for transactional applications versus those built with dimensional data modeling for data marts.

| Characteristic | Entity Relationship Modeling (Transactional Applications) | Dimensional Data Modeling (Data Marts) |
|---|---|---|
| Typical System | Operational | Informational, analytic (BI) |
| # Records in an Operation | A few | Many (millions +) |
| Typical Response Time | Seconds | Seconds, minutes to hours |
| Target Users | Clerical (front-line staff) | Management and analysts |
| Orientation | Application: run the business | Analysis: analyze the business |
| Availability | High | Relaxed |
| Amount of Data Per Process | Small | Large |
| Time Horizon for Data | 60-180 days | One to many years |
| How System is Used | Fixed operations | Fixed and ad hoc |

Table 9.6 System Characteristics for Transactional Applications and Data Marts

Dimensional data modeling is a subset of entity relationship data modeling, and has the basic building blocks of entities, attributes, and relationships. The entities come in two basic types: facts, which provide the measurements; and dimensions, which provide the context. Relationships in simple dimensional modeling are constrained to all go through the fact table, and all dimension-to-fact relationships are one-to-many (1:M).

9.2.5.1 Fact Tables

Fact tables represent and contain important business measures. The term “fact” is overloaded: “fact tables” (entities) contain one or more “facts” (attributes representing measures). Each row of a fact table corresponds to a particular measurement; measurements are numeric, such as amounts, quantities, or counts. Some measurements are the results of algorithms, so meta-data becomes critical to proper understanding and usage. Fact tables take up the most space in the database (90% is a reasonable rule of thumb) and tend to have large numbers of rows.

Fact tables express or resolve many-to-many relationships between the dimensions. Access to fact tables is usually through the dimension tables.

Fact tables often have a number of control columns that express when the row was loaded, by what program, or indicators for most current record, or other statuses. These fields help the programmers, the operators and the super-users navigate and validate the data.

9.2.5.2 Dimension Tables

Dimension tables, or dimensions for short, represent the important objects of the business and contain textual descriptions of the business. Dimensions serve as the primary source for “query by” or “report by” constraints. They act as the entry points or links into the fact tables, and their contents provide report groupings and report labels. Dimensions are typically highly de-normalized and account for about 10% of the total data, as a rule of thumb. The depth and quality of the detailed design of dimensions determine the analytic usefulness of the resulting systems.

All designs will likely have a Date dimension and an Organization or Party dimension at a minimum. Other dimensions depend on the type of analysis that supports the data in the fact table.

Dimension tables typically have a small number of rows and large number of columns. Main contents of a dimension table are:

  • Surrogate or non-surrogate key.
  • The primary key representing what is used to link to other tables in the DW.
  • Descriptive elements, including codes, descriptions, names, statuses, and so on.
  • Any hierarchy information, including multiple hierarchies and often ‘types’ breakdown.
  • The business key that the business user uses to identify a unique row.
  • The source system key identification fields for traceability.
  • Control fields similar to the fact table control fields but geared to the type of dimension history capture that is designed, such as Types 1-3, 4 and 6 described below.

Dimensions must have unique identifiers for each row. The two main approaches to identifying keys for dimension tables are surrogate keys and natural keys.

9.2.5.2.1 Surrogate Keys

Kimball’s approach gives each dimension a single primary key, populated by a number unrelated to the actual data. The number is a “surrogate key” or “anonymous key”, and can be either a sequential number, or a truly random number. The advantages of using surrogate keys include:

  • Performance: Numeric fields sometimes search faster than other types of fields.
  • Isolation: It is a buffer from business key field changes. The surrogate key may not need changing if a field type or length changes on the source system.
  • Integration: Enables combinations of data from different sources. The identifying keys on the source systems usually do not have the same structure as those of other systems.
  • Enhancement: Values, such as “Unknown” or “Not Applicable”, have their own specific key value in addition to all of the keys for valid rows.
  • Interoperability: Some data access libraries and GUI functions work better with surrogate keys, because they do not need additional knowledge about the underlying system to function properly.
  • Versioning: Enables multiple instances of the same dimension value, which is necessary for tracking changes over time.
  • De-bugging: Supports load issue analysis, and re-run capability.

In exchange for these advantages, there is extra ETL processing necessary to map the numeric key values to source key values, and maintain the mapping tables.
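
A minimal sketch of that mapping work, in plain Python with hypothetical source-system keys, shows sequential surrogate assignment and the source-to-surrogate lookup that the ETL process must maintain.

```python
# Mapping from (source system, natural key) to surrogate key; in a real ETL
# job this would be a persistent lookup table in the warehouse.
key_map = {}
next_surrogate = 1

# Reserve a specific key value for the "Unknown" member (see the Enhancement bullet).
UNKNOWN_KEY = 0

def get_surrogate(source_system, natural_key):
    """Return the existing surrogate key or assign the next sequential one."""
    global next_surrogate
    lookup = (source_system, natural_key)
    if lookup not in key_map:
        key_map[lookup] = next_surrogate
        next_surrogate += 1
    return key_map[lookup]

# Two source systems using different customer identifiers map into one dimension.
print(get_surrogate("CRM", "CUST-0042"))    # 1
print(get_surrogate("BILLING", 98765))      # 2
print(get_surrogate("CRM", "CUST-0042"))    # 1 again: same surrogate key
```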

9.2.5.2.2 Natural Keys

For some systems, it is preferable not to create additional key fields, using, instead, the data that is already present to identify unique rows. The advantages of using natural keys include:

  • Lower overhead: The key fields are already present, not requiring any additional modeling to create or processing to populate.
  • Ease of change: In RDBMS where the concept of a domain exists, it is easy to make global changes due to changes on the source system.
  • Performance advantage: Using the values in the unique keys may eliminate some joins entirely, improving performance.
  • Data lineage: Easier to track across systems, especially where the data travels through more than two systems.

In exchange for these advantages, there can be a need to identify multiple fields in each query as part of the join, and possibly complex values for those non-numeric fields. Also, in some RDBMS, joins using long text strings may perform worse than those using numbers.

9.2.5.3 Dimension Attribute Types

The three main types of dimension attributes are differentiated by the need to retain historical copies. They are creatively named Type 1, Type 2 (and 2a), and Type 3. There are two other types that do not appear very often, also creatively named Type 4 and Type 6 (1+2+3). Types 1 through 3 can co-exist within the same table; the action taken during an update depends on the types of the fields being updated.

9.2.5.3.1 Type 1 Overwrite

Type 1 dimension attributes have no need for any historical records at all. The only interest is in the current value, so any update completely overwrites the prior value in the field in that row. An example of a Type 1 attribute is hair color: when an update occurs, there is no need to retain the prior value.

9.2.5.3.2 Type 2 New Row

Type 2 dimension attributes need all historical records. Every time one of these Type 2 fields changes, a new row with the current information is appended to the table, and the previously current row’s expiration date field is updated to expire it. An example is Billing Address. When the Billing Address changes, the row with the old address expires and a new row with the current Billing Address information is appended.

Note that managing Type 2 attributes requires that the table’s key be able to handle multiple instances of the same natural key, either through the use of surrogate keys, by the addition of an index value to the primary key, or the addition of a date value (effective, expiration, insert, and so on) to the primary key.
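
The following is a minimal sketch of the Type 2 mechanics in plain Python (the surrogate keys, dates, and Billing Address attribute are illustrative only): the change expires the current row and appends a new current row, preserving history.

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # open-ended expiration date for the current row

# Customer dimension with one current row (surrogate key 1).
customer_dim = [
    {"cust_sk": 1, "cust_id": "C100", "billing_address": "1 Elm St",
     "eff_date": date(2022, 1, 1), "exp_date": HIGH_DATE, "current": True},
]

def apply_type2_change(dim, cust_id, new_address, change_date, next_sk):
    """Expire the current row for cust_id and append a new current row."""
    for row in dim:
        if row["cust_id"] == cust_id and row["current"]:
            row["exp_date"] = change_date
            row["current"] = False
    dim.append({"cust_sk": next_sk, "cust_id": cust_id,
                "billing_address": new_address,
                "eff_date": change_date, "exp_date": HIGH_DATE, "current": True})

apply_type2_change(customer_dim, "C100", "9 Oak Ave", date(2023, 6, 1), next_sk=2)
for row in customer_dim:
    print(row)
```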

9.2.5.3.3 Type 3 New Column

Type 3 dimension attributes need only a selected, known portion of history. Multiple fields in the same row contain the historical values. When an update occurs, the current value is moved to the next appropriate field, and the last, no longer necessary, value drops off. An example is a credit score, where only the original score when the account opened, the most current score, and the immediate prior score are valuable. An update would move the current score to the prior score.

Another example is monthly bill totals. There can be 12 fields, named Month01, Month02, etc., or January, February, etc. If the former, then the current month value updates Month01 and all other values move down one field. If the latter, then when the proper month is updated, the user knows that the month after the current month contains last year’s data.

One useful purpose of Type 3 is for attribute value migrations. For example, a company decides to reorganize its product hierarchy, but wants to see sales figures for both the old hierarchy and the new for a year, to make sure that all sales are being recorded appropriately. Having both the old and the new available for a period of time allows this transition in the data.

9.2.5.3.4 Type 4 New Table

Type 4 dimension attributes initiate a move of the expired row into a ‘history’ table, and the row in the ‘current’ table is updated with the current information. An example would be a Supplier table, where expired Supplier rows roll off into the history table after an update, so that the main dimension table only contains current Supplier rows. This design is sometimes called a Type 2a dimension.

Retrievals involving timelines are more complex in a Type 4 design, since current and history tables need to be joined before joining with the fact table. Therefore, it is optimal when the vast majority of access uses current dimension data and the historical table is maintained more for audit purposes than for active retrievals.

9.2.5.3.5 Type 6 1+2+3

Type 6 treats the dimension table as a Type 2, where any change to any value creates a new row, but the key value (surrogate or natural) does not change. One way to implement Type 6 is to add three fields to each row: effective date, expiration date, and a current row indicator. Queries looking for data as of a particular point in time check whether the desired date falls between the effective and expiration dates. Queries looking for only current data add a filter on the current row indicator. The drawback of this approach is that it requires additional knowledge to create queries that correctly ask for the proper row by period value or indicator.

Another way to implement Type 6 is to add an index field instead of a current row indicator, where the current row always has index value 0. When an update occurs, the new row takes index value zero and all prior rows add 1 to their index values, moving them down the line. Queries looking for the current values set the filter for index value equal to zero, and queries looking for prior times still use the effective and expiration dates. This technique has the drawback that all fact rows link automatically to index version 0 (the current row); queries joining to the fact table will not find any prior values of the dimension unless the dimension's effective and expiration dates are included in the query.
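
Assuming the first implementation described above (effective date, expiration date, and current row indicator), point-in-time and current-only lookups reduce to simple filters, as in this plain-Python sketch with a hypothetical product dimension.

```python
from datetime import date

# Two versions of the same product; the key value (prod_key) does not change.
product_dim = [
    {"prod_key": 10, "name": "Widget", "category": "Hardware",
     "eff_date": date(2022, 1, 1), "exp_date": date(2023, 3, 31), "current": False},
    {"prod_key": 10, "name": "Widget", "category": "Tools",
     "eff_date": date(2023, 4, 1), "exp_date": date(9999, 12, 31), "current": True},
]

def as_of(dim, key, when):
    """Return the row version in effect on a given date."""
    return next(r for r in dim
                if r["prod_key"] == key and r["eff_date"] <= when <= r["exp_date"])

def current(dim, key):
    """Return the current row using the current-row indicator."""
    return next(r for r in dim if r["prod_key"] == key and r["current"])

print(as_of(product_dim, 10, date(2022, 6, 15))["category"])  # Hardware
print(current(product_dim, 10)["category"])                   # Tools
```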

9.2.5.4 Star Schema

A star schema is the representation of a dimensional data model with a single fact table in the center connecting to a number of surrounding dimension tables, as shown in Figure 9.4. It is also referred to as a star join schema, emphasizing that the joins from the central fact table are via single primary keys to each of the surrounding dimension tables. The central fact table has a compound key composed of the dimension keys.
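
The following sketch builds a tiny star schema in SQLite from Python (the Sales fact and the Date and Product dimensions, along with their columns and rows, are invented for illustration) and runs a star-join query that groups the fact table's measures by dimension attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables: single surrogate primary key plus descriptive attributes.
cur.execute("CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER)")
cur.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT)")

# Fact table: compound key made of the dimension keys, plus numeric measures.
cur.execute("""
    CREATE TABLE fact_sales (
        date_key INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity INTEGER,
        sales_amount REAL,
        PRIMARY KEY (date_key, product_key)
    )
""")

cur.executemany("INSERT INTO dim_date VALUES (?, ?, ?, ?)",
                [(20230101, "2023-01-01", "2023-01", 2023),
                 (20230201, "2023-02-01", "2023-02", 2023)])
cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Widget", "Hardware"), (2, "Gadget", "Electronics")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(20230101, 1, 10, 100.0), (20230101, 2, 5, 250.0),
                 (20230201, 1, 7, 70.0)])

# Star join: constrain and group by dimension attributes, aggregate the measures.
query = """
    SELECT d.month, p.category, SUM(f.sales_amount) AS total_sales
    FROM fact_sales f
    JOIN dim_date d ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.month, p.category
    ORDER BY d.month, p.category
"""
for row in cur.execute(query):
    print(row)
```

Note how every join goes through the fact table via a single dimension key, which is exactly what the term star join emphasizes.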

9.2.5.5 Snowflaking

Snowflaking is the term given to normalizing the flat, single-table, dimensional structure in a star schema into the respective component hierarchical or network structures. Kimball's design methods discourage snowflaking on two main principles: 1) it dilutes the simplicity and end-user understandability of the star schema, and 2) the space savings are typically minimal.

Three types of snowflake tables are recognized: true snowflakes, outriggers, and bridges:

  • Snowflake tables: Formed when a hierarchy is resolved into level tables. For example, a daily Period dimension table resolves into a detail table for Date and another table for Month or Year that is linked directly to the Date table.
  • Outrigger tables: Formed when an attribute in one dimension table links to rows in another dimension table. For example, a date field in one dimension (such as Employee Hire Date) links to the Period dimension table to facilitate queries that sort Employees by Hire Date fiscal year.
  • Bridge tables: Formed in two situations. The first is when a many-to-many relationship exists between two dimensions that is not, or cannot be, resolved through a fact table relationship; one example is a bank account with shared owners, where the bridge table captures the list of owners in an ‘owner group’ (see the sketch after this list). The second is when normalizing variable-depth or ragged hierarchies; the bridge table captures each parent-child relationship in the hierarchy, enabling more efficient traversal.
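
As a rough illustration of the first bridge-table situation, this SQLite-from-Python sketch (the account, owner, and owner-group tables are hypothetical) resolves the many-to-many relationship between accounts and their shared owners through an owner-group bridge.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE dim_account (account_key INTEGER PRIMARY KEY, owner_group_key INTEGER)")
cur.execute("CREATE TABLE dim_owner (owner_key INTEGER PRIMARY KEY, owner_name TEXT)")
# The bridge captures owner-group membership (the many-to-many link).
cur.execute("CREATE TABLE bridge_owner_group (owner_group_key INTEGER, owner_key INTEGER)")

cur.execute("INSERT INTO dim_account VALUES (100, 1)")           # a joint account
cur.executemany("INSERT INTO dim_owner VALUES (?, ?)", [(1, "Pat"), (2, "Sam")])
cur.executemany("INSERT INTO bridge_owner_group VALUES (?, ?)", [(1, 1), (1, 2)])

# List all owners of the account by walking account -> owner group -> owners.
query = """
    SELECT a.account_key, o.owner_name
    FROM dim_account a
    JOIN bridge_owner_group b ON a.owner_group_key = b.owner_group_key
    JOIN dim_owner o ON b.owner_key = o.owner_key
"""
for row in cur.execute(query):
    print(row)
```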

9.2.5.6 Grain

Kimball coined the term grain to stand for the meaning or description of a single row of data in a fact table; put another way, the grain is the atomic level of the data for a transaction. Defining the grain of a fact table is one of the key steps in Kimball's dimensional design method. For example, if a fact table holds a store's transactions for a month, the grain defines the limits of the data: the fact table will not include data for last year.

Figure 9.4 Example Star Schema

9.2.5.7 Conformed Dimensions

Conformed dimensions are the common or shared dimensions across multiple data marts in Kimball’s design method. More precisely, Kimball defines dimensions to be conformed when they either match in terms of both data element names and their respective values, or contain a strict subset. The practical importance is that the row headers from any answer sets from conformed dimensions must be able to match exactly.

For example, think of multiple data marts or fact tables, all linking directly to the same dimension table, or a direct copy of that dimension table. Updates to that dimension table automatically show in all queries for those data marts.

Reuse of conformed dimensions in other star schemas allows for modular development of the DW; stars can be clipped together through conformed dimensions as the design grows. For example, a DW that starts with a fact table for the Accounts Payable department can be clipped onto a vendor performance fact table for the Supply department through a Product dimension they share in common. Ultimately, queries can walk across subject areas, unifying data access to the DW across the entire enterprise.

9.2.5.8 Conformed Facts

Conformed facts use standardized definitions of terms across individual marts. Different business users may use the same term in different ways. Does “customer additions” refer to “gross additions” or “adjusted additions”? Does “orders processed” refer to the entire order, or to the sum of the individual line items?

Developers need to be keenly aware of things that share the same name but are different concepts across organizations, and, conversely, things that have different names but are actually the same concept across organizations.

9.2.5.9 DW-Bus Architecture and Bus Matrix

The term bus came from Kimball’s electrical engineering background, where a bus was something providing common power to a number of electrical components. Building on that analogy, the DW-bus architecture of conformed dimensions is what allows multiple data marts to co-exist and share by plugging into a bus of shared or conformed dimensions.

The DW-bus matrix is a tabular way of showing the intersection of data marts, data processes, or data subject areas with the shared conformed dimensions. Table 9.7 shows a sample tabular representation of bus architecture. The opportunity for conformed dimensions appears where a data mart is marked as using multiple dimensions (the row). The DW-bus appears where multiple data marts use the same dimensions (the column).

The DW-bus matrix is a very effective communication and planning tool, and its unifying concept is one of Kimball's most valuable contributions to DW-BIM practice. The matrix becomes a key living design document for the DW-BIM program: as new design pieces are added, the existing dimensions and facts, complete with their sources, update logic, and schedules, need to be reviewed for possible re-use. (A simple programmatic reading of the matrix follows Table 9.7.)

| Business Process (Marts) | Date | Product | Store | Vendor | Warehouse |
|---|---|---|---|---|---|
| Sales | X | X | X | | |
| Inventory | X | X | X | X | X |
| Orders | X | X | | X | |
Table 9.7 DW-Bus Matrix Example
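
A simple programmatic reading of the same matrix (plain Python; the mart-to-dimension assignments mirror Table 9.7 above) makes both directions explicit: each row lists a mart's dimensions, and any dimension column marked by more than one mart must be conformed.

```python
from collections import defaultdict

# The bus matrix as a mapping of marts (rows) to the dimensions they use.
bus_matrix = {
    "Sales":     {"Date", "Product", "Store"},
    "Inventory": {"Date", "Product", "Store", "Vendor", "Warehouse"},
    "Orders":    {"Date", "Product", "Vendor"},
}

# Reading down a column: dimensions used by more than one mart must be conformed.
marts_by_dimension = defaultdict(set)
for mart, dims in bus_matrix.items():
    for dim in dims:
        marts_by_dimension[dim].add(mart)

for dim, marts in sorted(marts_by_dimension.items()):
    if len(marts) > 1:
        print(f"{dim}: shared by {sorted(marts)} -> conformed dimension")
```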

9.3 DW-BIM Activities

Data warehousing is concerned primarily with the part of the DW-BIM lifecycle from data source to a common data store across all relevant departments; in short, with the data content. BIM is concerned with the portion of the lifecycle from the common data store to its use by the targeted audience; in short, with the data presentation.

DW and BIM naturally intertwine, as no DW can deliver value to the organization without some means of providing access to the collected data along with analytic and reporting capabilities. In turn, the effectiveness of a BIM capability is directly dependent upon the provision of data from the DW that is timely, relevant, integrated, and has other quality factors controlled for and documented as required.

DW-BIM activities overlap with many of the data management functions already covered in the Guide. The purpose of the DW-BIM Activities section is to articulate the activities involved in DW-BIM in a practical, implementation-based context. It includes references to other data management functions defined elsewhere in the Guide, and provides practical insights into the various methods, tools, and techniques that are specific to the DW-BIM function.

9.3.1 Understand Business Intelligence Information Needs

Starting with and keeping a consistent business focus throughout the DW-BIM lifecycle is essential to success. Looking at the value chain of the enterprise is a good way to understand the business context. The specific business processes in a company’s value chain provide a natural business-oriented context in which to frame areas of analysis. See Section 4.2.2.3 for further information on Value Chains and Figure 4.4 for a good example.

Gathering requirements for DW-BIM projects has both similarities to and differences from gathering requirements for other projects in typical IT development. In DW-BIM projects, it is generally more important to understand the broader business context of the business area targeted, as reporting is generalized and exploratory. The broader business context is in stark contrast to operational systems, where the development process defines, up-front, the precise specific details and requirements of operations and reports.

Analysis for DW-BIM projects is more ad-hoc by nature and involves asking questions, which, in turn, lead to new or different questions. Of course, querying will be limited by the nature and quality of the data available. And although the exact specifications of all reports will not be known, what can be known is the context for the questions—the ways the knowledge workers will most likely want to slice-and-dice the data.

Identify and scope the business area, then identify and interview the appropriate business people. Ask what they do and why. Capture specific questions they want to ask. Document how they distinguish between and categorize important aspects of the information. Ask the business people how they track and define success. Where possible, define and capture key performance metrics and formulae.

Capturing the actual business vocabulary and terminology is a key to success. The BI requirements-gathering activity is a great opportunity to partner with the Meta-data Management function (see Chapter 11), as it is critical to have a business-understandable context of data end-to-end; from the initial data sources, through all the transformations, to the end presentation. In summary, a DW-BIM project requirements write-up should frame the whole context of the business areas and / or processes that are in scope.

Document the business context, then explore the details of the actual source data. Typically, the ETL portion can consume 67% of a DW-BIM project’s dollars and time. Data profiling is very helpful, and collaborating with the Data Quality function is essential (see Chapter 12). Evaluation of the state of the source data leads to more accurate up-front estimates for feasibility and scope of effort. The evaluation is also important for setting appropriate expectations.

Note that the DW is often the first place where the pain of poor quality data in source systems and / or data entry functions becomes apparent. Collaborating with the Data Governance function (see Chapter 3) is critical, as business input on how to handle all the unexpected variations that inevitably occur in the actual data is essential.

Creating an executive summary of the identified business intelligence needs is a best practice. The executive summary should include an overview of the business context, have a list of sample questions, provide commentary on the existing data quality and level-of-effort for cleansing and integration, and describe related organizations and business functions. It may also include a mock-up drawing for navigation of the solution showing the pathways for query and reporting in the selected presentation product. Review the executive summary with the business for prioritization in the DW-BIM program.

When starting a DW-BIM program, a good way to decide where to start is using a simple assessment of business impact and technical feasibility. Technical feasibility will take into consideration things like complexity, availability and state of the data, and the availability of subject matter experts. Projects that have high business impact and high technical feasibility are good candidates for starting.

Most importantly, assess the necessary business support, considering three critical success factors:

  • Business Sponsorship: Is there appropriate executive sponsorship, i.e., an identified and engaged steering committee and commensurate funding? DW-BIM projects require strong executive sponsorship.
  • Business Goals and Scope: Is there a clearly identified business need, purpose, and scope for the effort?
  • Business Resources: Is there a commitment by business management to the availability and engagement of the appropriate business subject matter experts? The lack of commitment is a common point of failure and a good enough reason to halt a DW-BIM project until commitment is confirmed.

9.3.2 Define and Maintain the DW-BI Architecture

Chapter 4, on Data Architecture Management, provided excellent coverage of both data architecture in general, as well as many of the specific components of the DW-BIM architecture, including enterprise data models (subject area, conceptual, and logical), data technology architecture, and data integration architecture. Section 9.2.2 introduced the key components of DW-BIM architecture. The current section adds some practical considerations related to defining and maintaining the DW-BIM architecture.

Successful DW-BIM architecture requires the identification and bringing together of a number of key roles, potentially from other major functions, including:

  • Technical Architect: Hardware, operating systems, databases, and DW-BIM architecture.
  • Data Architect: Data analysis, systems of record, data modeling, and data mapping.
  • ETL Architect / Design Lead: Staging and transform, data marts, and schedules.
  • Meta-data Specialist: Meta-data interfaces, meta-data architecture, and contents.
  • BI Application Architect / Design Lead: BI tool interfaces and report design, meta-data delivery, data and report navigation, and delivery (such as push, pull, canned, ad-hoc).

DW-BIM needs to leverage many of the disciplines and components of a company’s IT department and business functions; thus, another key set of initial activities includes assessing and integrating the appropriate business processes, architectures, and technology standards, including ones for:

  • Servers.
  • Databases.
  • Database gold copies (systems of record) identification and business sign-off.
  • Security.
  • Data retention.
  • ETL tools.
  • Data quality tools.
  • Meta-data tools.
  • BI tools.
  • Monitoring and management tools and reports.
  • Schedulers and schedules, including standard business and calendar key schedules.
  • Error handling processes and procedures.

Technical requirements, including performance, availability, and timing needs, are key drivers in developing the DW-BIM architecture. The DW-BIM architecture should answer the basic questions of what data goes where, when, why, and how, with the ‘how’ covering the hardware and software detail. The architecture is the organizing framework that brings all the activities together.

The design decisions and principles for what data detail the DW contains are a key design priority for DW-BIM architecture. Publishing clear rules for which data will only be available via operational reporting (i.e., outside the DW) is critical to the success of DW-BIM efforts. The best DW-BIM architectures design a mechanism to connect from atomic DW data back to transaction-level and operational-level reports. Having this mechanism is part of the art of good DW design, and it protects the DW from having to carry every transactional detail.

An example is providing a viewing mechanism for key operational reports or forms based on a transactional key, such as Invoice Number. Customers will always want all the detail available, but some of the operational data has value only in the context of the original report, and does not provide analytic value, such as long description fields.

It is very important that the DW-BIM architecture integrate with the overall corporate reporting architecture. Many different design techniques exist, but one helpful technique is to focus on defining business-appropriate Service Level Agreements (SLAs). Often the response time, data retention, and availability requirements differ greatly between classes of business needs and their respective supporting systems, such as operational reporting versus the DW versus data marts. Several tables in Section 9.2.2 will be helpful in considering the varying aspects of the different design components of the DW-BIM architecture.

Another critical success factor is to identify a plan for data re-use, sharing, and extension. The DW-Bus matrix introduced in Section 9.2.5.9 provides a good organizing paradigm.

Finally, no DW-BIM effort can be successful without business acceptance of the data. Business acceptance includes the data being understandable, having verifiable quality, and having a demonstrable lineage. Business sign-off on the data should be part of User Acceptance Testing. Structured random testing of the data in the BIM tool against data in the source systems, over the initial load and a few update load cycles, should be performed to meet sign-off criteria. Meeting these requirements is paramount for every DW-BIM architecture. Consider, up-front, a few critically important architectural sub-components, along with their supporting activities:

  • Data quality feedback loop: How easy is the integration of needed changes into operational systems?
  • End-to-end meta-data: Does the architecture support the integrated end-to-end flow of meta-data? In particular, is transparency and availability of meaning and context designed in across the architecture? Do the architecture and design support easy access to answers when the business wants to know “What does this report, this data element, or this metric mean?”
  • End-to-end verifiable data lineage: To use modern, popular, TV parlance, is the evidence chain-of-custody for all DW-BIM data readily verifiable? Is a system of record for all data identified?

9.3.3 Implement Data Warehouses and Data Marts

Data warehouses and data marts are the two major classes of formal data stores in the DW-BIM landscape.

The purpose of a data warehouse is to integrate data from multiple sources and then serve up that integrated data for BI purposes. This consumption is typically through data marts or other systems (e.g. a flat file to a data-mining application). The design of a data warehouse is a relational database design with normalization techniques. Ideally, a single data warehouse will integrate data from multiple source systems, and serve data to multiple data marts.

The primary purpose of data marts is to provide data for analysis to knowledge workers. Successful data marts provide access to this data in a simple, understandable, and well-performing manner. Dimensional modeling and design (using de-normalization techniques), as introduced in Section 9.2.5, has largely been the technique of choice for designing user-oriented data marts. Create a data mart to meet specialized business analysis needs. Data marts often include aggregated and summarized information to support faster analysis. Kimball’s vision has only data marts and no normalized DW layer.
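
A minimal sketch of a dimensional data mart, using pandas with illustrative table, column, and measure names (dim_product, dim_date, and fact_sales are assumptions, not prescribed structures), shows how atomic facts join to de-normalized dimensions for analysis:

    import pandas as pd

    # Illustrative dimension tables: de-normalized, business-friendly attributes.
    dim_product = pd.DataFrame({
        "product_key": [1, 2],
        "product_name": ["Widget", "Gadget"],
        "category": ["Hardware", "Hardware"],
    })
    dim_date = pd.DataFrame({
        "date_key": [20240101, 20240102],
        "month": ["2024-01", "2024-01"],
        "quarter": ["2024-Q1", "2024-Q1"],
    })

    # Fact table: surrogate keys plus numeric measures at the atomic grain.
    fact_sales = pd.DataFrame({
        "date_key": [20240101, 20240101, 20240102],
        "product_key": [1, 2, 1],
        "sales_amount": [100.0, 250.0, 75.0],
        "units": [2, 1, 1],
    })

    # A typical data mart query: join facts to dimensions, then aggregate.
    report = (
        fact_sales
        .merge(dim_product, on="product_key")
        .merge(dim_date, on="date_key")
        .groupby(["month", "category"], as_index=False)[["sales_amount", "units"]]
        .sum()
    )
    print(report)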

Chapter 5 covers detailed data design, and database design, in particular. The references at the end of the chapter provide a number of excellent books on DW-BIM implementation methods.

In review, the use of data warehouses and data marts can be considered an application of one of Covey’s famous Seven Habits4: begin with the end in mind. First, identify the business problem to solve, then identify the end solution (the piece of software and its associated data mart). From there, work back to the integrated data required (the data warehouse) and, ultimately, all the way back to the data sources.

9.3.4 Implement Business Intelligence Tools and User Interfaces

The maturity of the BI market and the wide range of available BI tools make it rare for companies to build their own BI tools.5* The purpose of this section is to introduce the types of tools available in the BI marketplace, give an overview of their chief characteristics, and provide some information to help match the tools to the appropriate customer-level capabilities.

Implementing the right BI tool or User Interface (UI) is about identifying the right tools for the right user set. Almost all BI tools also come with their own meta-data repositories to manage their internal data maps and statistics. Some vendors make these repositories open to the end user, while some allow business meta-data to be entered. Enterprise meta-data repositories must link to, or copy from, these repositories to get a complete view of the reporting and analysis activity that the tool is providing. Chapter 11 covers Meta-Data Management.

9.3.4.1 Query and Reporting Tools

Query and reporting is the process of querying a data source and then formatting the results to create a report, either a production-style report such as an invoice, or a management report.

The needs within business operations reporting are often different from the needs within business query and reporting. Yet sometimes, the needs blur and lines cross. Just as you can use a hammer to get a screw into the wall, you can use a business operations-reporting tool for management reporting. The converse, however, is not true; rarely can you use a business query and reporting tool to develop business operations reports. A business query tool may not support pixel-perfect layouts, normalized data sources, or the programmability that IT developers demand.

With business query and reporting, the data source is more often a data warehouse or data mart (though not always). While IT develops production reports, power users and casual business users develop their own reports with business query tools. Table 9.8 compares characteristics that help distinguish business operations-style reports from business query and reporting-style reports, and maps each class of report to its primary class of users. These characteristics are by no means absolute, and you will not necessarily find vendor tools that fit either precisely. Business operations reports are not necessarily pixel-perfect, although some are. Reports generated with business query tools may be used individually, departmentally, or enterprise-wide.

Characteristic | Business Operations Reports | Business Query and Reporting
Primary Author | IT developer | Power users or business users
Purpose | Document preparation | Decision making
Report Delivery | Paper or email, embedded in an application | Portal, spreadsheet, email
Print Quality | Pixel-perfect | Historically presentation quality, now pixel-perfect
User Base | 10s of 1000s | 100s or 1000s
Data Source | OLTP (real time) | Data warehouse or data mart, occasionally OLTP
Level of Data Detail | Atomic | Aggregated, filtered
Scope | Operational | Tactical, strategic
Usage | Often embedded within an OLTP application | BI as a separate application

Table 9.8 Production versus Business Query and Reporting

In the last few years, there has been a tremendous coalescing and collapsing of the marketplace with respect to reporting tools. All of the major BI vendors now offer the classic pixel-perfect reporting capabilities that were once primarily the domain of application reports. From a simple cost perspective, the delivery mechanism and infrastructure for reports, or for information generally, are agnostic to the content or type of information. In other words, it is prudent for companies to leverage common infrastructure and delivery mechanisms, including the web, email, and applications, for the delivery of all kinds of information and reports, of which DW-BIM output is a subset.

Production reporting crosses the DW-BIM boundary and often queries transactional systems to produce operational items such as invoices or bank statements. The developers of production reports tend to be IT personnel.

Business query and reporting tools enable users who want to author their own reports, or create outputs for use by others. They are less concerned with the precise layout because they are not trying to generate an invoice or the like. However, they do want charts and tables quickly and intuitively. Some tools focus on innovative visualization of the data as a means to show meaning in the data with data maps, and moving landscapes of data over time. The formatting capabilities vary dramatically in this segment. Tools in this segment are referred to as ad hoc query tools. Often the reports created by business users become standard reports, not exclusively used for ad hoc business questions.

Figure 9.5 relates the classes of BI tools to the respective classes of BI users for those tools.

Figure 9.5 What BI Tools for What Users?

In defining the target user groups, there is a spectrum of BI needs. First, know your user groups and then match the tool to the user groups in your company. On one end, IT developers may be most concerned with extracting the data, and focus on advanced functionality. On the other end of the spectrum, information consumers may want fast access to previously developed and executed reports. These consumers may want some degree of interactivity such as drill, filter, sort, or may only want to see a static report.

Keep in mind that drilling is an OLAP capability. Is it a need just for analysts or power users, or is it something that customers, suppliers, and casual users would also like, but that perhaps has not been possible in the past?

You need to understand how all classes of users, including Web users, expect to use the tool. Will the Web be just a delivery mechanism, or also a report-authoring environment? Will you provide offline access for reports that are available over the Web, and if so, how?

Users may move from one class to another as their skills increase or as they perform different business functions. A supply chain manager, for example, may want to view a static financial report, but will want a highly interactive report for analyzing inventory. A financial analyst or a line manager responsible for expenses may be a power user when analyzing total expenses, but a casual consumer when viewing a static report of a single phone bill.

External users typically look at static reports, such as a summary of their activity. Increasingly, however, companies are providing more interactive extranet reporting for their best customers and biggest suppliers. Front-line workers may use static, published reports, or a nugget of information embedded within an application. Executives and managers will use a combination of fixed reports, dashboards, and scorecards. Managers and power users tend to want to drill into these reports and slice and dice the data to identify the root cause of problems.

9.3.4.2 On Line Analytical Processing (OLAP) Tools

OLAP provides interactive, multi-dimensional analysis with different dimensions and different levels of detail. Section 9.2.3.2, Multi-dimensional Analysis—MOLAP, briefly introduced this topic. This section covers OLAP tools, which provide for the arrangement of data into OLAP cubes for fast analysis.

Typically, cubes in BI tools are generated from a star (or snowflake) database schema. The OLAP cubes consist of numeric facts, called measures, drawn from the fact tables. These cubes may be built virtually on demand or pre-computed in batch jobs. The dimensions categorize the facts in the respective schema (see Section 9.2.2).

The value of OLAP tools and cubes is that they reduce the chance of confusion and erroneous interpretation by aligning the data content with the analyst’s mental model. The analyst can navigate through the database, screen for a particular subset of the data, change the data’s orientation, and define analytical calculations. Slice-and-dice is the user-initiated process of navigating interactively through page displays by specifying slices via rotations and drilling down or up. Common OLAP operations include slice and dice, drill down, drill up, roll up, and pivot, as illustrated in the sketch following the list below.

  • Slice: A slice is a subset of a multi-dimensional array corresponding to a single value for one or more members of the dimensions not in the subset.
  • Dice: The dice operation is a slice on more than two dimensions of a data cube, or more than two consecutive slices.
  • Drill Down / Up: Drilling down or up is a specific analytical technique whereby the user navigates among levels of data, ranging from the most summarized (up) to the most detailed (down).
  • Roll-up: A roll-up involves computing all of the data relationships for one or more dimensions. To do this, define a computational relationship or formula.
  • Pivot: To change the dimensional orientation of a report or page display.
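
The following is a minimal sketch of these operations using pandas on an illustrative in-memory fact table (the region, product, quarter, and amount names are assumptions); a full OLAP tool would perform the same navigation against a cube:

    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["East", "East", "West", "West", "East", "West"],
        "product": ["Widget", "Gadget", "Widget", "Gadget", "Widget", "Widget"],
        "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
        "amount":  [100, 150, 80, 120, 90, 110],
    })

    # Build a simple cube: the measure aggregated across the dimensions.
    cube = pd.pivot_table(sales, values="amount",
                          index=["region", "product"], columns="quarter",
                          aggfunc="sum", fill_value=0)

    # Slice: fix a single member of one dimension (quarter = Q1).
    slice_q1 = sales[sales["quarter"] == "Q1"]

    # Dice: fix members on more than one dimension (East region, Widget product).
    dice = sales[(sales["region"] == "East") & (sales["product"] == "Widget")]

    # Roll up / drill up: aggregate away the product dimension to the region level.
    rollup_by_region = sales.groupby("region", as_index=False)["amount"].sum()

    # Drill down: return to the more detailed region-and-product level.
    drill_down = sales.groupby(["region", "product"], as_index=False)["amount"].sum()

    # Pivot: change the orientation so products become columns instead of rows.
    pivoted = pd.pivot_table(sales, values="amount",
                             index="region", columns="product",
                             aggfunc="sum", fill_value=0)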

9.3.4.3 Analytic Applications

Henry Morris of IDC first coined the term “analytic applications” in the mid-1990s, clarifying how they differ from OLAP and BI tools in general.6 Analytic applications include the logic and processes to extract data from well-known source systems (such as vendor ERP systems), a data model for the data mart, and pre-built reports and dashboards. Analytic applications provide businesses with a pre-built solution to optimize a functional area (people management, for example) or an industry vertical (retail analytics, for example).

Different types of analytic applications include customer, financial, supply chain, manufacturing, and human resource applications.

The buy-versus-build decision greatly influences the nuances within analytic applications. When you buy an analytic application, you buy the data model and pre-built cubes and reports with functional metrics. These purchased applications tell you what is important and what you should be monitoring, and provide some of the technology to help you get to value faster. For example, with a general BI tool, you determine how and whether to calculate business measures, such as average sale per store visit, and in which reports they should appear; a pre-built analytic application provides these and other metrics for you. On the build side, some vendors provide a development environment for assembling analytic applications.

The value proposition of analytic applications is in the quick start, such as the shortened time-to-market and delivery. Some of the key questions for evaluation of analytic applications are:

  1. Do we have the standard source systems for which ETL is supplied? If yes, how much have we modified it? Less modification equals more value and a better fit.
  2. How many other source systems do we need to integrate? The fewer the sources, the better the value and fit.
  3. How much do the canned industry standard queries, reports, and dashboards match our business? Involve your business analysts and customers and let them answer that!
  4. How much of the analytic application’s infrastructure matches your existing infrastructure? The better the match, the better the value and fit.

9.3.4.4 Implementing Management Dashboards and Scorecards

Dashboards and scorecards are both ways of efficiently presenting performance information. Typically, dashboards are oriented more toward dynamic presentation of operational information, while scorecards are more static representations of longer-term organizational, tactical, or strategic goals. Scorecards focus on given metrics and compare them to targets, often reflecting a simple status of red, yellow, or green based on business rules; dashboards typically present multiple numbers in many different ways.

Typically, scorecards are divided into four quadrants or views of the organization (Finance, Customer, Environment, and Employees), though there is flexibility depending on the priorities of the organization. Each quadrant has a number of metrics that are reported and trended against targets set by senior executives. Variance to target is shown, usually with a root cause or comment accompanying each metric. Reporting is usually on a set interval, and ownership of each metric is assigned so that performance improvement expectations can be enforced.
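
As a minimal sketch of the status and variance logic, assuming illustrative metrics, targets, owners, and a simple 5% tolerance rule (all of which a real scorecard would take from the business), the calculation might look like this:

    import pandas as pd

    # Illustrative scorecard rows: one metric per quadrant, with targets and owners.
    scorecard = pd.DataFrame({
        "quadrant": ["Finance", "Customer", "Environment", "Employees"],
        "metric":   ["Operating margin %", "On-time delivery %",
                     "Energy use index", "Voluntary turnover %"],
        "actual":   [11.5, 93.0, 102.0, 9.0],
        "target":   [12.0, 95.0, 100.0, 8.0],
        "owner":    ["CFO", "VP Ops", "VP Facilities", "VP HR"],
        "higher_is_better": [True, True, False, False],
    })

    def status(actual, target, higher_is_better, tolerance=0.05):
        # Illustrative business rule: green when the target is met,
        # yellow when within 5% of it, red otherwise.
        gap = (actual - target) / target if higher_is_better else (target - actual) / target
        if gap >= 0:
            return "green"
        if gap >= -tolerance:
            return "yellow"
        return "red"

    scorecard["variance_to_target"] = scorecard["actual"] - scorecard["target"]
    scorecard["status"] = scorecard.apply(
        lambda r: status(r["actual"], r["target"], r["higher_is_better"]), axis=1)
    print(scorecard[["quadrant", "metric", "variance_to_target", "status", "owner"]])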

In his book on performance dashboards, Wayne Eckerson provides in-depth coverage of the types and architectures of dashboards. The purpose of presenting this information is to provide an example of the way various BI techniques combine to create a rich, integrated BI environment. Figure 9.6 is an adaptation of a related TDWI publication7*.

Figure 9.6 The Three Threes of Performance Dashboards

9.3.4.5 Performance Management Tools

Performance management applications include budgeting, planning, and financial consolidation. There have been a number of major acquisitions in this segment, as ERP vendors and BI vendors see great growth opportunities here and believe BI and performance management are converging. On the buying side, the degree to which customers buy BI and performance management from the same vendor depends on product capabilities, but also on the degree to which the CFO and CIO co-operate. It is important to note that budgeting and planning apply not only to financial metrics, but to workforce, capital, and so on, as well.

9.3.4.6 Predictive Analytics and Data Mining Tools

Data mining is a particular kind of analysis that reveals patterns in data using various algorithms. Whereas standard query and reporting tools require you to ask a specific question, a data mining tool helps users discover relationships or patterns in a more exploratory fashion. Predictive analytics (“what-if” analysis) allows users to create a model, test the model against actual data, and then project future results. Underlying engines may be based on neural networks or inference engines.

Data mining is used in predictive analysis, fraud detection, root cause analysis (through clustering), customer segmentation and scoring, and market basket analysis. Although data mining is one segment of the BI market, it remains an application reserved for specialist users. In the past, statisticians have largely extracted data from source systems and data warehouses to perform analyses outside of the BI environment. Recent partnerships between BI and database vendors are providing tighter coupling and integration of analytic processing and database capabilities. Typically, flat file extracts are used to train the engine, and then a full run on a source database is performed, producing statistical reports and charts.

Note that a good strategy for interfacing with many data mining tools is to work with the business analysts to define the data set needed for analysis, and then arrange for a periodic file extract. This strategy offloads the intense multi-pass processing involved in data mining from the DW, and many data mining tools accept file-based input as well.
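
A minimal sketch of this extract-and-mine pattern, assuming scikit-learn is available and using an illustrative extract file and column names (customer_extract.csv, recency_days, frequency, and monetary are assumptions), might perform customer segmentation as follows:

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Periodic file extract agreed with the business analysts (illustrative path and columns).
    customers = pd.read_csv("customer_extract.csv")  # customer_id, recency_days, frequency, monetary

    features = customers[["recency_days", "frequency", "monetary"]]
    scaled = StandardScaler().fit_transform(features)

    # Simple customer segmentation; the number of segments is an assumption to review with analysts.
    model = KMeans(n_clusters=4, n_init=10, random_state=0)
    customers["segment"] = model.fit_predict(scaled)

    # Results flow back as a flat file rather than as heavy multi-pass processing inside the DW.
    customers.to_csv("customer_segments.csv", index=False)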

9.3.4.7 Advanced Visualization and Discovery Tools

Advanced visualization and discovery tools often use an in-memory architecture to allow users to interact with the data in a highly visual, interactive way. Patterns in a large dataset can be difficult to recognize in a display of numbers, but can often be picked up visually quite quickly when thousands of data points are loaded into a sophisticated display on a single page.

The difference in these tools versus most dashboard products is usually in:

  1. The degree of analytical sophistication and the visualization types offered, such as small multiples, sparklines, heat maps, histograms, waterfall charts, bullet graphs, and so on.
  2. Adherence to best practices according to the visualization community.
  3. The degree of interactivity and visual discovery versus creating a chart on a tabular data display.

9.3.5 Process Data for Business Intelligence

The lion’s share of the work in any DW-BIM effort is in the preparation and processing of the data. This section introduces some of the architectural components and sub-activities involved in processing data for BI.

9.3.5.1 Staging Areas

A staging area is the intermediate data store between an original data source and the centralized data repository. All required cleansing, transformation, reconciliation, and relationships happen in this area.

Advanced architectures implement these processes in a well-defined and progressive manner. Dividing the work reduces overall complexity and makes debugging much simpler. A common, simple strategy is an initial staging area that holds a complete set of data offloaded from the respective source system as-is, i.e., with no transforms.

A change-capture mechanism reduces the volume of transmitted data sets. Several months to a few years of data can be stored in this initial staging area. Benefits of this approach include:

  • Improving performance on the source system by allowing limited history to be stored there.
  • Pro-active capture of a full set of data, allowing for future needs.
  • Minimizing the time and performance impact on the source system by having a single extract.
  • Pro-active creation of a data store that is not subject to transactional system limitations.

Use subsequent design components to filter the data down to what is needed for business priorities, and to perform conforming and normalization iteratively and progressively. Designs that further separate data conforming, such as conforming types and value sets, from merging and normalization will be simpler to maintain. Many architectures call this layer data integration and transformation, to distinguish it from the simple copy-only staging area.
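
A minimal sketch of this layered flow, using pandas with illustrative file and column names (the prior and current extract files, country_code, and customer_since are assumptions), separates change capture from conforming:

    import pandas as pd

    # As-is copies from the source system (initial staging; no transforms applied).
    previous = pd.read_parquet("staging/customer_extract_prior.parquet")
    current  = pd.read_parquet("staging/customer_extract_today.parquet")

    # Change capture: keep only rows that are new or whose attribute values changed
    # since the prior extract, reducing the volume passed downstream.
    compared = current.merge(previous, how="left", on=list(current.columns), indicator=True)
    changed  = compared.loc[compared["_merge"] == "left_only", list(current.columns)]

    # Conforming step, kept separate from merging and normalization: standardize
    # types and value sets before any integration logic runs.
    conformed = changed.assign(
        country_code=changed["country_code"].str.upper().str.strip(),
        customer_since=pd.to_datetime(changed["customer_since"], errors="coerce"),
    )

    conformed.to_parquet("integration/customer_conformed.parquet", index=False)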

9.3.5.2 Mapping Sources and Targets

Source-to-target mapping is the documentation activity that defines data type details and transformation rules for all required entities and data elements, from each individual source to each individual target. DW-BIM adds requirements to the classic source-to-target mapping encountered in any typical data migration. In particular, one of the goals of the DW-BIM effort should be to provide a complete lineage for each data element available in the BI environment, all the way back to its respective source(s).

The most difficult part of any mapping effort is determining valid links between data elements in multiple equivalent systems. Consider the effort to consolidate data into an EDW from multiple billing or order management systems. Chances are that tables and fields containing equivalent data do not have the same names or structures. A solid taxonomy is necessary to match the data elements in different systems to a consistent structure in the EDW. Gold sources, or system-of-record sources, must be signed off by the Business.
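
A minimal sketch of how source-to-target mappings can be recorded so that lineage questions are answerable directly from the documentation follows; the system, table, and column names are illustrative assumptions:

    # Illustrative source-to-target mapping entries. Each entry records the data type
    # details, the transformation rule, and lineage back to its source.
    SOURCE_TO_TARGET_MAP = [
        {
            "source_system": "billing_east",          # gold source signed off by the Business
            "source_table":  "INVOICE_HDR",
            "source_column": "INV_NO",
            "source_type":   "CHAR(12)",
            "target_table":  "fact_invoice",
            "target_column": "invoice_number",
            "target_type":   "VARCHAR(12)",
            "transform":     "trim leading zeros",
        },
        {
            "source_system": "order_mgmt_west",
            "source_table":  "ORDERS",
            "source_column": "ORDER_AMT_CENTS",
            "source_type":   "INTEGER",
            "target_table":  "fact_invoice",
            "target_column": "invoice_amount",
            "target_type":   "DECIMAL(12,2)",
            "transform":     "divide by 100",
        },
    ]

    def lineage_for(target_table, target_column):
        """Answer 'where did this data element come from?' from the mapping itself."""
        return [m for m in SOURCE_TO_TARGET_MAP
                if m["target_table"] == target_table and m["target_column"] == target_column]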

9.3.5.3 Data Cleansing and Transformations (Data Acquisition)

Data cleansing focuses on the activities that correct and enhance the domain values of individual data elements, including enforcement of standards. Cleansing is particularly necessary for initial loads where significant history is involved. The preferred strategy is to push data cleansing and correction activity back to the source systems, whenever possible.

Strategies must be developed for rows of data that are loaded but later found to be incorrect. A policy of deleting old records may cause havoc with related tables and surrogate keys; expiring the incorrect row and loading the corrected data as a new row may be a better option.
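
A minimal sketch of the expire-and-reload option, similar in spirit to a Type 2 slowly changing dimension and using illustrative column names (surrogate_key, effective_date, expiration_date are assumptions), might look like this:

    import pandas as pd

    def expire_and_reload(table, corrected_rows, business_key, load_date):
        """Rather than deleting incorrect rows, expire them and load corrections as new rows.

        `table` is the current warehouse table as a DataFrame with surrogate keys and
        effective/expiration dates; the column names are illustrative assumptions.
        """
        keys = corrected_rows[business_key].unique()
        is_affected = table[business_key].isin(keys) & table["expiration_date"].isna()

        # Expire the incorrect rows instead of deleting them, so surrogate keys
        # referenced by related tables remain valid.
        table.loc[is_affected, "expiration_date"] = load_date

        # Load the corrected data as brand-new rows with new surrogate keys.
        next_key = table["surrogate_key"].max() + 1
        new_rows = corrected_rows.copy()
        new_rows["surrogate_key"] = range(next_key, next_key + len(new_rows))
        new_rows["effective_date"] = load_date
        new_rows["expiration_date"] = pd.NaT

        return pd.concat([table, new_rows], ignore_index=True)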

Data transformation focuses on activities that provide organizational context between data elements, entities, and subject areas. Organizational context includes cross-referencing, reference and master data management (see Chapter 8), and complete and correct relationships. Data transformation is an essential component of being able to integrate data from multiple sources. Data transformation development requires extensive involvement with Data Governance.

9.3.6 Monitor and Tune Data Warehousing Processes

Transparency and visibility are the key principles that should drive DW-BIM monitoring. The more the details of DW-BIM activities are exposed, the more end-customers can see and understand what is going on (and have confidence in the BI), and the less direct end-customer support will be required. Providing a dashboard that exposes the high-level status of data delivery activities, with drill-down capability, is a best practice that allows on-demand pull of information by both support personnel and customers. Adding data quality measures enhances the value of this dashboard, since performance is more than just speed and timing.
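
A minimal sketch of the data behind such a dashboard, assuming an illustrative run log produced by the load processes (subject_area, rows_rejected, and the other names are assumptions), rolls the detail up to a customer-facing summary with a simple data quality measure:

    import pandas as pd

    # Illustrative run log written by the load processes.
    run_log = pd.DataFrame({
        "subject_area":  ["Sales", "Sales", "Finance", "HR"],
        "load_name":     ["fact_sales", "dim_product", "fact_gl", "dim_employee"],
        "status":        ["success", "success", "failed", "success"],
        "rows_loaded":   [125000, 4300, 0, 8200],
        "rows_rejected": [12, 0, 0, 3],
        "finished_at":   pd.to_datetime(["2024-01-02 03:10", "2024-01-02 02:40",
                                         "2024-01-02 03:55", "2024-01-02 02:15"]),
    })

    # High-level, customer-facing view: one row per subject area, with a simple
    # data quality measure (reject rate) alongside completion status and timing.
    summary = run_log.groupby("subject_area").agg(
        loads=("load_name", "count"),
        failures=("status", lambda s: (s == "failed").sum()),
        rows_loaded=("rows_loaded", "sum"),
        rows_rejected=("rows_rejected", "sum"),
        last_finished=("finished_at", "max"),
    )
    summary["reject_rate"] = summary["rows_rejected"] / summary["rows_loaded"].clip(lower=1)
    print(summary)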

Processing should be monitored across the system for bottlenecks and dependencies among processes. Database tuning techniques should be employed where and when needed, including partitioning and tuned backup and recovery strategies. Archiving is a difficult subject in data warehousing. Users often treat the data warehouse as an active archive because of the long histories it accumulates, and they are unwilling to see the data warehouse engage in archiving, particularly if the source systems have dropped records.

Management by exception is a great policy to apply here. Sending success messages will typically result in ignored messages, but sending attention messages upon failure is a prudent addition to a monitoring dashboard.

9.3.7 Monitor and Tune BI Activity and Performance

A best practice for BI monitoring and tuning is to define and display a set of customer-facing satisfaction metrics. Average query response time and the number of users per day / week / month are examples of useful metrics to display. In addition to displaying the statistical measures available from the systems, it is useful to survey DW-BIM customers regularly.

Regular review of usage statistics and patterns is essential. Reports providing frequency and resource usage of data, queries, and reports allow prudent enhancement. Tuning BI activity is analogous to the principle of profiling applications in order to know where the bottlenecks are and where to apply optimization efforts. The creation of indexes and aggregations is most effective when done according to usage patterns and statistics. Tremendous performance gains can come from simple solutions such as posting the completed daily results to a report that runs hundreds or thousands of times a day.
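
A minimal sketch of this kind of usage profiling, assuming an illustrative BI query log (report_name, user_id, and response_sec are assumed column names), surfaces frequently run, slow reports as candidates for aggregation or a posted daily result:

    import pandas as pd

    # Illustrative BI query log collected by the reporting environment.
    query_log = pd.DataFrame({
        "report_name":  ["Daily Sales", "Daily Sales", "Inventory Aging", "Daily Sales", "Ad hoc"],
        "user_id":      ["u1", "u2", "u3", "u1", "u4"],
        "response_sec": [42.0, 39.5, 3.2, 41.0, 12.7],
    })

    # Customer-facing satisfaction metrics.
    avg_response = query_log["response_sec"].mean()
    distinct_users = query_log["user_id"].nunique()

    # Profile usage to see where optimization effort pays off: frequently run,
    # slow reports are the best candidates for an aggregate or posted result.
    profile = query_log.groupby("report_name").agg(
        runs=("response_sec", "count"),
        avg_response_sec=("response_sec", "mean"),
    ).sort_values(["runs", "avg_response_sec"], ascending=False)

    candidates = profile[(profile["runs"] >= 3) & (profile["avg_response_sec"] > 30)]
    print(avg_response, distinct_users)
    print(candidates)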

9.4 Summary

The guiding principles for implementing data warehousing and business intelligence management in an organization, a summary table of the roles for each data warehousing and business intelligence activity, and the organizational and cultural issues that may arise during data warehousing and business intelligence management are summarized below.

9.4.1 Guiding Principles

The implementation of the data warehousing and business intelligence management function in an organization follows eleven guiding principles:

  1. Obtain executive commitment and support. These projects are labor intensive.
  2. Secure business SMEs. Their support and high availability are necessary for getting the correct data and a useful BI solution.
  3. Be business focused and driven. Make sure DW / BI work is serving real priority business needs and solving burning business problems. Let the business drive the prioritization.
  4. Demonstrable data quality is essential. Critical to DW / BI success is being able to answer basic questions like “Why is this sum X?” “How was that computed?” and “Where did the data come from?”
  5. Provide incremental value. Ideally deliver in continual 2-3 month segments.
  6. Transparency and self service. The more context (meta-data of all kinds) provided, the more value customers derive. Wisely exposing information about the process reduces calls and increases satisfaction.
  7. One size does not fit all. Make sure you find the right tools and products for each of your customer segments.
  8. Think and architect globally, act and build locally. Let the big-picture and end-vision guide the architecture, but build and deliver incrementally, with much shorter term and more project-based focus.
  9. Collaborate with and integrate all other data initiatives, especially those for data governance, data quality, and meta-data.
  10. Start with the end in mind. Let the business priority and scope of end-data-delivery in the BI space drive the creation of the DW content. The main purpose for the existence of the DW is to serve up data to the end business customers via the BI capabilities.
  11. Summarize and optimize last, not first. Build on the atomic data and add aggregates or summaries as needed for performance, but not to replace the detail.

9.4.2 Process Summary

The process summary for the data warehousing and business intelligence management function is shown in Table 9.9. The deliverables, responsible roles, approving roles, and contributing roles are shown for each activity in the data warehousing and business intelligence management function. The Table is also shown in Appendix A9.

7.1 Understand Business Intelligence Information Needs (P)
  Deliverables: DW-BIM Project Requirements
  Responsible Roles: Data / BI Analyst, BI Program Manager, SME
  Approving Roles: Data Steward, Business Executives and Managers
  Contributing Roles: Meta-Data Specialist, Business Process Lead

7.2 Define the Data Warehouse / BI Architecture (P) (same as 2.1.5)
  Deliverables: Data Warehouse / Business Intelligence Architecture
  Responsible Roles: Data Warehouse Architect, Business Intelligence Architect
  Approving Roles: Enterprise Data Architect, DM Executive, CIO, Data Architecture Steering Committee, Data Governance Council
  Contributing Roles: Business Intelligence Specialists, Data Integration Specialists, DBAs, Other Data Mgmt. Professionals, IT Architects

7.3 Implement Data Warehouses and Data Marts (D)
  Deliverables: Data Warehouses, Data Marts, OLAP Cubes
  Responsible Roles: Business Intelligence Specialists
  Approving Roles: Data Warehouse Architect, Data Stewardship Teams
  Contributing Roles: Data Integration Specialists, DBAs, Other Data Mgmt. Professionals, Other IT Professionals

7.4 Implement Business Intelligence Tools and User Interfaces (D)
  Deliverables: BI Tools and User Environments, Query and Reporting, Dashboards, Scorecards, Analytic Applications, etc.
  Responsible Roles: Business Intelligence Specialists
  Approving Roles: Data Warehouse Architect, Data Stewardship Committee, Data Governance Council, Business Executives and Managers
  Contributing Roles: Data Warehouse Architect, Other Data Mgmt. Professionals, Other IT Professionals

7.5 Process Data for Business Intelligence (O)
  Deliverables: Accessible Integrated Data, Data Quality Feedback Details
  Responsible Roles: Data Integration Specialists
  Approving Roles: Data Stewards
  Contributing Roles: Other Data Mgmt. Professionals, Other IT Professionals

7.6 Monitor and Tune Data Warehousing Processes (C)
  Deliverables: DW Performance Reports
  Responsible Roles: DBAs, Data Integration Specialists
  Contributing Roles: IT Operators

7.7 Monitor and Tune BI Activity and Performance (C)
  Deliverables: BI Performance Reports, New Indexes, New Aggregations
  Responsible Roles: Business Intelligence Specialists, DBAs
  Approving Roles: Business Intelligence Analysts
  Contributing Roles: Other Data Mgmt. Professionals, IT Operators, IT Auditors

Table 9.9 DW and BI Management Process Summary

9.4.3 Organizational and Cultural Issues

Q1: I can’t get CEO / CIO support. What can I do?

A1: Try to discover what their burning business problems and issues are and align your project with providing solutions to those.

Q2: How do I balance the pressures of individual project delivery with DW / BI program goals of building out re-usable data and infrastructure?

A2a: Build out re-usable infrastructure and data a piece at a time.

A2b: Use the DW bus matrix as a communication and marketing tool. On a project-by-project basis, negotiate a give-and-take, e.g., “Here are the conformed dimensions that other projects have developed and that this project benefits from,” and “Here are the ones we are asking this project to contribute so that future projects can benefit.”

A2c: Don’t apply the same rigor and overhead to all data sources. Relax the rules / overhead for single source, project-specific data. Use business priorities to determine where to apply extra rigor. In short, use the classic 80 / 20 rule: 80% of the value comes from 20% of the data. Determine what that 20% is and focus on it.

9.5 Recommended Reading

The references listed below provide additional reading that support the material presented in Chapter 9. These recommended readings are also included in the Bibliography at the end of the Guide.

9.5.1 Data Warehousing

Adamson, Christopher. Mastering Data Warehouse Aggregates: Solutions for Star Schema Performance. John Wiley & Sons, 2006. ISBN 0-471-77709-9. 345 pages.

Adamson, Christopher and Michael Venerable. Data Warehouse Design Solutions. John Wiley & Sons, 1998. ISBN 0-471-25195-X. 544 pages.

Adelman, Sid and Larissa T. Moss. Data Warehouse Project Management. Addison-Wesley Professional, 2000. ISBN 0-201-61635-1. 448 pages.

Adelman, Sid and others. Impossible Data Warehouse Situations: Solutions from the Experts. Addison-Wesley, 2002. ISBN 0-201-76033-9. 432 pages.

Brackett, Michael. The Data Warehouse Challenge: Taming Data Chaos. New York: John Wiley & Sons, 1996. ISBN 0-471-12744-2. 579 pages.

Caserta, Joe and Ralph Kimball. The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming and Delivering Data. John Wiley & Sons, 2004. ISBN 0-764-56757-8. 525 pages.

Corey, Michael J. and Michael Abbey. Oracle Data Warehousing: A Practical Guide to Successful Data Warehouse Analysis, Build and Roll-Out. TATA McGraw-Hill, 1997. ISBN 0-074-63069-5.

Covey, Stephen R. The 7 Habits of Highly Effective People. Free Press, 2004. ISBN 0743269519. 384 Pages.

Dyche, Jill. E-Data: Turning Data Into Information With Data Warehousing. Addison- Wesley, 2000. ISBN 0-201-65780-5. 384 pages.

Gill, Harjinder S. and Prakash C. Rao. The Official Guide To Data Warehousing. Que, 1996. ISBN 0-789-70714-4. 382 pages.

Hackney, Douglas. Understanding and Implementing Successful Data Marts. Addison Wesley, 1997. ISBN 0-201-18380-3. 464 pages.

Imhoff, Claudia, Nicholas Galemmo and Jonathan G. Geiger. Mastering Data Warehouse Design: Relational and Dimensional Techniques. John Wiley & Sons, 2003. ISBN 0-471-32421-3. 456 pages.

Imhoff, Claudia, Lisa Loftis and Jonathan G. Geiger. Building the Customer-Centric Enterprise: Data Warehousing Techniques for Supporting Customer Relationship Management. John Wiley & Sons, 2001. ISBN 0-471-31981-3. 512 pages.

Inmon, W. H. Building the Data Warehouse, 4th Edition. John Wiley & Sons, 2005. ISBN 0-764-59944-5. 543 pages.

Inmon, W. H. Building the Operational Data Store, 2nd edition. John Wiley & Sons, 1999. ISBN 0-471-32888-X. 336 pages.

Inmon, W. H., Claudia Imhoff and Ryan Sousa. The Corporate Information Factory, 2nd edition. John Wiley & Sons, 2000. ISBN 0-471-39961-2. 400 pages.

Inmon, W. H. and Richard D. Hackathorn. Using the Data Warehouse. Wiley-QED, 1994. ISBN 0-471-05966-8. 305 pages.

Inmon, William H., John A. Zachman and Jonathan G. Geiger. Data Stores, Data Warehousing and the Zachman Framework. McGraw-Hill, 1997. ISBN 0-070-31429-2. 358 pages.

Kimball, Ralph and Margy Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, 2nd edition. New York: John Wiley & Sons, 2002. ISBN 0-471-20024-7. 464 pages.

Kimball, Ralph, Laura Reeves, Margy Ross and Warren Thornthwaite. The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses. John Wiley & Sons, 1998. ISBN 0-471-25547-5. 800 pages.

Kimball, Ralph and Richard Merz. The Data Webhouse Toolkit: Building the Web-Enabled Data Warehouse. John Wiley & Sons, 2000. ISBN 0-471-37680-9. 416 pages.

Mattison, Rob, Web Warehousing & Knowledge Management. McGraw Hill, 1999. ISBN 0-070-41103-4. 576 pages.

Morris, Henry. Analytic Applications and Business Performance Management. DM Review Magazine, March, 1999. www.dmreview.com. Note: www.dmreview.com is now www.information-management.com.

Moss, Larissa T. and Shaku Atre. Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications. Addison-Wesley, 2003. ISBN 0-201-78420-3. 576 pages.

Poe, Vidette, Patricia Klauer and Stephen Brobst. Building A Data Warehouse for Decision Support, 2nd edition. Prentice-Hall, 1997. ISBN 0-137-69639-6. 285 pages.

Ponniah, Paulraj. Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals. John Wiley & Sons – Interscience, 2001. ISBN 0-471-41254-6. 528 pages.

Westerman, Paul. Data Warehousing: Using the Wal-Mart Model. Morgan Kaufman, 2000. ISBN 155860684X. 297 pages.

9.5.2 Business Intelligence

Biere, Mike. Business Intelligence for the Enterprise. IBM Press, 2003. ISBN 0-131-41303-1. 240 pages.

Bischoff, Joyce and Ted Alexander. Data Warehouse: Practical Advice from the Experts. Prentice-Hall, 1997. ISBN 0-135-77370-9. 428 pages.

Eckerson, Wayne W. Performance Dashboards: Measuring, Monitoring, and Managing Your Business. Wiley, 2005. ISBN-10: 0471724173. 320 pages.

Howson, Cindi. “The Business Intelligence Market”. http://www.biscorecard.com/. Requires annual subscription to this website.

Malik, Shadan. Enterprise Dashboards: Design and Best Practices for IT. Wiley, 2005. ISBN 0471738069. 240 pages.

Moss, Larissa T., and Shaku Atre. Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications. Addison-Wesley, 2003. ISBN 0-201-78420-3. 576 pages.

Vitt, Elizabeth, Michael Luckevich and Stacia Misner. Business Intelligence. Microsoft Press, 2008. ISBN 073562660X. 220 pages.

9.5.3 Data Mining

Cabena, Peter, Pablo Hadjinian, Rolf Stadler, Jaap Verhees and Alessandro Zanasi. Discovering Data Mining: From Concept to Implementation. Prentice Hall, 1997. ISBN-10: 0137439806.

Delmater, Rhonda and Monte Hancock Jr. Data Mining Explained, A Manager’s Guide to Customer-Centric Business Intelligence. Digital Press, Woburn, MA, 2001. ISBN 1-5555-8231-1.

Rud, Olivia Parr. Data Mining Cookbook: Modeling Data for Marketing, Risk and Customer Relationship Management. John Wiley & Sons, 2000. ISBN 0-471-38564-6. 367 pages.

9.5.4 OLAP

Thomsen, Erik. OLAP Solutions: Building Multidimensional Information Systems, 2nd edition. Wiley, 2002. ISBN-10: 0471400300. 688 pages.

Wrembel, Robert and Christian Koncilia. Data Warehouses and OLAP: Concepts, Architectures and Solutions. IGI Global, 2006. ISBN 1599043645. 332 pages.

http://www.olapcouncil.org/research/resrchly.htm
