C H A P T E R  1

Introducing Integration Services

I’m the glue that holds everything together.

—Singer Otis Williams

Your business analysts have finished gathering business requirements. The database architect has designed and built a database that can be described only as a work of art. The BI architects are designing their OLAP cubes and the dimensional data marts that feed them. On the other hand, maybe you’re a one-man show and have designed and built everything yourself. Either way, the only piece that’s missing is a tool to bring it all together. Enter SQL Server Integration Services (SSIS).

Like Otis Williams, cofounder of the Motown group the Temptations, SSIS is the glue that holds it all together. More than that, SSIS is the circulatory system of your data warehousing and BI solutions. SSIS breathes life into your technical solutions by moving data—the lifeblood of your organization—from disparate sources, along well-known paths, and injecting it directly into the heart of your system. Along the way, SSIS can validate, cleanse, manipulate, transform, and enrich your data for maximum effectiveness.

In this book, we’ll take you on a tour of SSIS, from building your very first SSIS package to implementing complex multipackage design patterns seamlessly. This chapter introduces you to SSIS and the concepts behind extract, transform, and load (ETL) processes in general. We begin at the beginning, with a brief history of ETL, Microsoft-style.

A Brief History of Microsoft ETL

Before we dive headfirst into the details of Microsoft’s current-generation ETL processing solution, it’s important to understand just what ETL is. As we have stated, ETL is an acronym for extract, transform, and load, which is a very literal description of modern data manipulation and movement processes. When we talk about ETL, we are specifically talking about (1) extracting data from a source, such as a database or flat files; (2) transforming data, or manipulating and enriching it en route to its destination; and (3) loading data into its destination, often a database.

Over the years, business requirements for data processing in nearly any industry you can point to have grown more complex, even as the amount of data that needs to be processed has increased exponentially. Unfortunately, the number of hours in a day has remained fairly constant over the same time period, meaning you’re stuck with the same limited processing window each day to transport and manipulate an ever-growing magnitude of data. ETL solutions have become increasingly sophisticated and robust in response to these increased data processing demands of performance, flexibility, and quality.

So we’ll begin our journey into SSIS by looking at how ETL has evolved in the SQL Server world. Up to SQL Server 6.5, the bulk copy program (bcp) was the primary tool for loading data into SQL Server databases. A command-line utility, bcp made loading basic text files into database tables fairly simple. Unfortunately, the flip side of that simplicity was that you could use bcp only to load data from flat files, and you couldn’t perform additional validations or transformations on the data during the load. A common database-to-database ETL scenario with bcp might include extracting data from a database server to a delimited text file, importing the file into a SQL Server database, and finally using T-SQL to perform transformations on the data in the database. The bcp utility is still provided with all versions of SQL Server, and is still used for simple one-off data loads from flat files on occasion.

In response to the increasing demands of ETL processing, Data Transformation Services (DTS) made its first appearance in SQL Server 7. With DTS, you could grab data from a variety of sources, transform it on the fly, and load it into the database. Although DTS was a much more sophisticated tool than bcp, it still lacked much of the functionality required to develop enterprise-class ETL solutions.

With the release of SQL Server 2005, Microsoft replaced DTS with SQL Server Integration Services (SSIS). SSIS is a true enterprise ETL solution with several advancements over its predecessors, including built-in logging; support for a wide variety of complex transformation, data validation, and data cleansing components; separation of process control from data flow; support for several types of data sources and destinations; and the ability to create custom components, to name a few.

SSIS in SQL Server 11 represents the first major enhancement to SSIS since its introduction way back in 2005. In this newest release, Microsoft has implemented major improvements in functionality and usability. Some of the new goodness includes the ability to move ETL packages seamlessly between environments, centralized storage and administration of SSIS packages, and a host of usability enhancements. In this book, you’ll explore the core functionality you need to get up and running with SSIS and the advanced functionality you need to implement the most complex ETL processing.

ETL: THE LOST YEARS

What Can SSIS Do for You?

SSIS provides a wide array of out-of-the-box functionality to accomplish common ETL-related tasks. The major tasks you’ll encounter during most ETL processing include the following:

  • Extracting data from a wide variety of sources including flat files, XML, the Internet, Microsoft Excel spreadsheets, and relational and nonrelational databases. If the stock source adapters don’t cover your needs, SSIS’s support for .NET gives you the ability to extract data from literally any data source that you have access to.
  • Validating data according to predefined rules you specify as it moves through your ETL process. You can validate data by using a variety of methods such as ensuring that strings match patterns and that numeric values are within a given range.
  • Performing Data cleansing, or the process of identifying invalid data values and removing them or modifying them to conform to your predefined constraints. Examples include changing negative numbers to zero or removing extra whitespace characters from strings.
  • Deduplicating data, which is the elimination of data records that you consider to be duplicates. For a given process, you may consider entire records that are value-for-value matches to be duplicates; for other processes, you may determine that a value match on a single field (or set of fields), such as Telephone Number, identifies a duplicate record.
  • Loading data into files, databases such as SQL Server, or other destinations. SSIS provides a wide range of stock destination adapters that allow you to output data to several well-defined destinations. As with data extraction, if you have a special destination in mind that’s not supported by the SSIS stock adapters, the built-in .NET support lets you output to nearly any destination you can access.

Nearly any process that you can define in terms of ETL steps can be performed with SSIS. And it’s not just limited to databases (though that is our primary focus in this book). As an example, you can use Windows Management Instrumentation (WMI) to retrieve data about a computer system, format it to your liking, and store it in an Excel spreadsheet; or you can grab data from a comma-delimited file, transform it a bit, and write it back out to a new comma-delimited file. Not to put too fine a point on it, but you can perform just about any task that requires data movement and manipulation with SSIS.

What Is Enterprise ETL?

You’ve seen us refer to SSIS as an enterprise ETL solution in this chapter, and you may have asked yourself, “What is the difference between an enterprise ETL solution and any other ETL solution?” Don’t worry, it’s a common question that we asked once and that has since been asked of us several times. It has a very simple answer: enterprise ETL solutions have the ability to help you meet your nonfunctional requirements in addition to the standard functional requirements of extract, transform, and load.

So what is a nonfunctional requirement and what does it have to do with ETL? If you’ve ever been on a development project for an application or business system, you’re probably familiar with the term. In the previous section, we discussed how SSIS helps you meet your ETL functional requirements—those requirements of a system that describe what it does. In the case of ETL, the functional requirements are generally pretty simple: (1) get data from one or more sources, (2) manipulate the data according to some predefined business logic, and (3) store the data somewhere.

Nonfunctional requirements, on the other hand, deal more with the qualities of the system. These types of requirements deal in aspects such as robustness, performance and efficiency, maintainability, security, scalability, reliability, and overall quality of an ETL solution. We like to think of nonfunctional requirements as the aspects of the system that do not necessarily have a direct effect on the end result or output of the system; instead they work behind the scenes in support of the result generation.

Here are some of the ways SSIS can help you meet your nonfunctional requirements:

  • Robustness is provided in SSIS primarily through built-in error-handling to capture and deal with bad data and execution exceptions as they occur, transactions that ensure consistency of your data should a process enter an unrecoverable processing exception, and checkpoints that allow some ability to restart packages.
  • Performance and efficiency are closely related, but not entirely synonymous, concepts. You can think of performance as the raw speed with which your ETL processes accomplish their tasks. Efficiency digs a bit deeper and includes minimizing resource (memory, CPU, and persistent storage) contention and usage. SSIS has many optimizations baked directly into its data flow components and data flow engine—for instance, to tweak the raw performance and resource efficiency of the data flow. Chapter 14 covers the things you can do to get the most out of the built-in optimizations.
  • Maintainability can be boiled down to the ongoing cost of managing and administering your ETL solution after it’s in production. Maintainability is also one of the easiest items to measure, because you can ask questions such as, “How many hours each month do I have to spend fixing issues in ETL processes?” or “How many hours of manual intervention are required each week to deal with, or to avoid, errors in my ETL process?” SSIS provides a new project deployment model to make it easier to move ETL projects from one environment to the next; and BIDS provides built-in support for source control systems such as Team Foundation Server (TFS) to help minimize the maintenance costs of your solutions.
  • Security is provided in SSIS through a variety of methods and interactions with other systems, including Windows NT File System (NTFS) and SQL Server 11. Package and project deployment to SQL Server is a powerful method of securing your packages. In this case, SQL Server uses its robust security model to control access to, and encryption of, SSIS package contents.
  • Scalability can be defined as how well your ETL solution can handle increasing quantities of data. SSIS provides the ability to scale predictably with your increased demands, providing of course that you create your packages to maximize SSIS’s throughput. We discuss scalable ETL design patterns in Chapter 15.

Image TIP: For in-depth coverage of SSIS design patterns, we highly recommend picking up a copy of SSIS Design Patterns by Andy Leonard, Matt Masson, Tim Mitchell, Jessica Moss, and Michelle Ufford (Apress, 2012).

  • Reliability, put simply, can be defined as how resistant your ETL solution is to failure—and if failure does occur, how well your solution handles the situation. SSIS provides extensive logging capabilities that, when combined with BIDS’s built-in debugging capabilities, can help you quickly track down and fix the root cause of failure. SSIS can also notify you in the event of a failure situation.

All these individual nonfunctional requirements, when taken together, help define the overall quality of your ETL solution. Although it’s relatively simple to put together a package or program that shuttles data from point A to point B, the nonfunctional requirements provide a layer on top of this basic functionality that allows you to meet your service-level agreements (SLAs) and other processing requirements.

SSIS Architecture

One of the major improvements that SSIS introduced over DTS was the separation of the concepts of control flow and data flow. The control flow manages the order of execution for packages and manages communication with support elements such as event handlers. The data flow engine is exposed as a component within the control flow and it provides high-performance data extraction, transformation, and loading services.

As you can see in Figure 1-1, the relationship between control flows, data flows, and their respective components is straightforward.

Image

Figure 1-1. Relationship between control flow and data flow

Simply speaking, a package contains the control flow. The control flow contains control flow tasks and containers, both of which are discussed in detail in Chapter 4. The Data Flow task is a special type of task that contains the data flow. The data flow contains data flow components, which move and manipulate your data. There are three types of data flow components:

  • Sources can pull data from any of a variety of data stores, and feed that data into the data flow.
  • Transformations allow you to manipulate and modify data within the data flow one row at a time.
  • Destinations provide a means for the data flow to output and persist data after it moves through the final stage of the data flow.

Although the simplified diagram in Figure 1-1 shows only a single data flow in the control flow, any given control flow can contain many data flows. As the diagram also illustrates, both control flows and data flows are found within the confines of SSIS packages that you can design, build, and test with Microsoft Business Intelligence Development Studio (BIDS). BIDS is a shell of the Visual Studio integrated development environment (IDE) that .NET programmers are familiar with. Figure 1-2 shows the data flow for a very simple SSIS package in the BIDS designer.

Image

Figure 1-2. Simple SSIS package data flow in BIDS

Since the introduction of SSIS, Microsoft has made significant investments in the infrastructure required to support package execution and enterprise ETL management. In addition to data movement and manipulation, the SSIS infrastructure supports logging, event handling, connection management, and enumeration activities. Figure 1-3 is a simplified pyramid showing the major components of the SSIS infrastructure.

Image NOTE: We introduce BIDS and discuss the new designer features in Chapter 2.

Image

Figure 1-3. SSIS architectural components (simplified)

At the base of the pyramid lie the command-line utilities, custom applications, the SSIS designer, and wizards such as the import/export wizard that provide interaction with SSIS. These applications and utilities are developed in either managed or unmanaged code. The object model layer exposes interfaces that allow these utilities and applications to interact with the Integration Services runtime. The Integration Services runtime, in turn, executes packages and provides support for logging, breakpoints and debugging, connection management and configuration, and transaction control. At the very top of the pyramid is the SSIS package itself, which you design and build in the BIDS environment, to contain the control flow and data flows discussed earlier in this chapter.

BYE-BYE, DATA TRANSFORMATION SERVICES

New SSIS Features

There are a number of improvements in the SQL Server 11 release of SSIS. We discuss these new features and enhancements throughout the book. Before we dig into the details in later chapters, we are summarizing the new features here for your convenience:

  • Project deployment model: The new project deployment model is new to SQL Server 11 SSIS. The key features of this new deployment model are the Integration Services catalog, environments, and parameters. The new deployment model is designed to make deployment and administration of SSIS packages and ETL systems easier across multiple environments. We discuss the project deployment model in Chapter 17.
  • T-SQL views and stored procedures: The new project deployment model includes several new SSIS-specific views and stored procedures for SQL Server. We present these views and stored procedures in Chapter 17, during the discussion of the project deployment model.
  • BIDS usability enhancements: BIDS has been improved to make package development and editing simpler. The BIDS designers are now more flexible, and UI enhancements such as the Integration Services toolbox make it easier to use. We present the new features in BIDS and SQL Server Management Studio (SSMS) in Chapter 3.
  • Object impact and data lineage analysis: The object impact and data lineage analysis feature provides metadata for locating object dependencies—for instance, which tables a package depends on. This new tool provides useful information for troubleshooting performance problems or dependency issues, or when taking a proactive approach to locating dependencies. We demonstrate impact and data lineage analysis in Chapter 12.
  • Improved Merge and Merge Join transformations: The Merge and Merge Join transformations have been improved in SQL Server 11 SSIS by providing better internal controls on memory usage. We cover the new features of the Merge and Merge Join transformations in Chapter 6.
  • Data correction component: The Data Correction transformation provides a tool to help improve data quality. We discuss this new transformation in Chapter 7.
  • Custom data flow component improvements: SQL Server 11 SSIS includes improvements that allow developers to more easily create custom data flow components that support multiple inputs. We explore these new features in the discussion of custom data flow components in Chapter 21.
  • Source and Destination Assistants: The new Source and Destination Assistants are designed to guide you through the steps to create a source or destination. We talk about the new assistants in Chapter 6.
  • Simplified data viewer: The data viewer has been simplified in SQL Server 11 SSIS to make it easier to use. We demonstrate the data viewer in Chapter 8.

Our Favorite People and Places

There are a number of SSIS experts and resources that have been out there since the introduction of SSIS. Here are some of our recommendations, a “best of” list for SSIS on the Web:

Andy Leonard is an SSIS guru and SQL Server Most Valuable Professional (MVP) who has been a head-down, hands-on SSIS developer from day 1. In fact, Andy was a contributing author on the original Professional SQL Server 2005 Integration Services (Wrox, 2006) book, the gold standard for SQL Server 2005 SSIS. Andy’s blog is located at http://sqlblog.com/blogs/andy_leonard/default.aspx.

Jamie Thomson may well be one of the most prolific SSIS experts. Having blogged on hundreds of SSIS topics, Jamie is a SQL Server MVP and the original SSIS Junkie. You can read his newest material at http://sqlblog.com/blogs/jamie_thomson/ or catch up on your SSIS Junkie classic reading at http://consultingblogs.emc.com/jamiethomson.

Brian Knight, founder of Pragmatic Works and a SQL Server MVP, is a well-known writer and trainer on all things BI and all things SSIS. Catch up with Brian at www.bidn.com/blogs/BrianKnight.

Books Online (BOL) is the holy book for all things SQL Server, and that includes SSIS. When you need an answer to a specific SQL Server or SSIS question, there’s a very high probability your search will end at BOL. With that in mind, we like to cut out the middleman and start our searches at http://msdn.microsoft.com/en-us/library/ms130214(v=SQL.110).aspx.

SQLServerCentral.com (SSC) was founded by a roving gang of hard-core DBAs, including the infamous Hawaiian-shirt- and cowboy-hat-wearing Steve Jones, a SQL Server MVP. Steve keeps SSC updated with lots of community-based content covering a wide range of topics including SSIS. When you want to check out the best in community-generated content, go to www.sqlservercentral.com.

SSIS Team Blog, maintained by Microsoft’s own Matt Masson, is located at http://blogs.msdn.com/b/mattm. Check out this blog for the latest and greatest in SSIS updates, patches, and insider tips and tricks.

Allan Mitchell and Darren Green, SSIS experts and SQL MVPs from across the pond, share their expertise at www.sqlis.com.

CodePlex is a Microsoft site hosting open source projects. From www.codeplex.com, you can download a variety of open source projects that include the AdventureWorks family of sample databases, open source SSIS custom components, complete SSIS-based ETL frameworks, ssisUnit (the SSIS unit testing framework), and BIDS Helper. This is one site you want to check out.

Professional Association for SQL Server (PASS) is a professional organization for all SQL Server developers, DBAs, and BI pros. Membership is free, and the benefits include access to the SQL Server Standard magazine. Visit PASS at www.sqlpass.org for more information.

We highly recommend visiting these sites as you learn SSIS or encounter questions about best practices or need guidance on how to accomplish very specific tasks in your packages.

Summary

SSIS is a powerful and flexible enterprise ETL solution, designed to meet your most demanding data movement and manipulation needs. This chapter introduced some of the basic concepts of ETL and how SSIS fits into the SQL Server ETL world. We presented some of the core concepts of the SSIS architecture and talked about the newest features in SQL Server 11 SSIS. We wrapped up with a listing of a few of our favorite people and resource sites. In Chapter 2 we introduce BIDS and SSMS, with an emphasis on the newest features designed to make your ETL package design, build, and testing easier than ever.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset