Chapter 6. Building a DataOps Toolkit

The DataOps cross-functional toolkit is not a single tool or platform—nor can it ever be. Rather, it is part of a framework that prioritizes responding to change over following a plan. This means that the toolkit is inherently a collection of complementary, best-of-breed tools with interoperability and automation at the core of their design.

Interoperability

Interoperability is perhaps the biggest departure in the DataOps stack from the data integration tools or platforms of the past.

Although a good ETL platform certainly appeared to afford interoperability as a core principle, with many out-of-the-box connectors supporting hundreds of data formats and protocols (from Java Database Connectivity [JDBC] to Simple Object Access Protocol [SOAP]), this, in fact, was a reaction to the complete lack of interoperability in the set of traditional data repositories and tools.

In practice, if your ETL platform of choice did not support a native connector to a particular proprietary database, enterprise resource planning (ERP), customer relationship management (CRM), or business suite, that data was simply excluded from any integration project. Conversely, data repositories that did not support an open data exchange format forced users to work within the confines of that repository, often making do with rigid, suboptimal, worst-of-breed, tack-on solutions. The dream of developing a center of excellence around a single data integration vendor quickly evaporated as the rigid constraints of the single vendor tightened with increasing data variety and velocity.

Composable Agile Units

Just as an Agile development team comprises members covering each talent necessary for full feature development, so the Agile DataOps toolkit comprises each function necessary to deliver purpose-fit data in a timely manner. The Agile DataOps toolkit not only allows you to swap a particular tool among vendor, open source, or homegrown solutions, it also allows greater freedom in deciding the boundaries of your composable Agile units.

This is key when trying to compose Agile DataOps teams and tools while working within the realities of your available data engineering and operations skill sets. Whether your data dashboarding team member is a Tableau or Domo wizard does not determine which record matching engine you use. If your dashboarding team can work across many different record matching engines, then your record matching engine should be able to work across many different dashboarding tools. In stark contrast to the single platform approach, DataOps thrives in an environment of plug-and-play tools and capabilities. DataOps tools aspire to the mantra of doing one thing exceptionally well, and thus when taken together present a set of nonoverlapping complementary capabilities that align with your composable units.

Results Import

In practice, however, even DataOps tools can, and often must, overlap in capability. The resolution of this apparent redundancy reveals one of the unique hallmarks of interoperability in the DataOps stack, namely a tool’s ability to both export and import its results for common overlapping capabilities. In composing your DataOps pipeline, this allows your team to decide where the artifacts and results of a particular shared capability are best generated natively and where they are best imported.

Consider that a data extract tool, a data unification tool, and a data catalog are each capable of generating a basic dataset profile (record count, attribute percentage null, etc.):

  • The data extract tool has direct connectivity to the raw sources and can use and extract physical properties and even metrics where available, producing a basic data quality profile.

  • The data unification tool, although capable of generating a profile of the raw data it sees, might not have the same visibility as the extract tool and would benefit from being able to import the extract tool’s profile. This same imported profile then sits alongside any generated profile, for example of a flat file to which the unification tool has direct access.

  • The data cataloging tool, capable of profiling myriad file formats, catalogs both the raw datasets and the unified logical datasets, and benefits by importing both the extract tool’s dataset profiles and the unification tool’s dataset profiles, presenting the most informative view of the datasets.

Although each tool in the just-described DataOps toolkit is capable of generating a dataset profile, it is the ability of each tool to both export and import the dataset profile that is key. This ability allows the DataOps team great flexibility and control in composing a pipeline, designing around the core nonoverlapping ability of each component, while taking advantage of their import and export functionality to afford the interoperability of common abilities.
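
To make this concrete, here is a minimal sketch (in Python) of how such a profile might be exported by the extract tool and imported by the unification tool alongside its own locally generated profile. The JSON structure, field names, and merge rule are illustrative assumptions, not an established exchange format.

import json

# Hypothetical dataset profile produced by the extract tool; the field
# names are illustrative, not a standardized exchange format.
extract_profile = {
    "dataset": "crm_contacts_raw",
    "record_count": 1_250_000,
    "attributes": {
        "email": {"percent_null": 2.4},
        "total_spend": {"percent_null": 11.7},
    },
    "generated_by": "extract-tool",
}

def export_profile(profile: dict, path: str) -> None:
    """Export a dataset profile so that downstream tools can import it."""
    with open(path, "w") as f:
        json.dump(profile, f, indent=2)

def import_profile(path: str, local_profile: dict) -> dict:
    """Import an upstream profile and keep it alongside a locally generated
    one, preferring upstream values where the local tool has poorer
    visibility into the raw source."""
    with open(path) as f:
        upstream = json.load(f)
    merged = dict(local_profile)
    merged.setdefault("imported_profiles", []).append(upstream)
    # A simple, assumed precedence rule: trust the upstream record count.
    merged["record_count"] = upstream.get("record_count",
                                          local_profile.get("record_count"))
    return merged

# Usage: the extract tool exports its profile, and the unification tool
# imports it alongside the (sparser) profile it generated itself.
export_profile(extract_profile, "crm_contacts_raw.profile.json")
local = {"dataset": "crm_contacts_raw", "record_count": None,
         "attributes": {}, "generated_by": "unification-tool"}
print(import_profile("crm_contacts_raw.profile.json", local))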

Metadata Exchange

The ability of the DataOps team to rapidly compose, build, and deploy data unification, cataloging, and analytics pipelines, with plug-and-play components, seems to present a formidable systems integration challenge. In DataOps however, the focus of interoperability is not the need for closer systems integration—for example, with each tool possessing native support for each other tool—but rather the need for open interoperable metadata exchange formats. This objective reveals the second unique hallmark of interoperability in the DataOps stack, namely the ready exchange and enrichment of metadata.

To illustrate, consider a data element richly decorated with the following metadata:

Data type: numeric
To a data extraction tool, the data type is immediately valuable because knowing the data type might allow for faster extraction and removal of unnecessary type casting or type validation.

Description: Total spend
To a data unification tool, the description is immediately valuable because the description, when presented to an end user or algorithm, allows for more accurate mapping, matching, or categorization.

Format: X,XXX.XX
To a dashboarding tool, the data format is immediately valuable in presenting the raw data to an analyst in a meaningful, consumer-friendly manner.

To realize these benefits, each of our DataOps tools must be able both to preserve (pass through) this metadata and to enrich it. For example, if we’re lucky, the source system already contains this metadata on the data element. We then require our extraction tool to read, preserve, and pass this metadata to the unification tool. In turn, the unification tool preserves and passes this metadata to the cataloging tool. If we’re unlucky and the source system is without this metadata, the extraction tool can enrich the metadata by casting the element to numeric, the unification tool can map it to an existing data element with the description Total spend, and the dashboarding tool can apply a typical currency format X,XXX.XX.

In flowing from source, through extraction, to unification and dashboard, the metadata is preserved because each tool supports a common, interoperable metadata exchange format. Equally important, each tool also enriches the metadata and exposes it to every other tool. This interaction stands in stark contrast to the data integration practices espoused by schema-first methodologies. Although metadata pass-through and enrichment is secondary to data pass-through and enrichment (which itself remains a challenge for many tools today), it is a primary and distinguishing feature of a DataOps tool and an essential capability for realizing interoperability in the DataOps stack.
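
As a minimal sketch of what such an exchange might look like in practice, the following Python fragment carries the example element above through pass-through and enrichment. The dictionary keys mirror the metadata described earlier; the exchange structure, the helper function, and its fill-only-gaps rule are assumptions made purely for illustration.

import copy

# Hypothetical metadata record for the data element described above.
source_metadata = {
    "element": "total_spend",
    "data_type": "numeric",   # supplied by the source system
    "description": None,      # missing at the source
    "format": None,           # missing at the source
}

def pass_through_and_enrich(metadata: dict, tool: str, enrichments: dict) -> dict:
    """Preserve every upstream key and add this tool's enrichments,
    recording which tool contributed which value."""
    enriched = copy.deepcopy(metadata)
    for key, value in enrichments.items():
        if enriched.get(key) is None:   # fill gaps only; never overwrite upstream
            enriched[key] = value
            enriched.setdefault("enriched_by", {})[key] = tool
    return enriched

# The extraction tool preserves the upstream type (its own inference is a
# no-op here), the unification tool supplies the description by mapping to
# an existing element, and the dashboarding tool applies a display format.
md = pass_through_and_enrich(source_metadata, "extract-tool", {"data_type": "numeric"})
md = pass_through_and_enrich(md, "unification-tool", {"description": "Total spend"})
md = pass_through_and_enrich(md, "dashboard-tool", {"format": "X,XXX.XX"})
print(md)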

Automation

One simple paradigm for automation is that every common UI action or set of actions, and any bulk or data-scale equivalent of these actions, is also available via a well-formed API. In the DataOps toolkit, this meeting of the API-first ethos of Agile development with the pragmatism of DevOps means that any DataOps tool should furnish a suite of APIs that allow the complete automation of its tasks.

Broadly speaking, we can consider that these tasks are performed as part of either continuous or batch automation. Continuous automation concerns itself with a single phase, namely updating or refreshing a preexisting state. In contrast, batch automation covers use cases that have a well-defined beginning (usually an empty or zero state), middle, and end, and automation is required explicitly around each of these phases. The suite of APIs that facilitates automation must concern itself equally with continuous and batch automation.
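
One way to picture this paradigm is as a single, well-formed API surface that spans both modes. The minimal sketch below (in Python) is illustrative only; the interface name, method names, and their grouping are assumptions rather than any particular vendor's API.

from typing import Protocol

class DataOpsToolAPI(Protocol):
    """Illustrative API surface implied by the 'every UI action has an API
    equivalent' paradigm; every name here is an assumption."""

    # Batch automation: a well-defined beginning, middle, and end.
    def initialize(self, config: dict) -> None: ...
    def create_dataset(self, name: str, attributes: list[str]) -> str: ...
    def run_workflow(self, workflow_id: str) -> None: ...
    def shutdown(self, force: bool = False) -> None: ...

    # Continuous automation: update or refresh a preexisting state.
    def upsert_records(self, dataset_id: str, records: list[dict]) -> None: ...
    def refresh_results(self, dataset_id: str) -> None: ...
    def health(self) -> dict: ...

The two sections that follow ask, in effect, whether a given tool covers each half of such a surface.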

Continuous Automation

Just as the primary challenge of the DevOps team is to streamline the software release cycle to meet the demands of the Agile development process, so the objective of the DataOps team is to automate the continuous publishing of datasets and the refreshing of every tool’s results or view of those datasets.

To illustrate continuous automation in a DataOps project, let’s consider a data dashboarding tool and ask the following questions of its API suite. Does the tool have the following:

  • APIs that allow the updating of fundamental data objects such as datasets, attributes, and records?

  • APIs to update the definition of the dashboard itself to use a newly available attribute?

  • APIs to update the dashboard’s configuration?

  • A concept of internally and externally created or owned objects? For example, how does it manage conflict between user and API updates of the same object?

  • APIs for reporting health, state, versions, and up-to-dateness?

The ability of a tool to be automatically updated and refreshed is critical to delivering working data and meeting business-critical data timeliness requirements. The ease with which a dataset can be published and republished directly affects the feedback cycle with the data consumer and ultimately determines the DataOps team’s ability to realize the goal of shortened lead time between fixes and faster mean time to recovery (MTTR).
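
As an illustration, a continuous publish-and-refresh step against such a dashboarding tool might look like the following sketch, which uses the Python requests library against hypothetical endpoints. The URLs, payloads, and the spend-overview dashboard name are assumptions that stand in for whatever API suite the tool actually exposes.

import requests

BASE = "https://dashboard.example.com/api/v1"    # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <token>"}    # placeholder credential

def publish_and_refresh(dataset_id: str, records: list[dict], new_attribute: str) -> None:
    """Push new records, expose a newly available attribute on the dashboard
    definition, then trigger a refresh; each call stands in for a common UI
    action made scriptable."""
    # 1. Update fundamental data objects (records on a dataset).
    requests.post(f"{BASE}/datasets/{dataset_id}/records",
                  json=records, headers=HEADERS).raise_for_status()
    # 2. Update the dashboard definition to use the newly available attribute.
    requests.patch(f"{BASE}/dashboards/spend-overview",
                   json={"add_attribute": new_attribute},
                   headers=HEADERS).raise_for_status()
    # 3. Refresh, then report health and up-to-dateness.
    requests.post(f"{BASE}/dashboards/spend-overview/refresh",
                  headers=HEADERS).raise_for_status()
    status = requests.get(f"{BASE}/dashboards/spend-overview/status",
                          headers=HEADERS).json()
    print("dashboard up to date:", status.get("up_to_date"))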

Batch Automation

With most tools already providing sufficient capability to be continuously updated in an automatic, programmatic manner, it is easy to overlook the almost distinctly Agile need to spin up and tear down a tool or set of tools automatically.

Indeed, it is especially tempting to devalue batch automation as a one-off cost that need only be paid for a short duration at the very beginning, when setting up a longer-term, continuous pipeline. However, the ability of a tool to be automatically initialized, started, run, and shut down is a core requirement of the DataOps stack and one that is highly prized by the DataOps team. Under the tenet of automate everything, it is essential to achieving responsiveness to change and short time to delivery.

To illustrate, let’s consider batch automation features for a data unification tool. Does the tool have the following:

  • APIs that allow the creation of fundamental data objects such as datasets, attributes, and records?

  • APIs to import artifacts typically provided via the UI? For example, metadata about datasets, annotations and descriptions, configuration, and setup?

  • APIs to perform complex user interactions or workflows? For example, manipulating and transforming, mapping, matching, or categorizing data?

  • APIs for reporting backup, restore, and shutdown state? For example, can I ask the tool if shutdown is possible or if it must be forced?

A tool that can be automatically stood up from scratch and executed, where every step is codified, delivers on the goal of automate everything and unlocks the tenet of test everything. It is possible to not only spin up a complete, dataset-to-dashboard pipeline for performing useful work (e.g., data model trials), but also to test the pipeline programmatically, running unit tests and data flow tests (e.g., did the metadata in the source dataset pass through successfully to the dashboard?). Batch automation is critical to realizing repeatability in your DataOps pipelines and a low error rate for newly published datasets.
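
The sketch below codifies such a spin-up, run, test, and tear-down cycle in Python, mirroring the illustrative API surface sketched earlier. The in-memory stand-in tool, workflow ID, and dataset names are all assumptions made so the example is self-contained; in practice the same driver would call the unification tool's real batch APIs.

class InMemoryUnificationTool:
    """Stand-in used only to make this sketch runnable; in practice this
    would be a client for the unification tool's batch APIs."""

    def __init__(self):
        self.datasets = {}
        self.records = {}
        self.unified = {}
        self.up = False

    def initialize(self, config: dict) -> None:
        self.up = True

    def create_dataset(self, name: str, attributes: list[str]) -> str:
        self.datasets[name] = list(attributes)
        self.records[name] = []
        return name

    def upsert_records(self, dataset_id: str, records: list[dict]) -> None:
        self.records[dataset_id].extend(records)

    def run_workflow(self, workflow_id: str) -> None:
        self.unified = {name: list(recs) for name, recs in self.records.items()}

    def health(self) -> dict:
        return {"status": "ok" if self.up else "down"}

    def shutdown(self, force: bool = False) -> None:
        self.up = False

def batch_pipeline_trial(tool) -> None:
    """Spin up, load, run, test, and tear down, with every step codified."""
    tool.initialize({"workspace": "model-trial"})             # empty/zero state
    ds = tool.create_dataset("crm_contacts_raw", ["email", "total_spend"])
    tool.upsert_records(ds, [{"email": "a@example.com", "total_spend": 1200.50}])
    tool.run_workflow("unify-contacts")                       # the "middle"

    # Test everything: did the source attributes survive unification?
    assert set(tool.unified[ds][0]) == {"email", "total_spend"}
    assert tool.health()["status"] == "ok"

    tool.shutdown(force=False)                                # well-defined end

batch_pipeline_trial(InMemoryUnificationTool())

Run under a test harness, the same driver can target either the real tool or the stand-in, which is what makes data model trials and programmatic data flow tests repeatable.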

As the key components of the DataOps ecosystem continue to evolve—meeting the demands of ever-increasing data variety and velocity and offering greater and greater capabilities for delivering working data—the individual tools that are deemed best of breed for these components will be those that prioritize interoperability and automation in their fundamental design.
