Chapter 3. DataOps as a Discipline

DataOps, like DevOps, emerges from the recognition that separating the product—production-ready data—from the process that delivers it—operations—impedes quality, timeliness, transparency, and agility. The need for DataOps comes about because data consumption has changed dramatically over the past decade. Just as internet applications raised user expectations for the usability, availability, and responsiveness of applications, things like Google Knowledge Panel and Wikipedia have dramatically raised user expectations for the usability, availability, and freshness of data.

What’s more, with increased access to very usable self-service data preparation and visualization tools, there are also now many users within the enterprise who are ready and able to prepare data for their own use if official channels are unable to meet their expectations. In combination, these changes have created an environment in which continuing with the cost-laden, delay-plagued, opaque operations used to deliver data in the past is no longer acceptable. Taking a cue from DevOps, DataOps looks to combine the production and delivery of data into a single, Agile practice that directly supports specific business functions. The ultimate goal is to cost-effectively deliver timely, high-quality data that meets the ever-changing needs of the organization.

In this chapter, we review the history of DataOps, the problems it is designed to address, the tools and processes it uses, and how organizations can effectively make the transition to and gain the benefits of DataOps.

DataOps: Building Upon Agile

DataOps is a methodology that spans people, processes, tools, and services to enable enterprises to rapidly, repeatedly, and reliably deliver production data from a vast array of enterprise data sources to a vast array of enterprise data consumers.

DataOps builds on the many decades of accumulated wisdom received from Agile processes. It is worth taking a moment to highlight some key goals and tenets of Agile, how they have been applied to software, and how they can be applied to data. Agile software development arose from the observation that software projects that were run using traditional processes were plagued by the following:

  • High cost of delivery, long time to delivery, and missed deadlines

  • Poor quality, low user satisfaction, and failure to keep pace with ever-changing requirements

  • Lack of transparency into progress toward goals as well as schedule unpredictability

  • Anti-scaling in project size, where the cost per feature of large projects is higher than the cost per feature of small projects

  • Anti-scaling in project duration, where the cost of maintenance grows to overwhelm available resources

These are the same frustrations that plague so many data delivery projects today.

The Agile Manifesto

In establishing an approach that seeks to address each of these issues, the Agile community introduced several core tenets in an Agile Manifesto:

We value:

  1. Individuals and interactions over processes and tools

  2. Working software over comprehensive documentation

  3. Customer collaboration over contract negotiation

  4. Responding to change over following a plan

That is, while there is value in the items on the right, we value the items on the left more. Let’s briefly review each of these tenets, their impact on software development, and their expected impact on data delivery.

Tenet 2: Working software

I’ll start with tenet 2, because it really should be tenet 1: the goal of software engineering is to deliver working software. Everything else is secondary. With working software, users can accomplish their goals significantly more readily than they could without the software. This means that the software meets the users’ functional needs, quality needs, availability needs, serviceability needs, and so on. Documentation alone doesn’t enable users to accomplish their goals.

Similarly, the goal of data engineering is to produce working data; everything else is secondary. With working data, users can accomplish their goals significantly more readily than they could without the data. Ideally, data engineering teams will be able to adhere to principles of usability and data design that make documentation unnecessary for most situations.

The other three tenets are in support of this main tenet. They all apply equally well to a data engineering team, whose goal is to produce working data.

Tenet 1: Individuals and interactions

Software is written by people, not processes or tools. Good processes and tools can support people and help them be more effective, but neither processes nor tools can make mediocre engineers into great engineers. Conversely, poor processes or tools can reduce even the best engineers to mediocrity. The best way to get the most from your team is to support them as people, first, and to bring in tools and processes only as necessary to help them be more effective.

Tenet 3: Customer collaboration

When you try to capture your customers’ needs up front in a requirements “contract,” customers will push for a very conservative contract to minimize their risk. Building to this contract will be very expensive and still not likely meet customers’ real needs. The best way to determine whether a product meets your customer’s needs and expectations is to have the customer use the product and give feedback. Getting input as early and as often as possible ensures course corrections are as small as possible.

Tenet 4: Responding to change

Change is constant—in requirements, in process, in availability of resources, and so on—and teams that fail to adapt to these changes will not deliver software that works. No matter how good a plan is, it cannot anticipate the changes that will happen during execution. Rather than invest heavily in upfront planning, it is much better to plan only as much as necessary to ensure that the team is aligned and the goals are reasonable and then measure often to determine whether a course correction is necessary. Only by adapting swiftly to change can the cost of adaptation be kept small.

Agile Practices

The preceding has described the goal and tenets of Agile, but not what to actually do. There are many variations of the Agile process, but they share several core recommendations:

Deliver working software frequently

In days or weeks, not months or years, adding functionality incrementally until a release is completed.

Get daily feedback from customers (or customer representatives)

Gather feedback on what has been done so far.

Accept changing requirements

Be prepared to do so even late in development.

Work in small teams

Work in teams of three to seven people who are motivated, trusted, and empowered individuals, with all the skills required for delivery present on each team.

Keep teams independent

This means each team’s responsibilities span all domains, including planning, analysis, design, coding, unit testing, acceptance testing, releasing, and building and maintaining tools and infrastructure.

Continually invest in automation

You should aim to automate everything.

Continually invest in improvement

Invest in improving everything, including process, design, and tools.

These practices have enabled countless engineering teams to deliver timely, high-quality products, many of which we use every day. These same practices are now enabling data engineering teams to deliver the timely, high-quality data that powers applications and analytics. But there is another transition made in the software world that needs to be picked up in the data world. When delivering hosted applications and services, Agile software development is not enough. It does little good to rapidly develop a feature if it then takes weeks or months to deploy it, or if the application is unable to meet availability or other requirements due to inadequacy of the hosting platform. These are operations, and they require a skill set quite distinct from that of software development. The application of Agile to operations created DevOps, which exists to ensure that hosted applications and services not only can be developed, but also delivered in an Agile manner.

Agile Operations for Data and Software

Agile removed many barriers internal to the software development process and enabled teams to deliver production features in days, instead of years. For hosted applications in particular, the follow-on process of getting a feature deployed retained many of the same problems that Agile intended to address. Bringing development and operations into the same process, and often the same team, can reduce time-to-delivery down to hours or minutes. The principle has been extended to operations for nonhosted applications, as well, with similar effect. This is the core of DevOps. The problems that DevOps intends to address look very similar to those targeted by Agile software development:

  • Improved deployment frequency

  • Faster time to market

  • Lower failure rate of new releases

  • Shortened lead time between fixes

  • Faster mean time to recovery (MTTR) in the event of a new release crashing or otherwise disabling the current system

We can summarize most of these as availability—making sure that the latest working software is consistently available for use. To determine whether a process or organization is improving availability, you need something more transparent than percent uptime, and you need to be able to measure it continuously so that you know when you’re close, and when you’re deviating. Google’s Site Reliability Engineering team did some of the pioneering work looking at how to measure availability in this way, and distilled it into the measure of the fraction of requests that are successful. DevOps, then, has the goal of maximizing the fraction of requests that are successful, at minimum cost.
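
To make this concrete, here is a minimal sketch in Python of measuring availability as the fraction of requests that succeed, rather than as percent uptime; the Request record and the example log are illustrative, not drawn from any particular system.

    # Availability measured as the fraction of successful requests.
    from dataclasses import dataclass

    @dataclass
    class Request:
        name: str
        succeeded: bool

    def availability(requests: list[Request]) -> float:
        """Fraction of requests that were served successfully."""
        if not requests:
            return 1.0  # no demand, so nothing has failed
        ok = sum(1 for r in requests if r.succeeded)
        return ok / len(requests)

    log = [Request("query", True), Request("query", True), Request("update", False)]
    print(f"availability = {availability(log):.2%}")  # availability = 66.67%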

For an application or service, a request can be logging in, opening a page, performing a search, and so on. For data, a request can be a query, an update, a schema change, and so forth. These requests might come directly from users (for example, on an analysis team) or they could be made by applications or automated scripts. Data development produces high-quality data, whereas DataOps ensures that the data is consistently available, maximizing the fraction of requests that are successful.

DataOps Tenets

DataOps is an emerging field, whereas DevOps has been put into practice for many years now. We can use our depth of experience with DevOps to provide a guide for the developing practice of DataOps. There are many variations in DevOps, but they share a collection of core tenets:

  1. Think services, not servers

  2. Infrastructure as Code

  3. Automate everything

Let’s review these briefly, how they affect service availability, and the expected impact on data availability.

Tenet 1: Think services, not servers

When it comes to availability, there are many more options for making a service available than there are for making a server available. By abstracting services from servers, we open up possibilities such as replication, elasticity, failover, and more, each of which can enable a service to successfully handle requests under conditions where an individual server would not be successful, for example, under a sudden surge in load, or requests that come from broad geographic distribution.

This should make it clear why it is so important to think of data availability not as database server availability, but as the availability of Data as a Service (DaaS). The goal of the data organization is not to deliver a database, or a data-powered application, but the data itself, in a usable form. In this model, data is typically not delivered in a single form factor, but simultaneously in multiple form factors to meet the needs of different clients: RESTful web services to meet the needs of service-oriented applications; streams to meet the need of real-time dashboards and operations; and bulk data in a data lake for offline analytic use cases. Each of these delivery forms can have independent service-level objectives (SLOs), and the DataOps organization can track performance relative to those objectives when delivering data.
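
As a rough illustration, the following hypothetical Python sketch tracks each delivery form factor against its own SLO; the channel names, metrics, and targets are invented for the example.

    # Per-form-factor SLOs for Data as a Service (values are assumptions).
    SLOS = {
        "rest_api":  {"success_rate": 0.999, "p95_latency_ms": 200},
        "stream":    {"success_rate": 0.995, "max_lag_seconds": 60},
        "data_lake": {"success_rate": 0.99, "max_staleness_hours": 24},
    }

    def slo_violations(channel: str, observed: dict) -> list[str]:
        """Compare observed metrics for one delivery channel against its SLO."""
        target = SLOS[channel]
        issues = []
        if observed["success_rate"] < target["success_rate"]:
            issues.append("success rate below target")
        for key in ("p95_latency_ms", "max_lag_seconds", "max_staleness_hours"):
            if key in target and observed.get(key, 0) > target[key]:
                issues.append(f"{key} above target")
        return issues

    print(slo_violations("stream", {"success_rate": 0.997, "max_lag_seconds": 90}))
    # ['max_lag_seconds above target']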

Tenet 2: Infrastructure as Code

A service can’t be highly available if responding to an issue in its infrastructure depends on having the person with the right knowledge or skills available. You can’t increase the capacity of a service if the configuration of its services isn’t captured anywhere other than in the currently running instances. And you can’t trust that infrastructure will be correctly deployed if it requires a human to correctly execute a long sequence of steps. By capturing all the steps to configure and deploy infrastructure as code, not only can infrastructure changes be executed quickly and reliably by anyone on the team, but that code can be planned, tested, versioned, released, and otherwise take full advantage of the depth of experience we have with software development.

With Infrastructure as Code (IaC), deploying additional servers is a matter of running the appropriate code, dramatically reducing the time to deployment as well as the opportunity for human error. With proper versioning, if an issue is introduced in a new version of a deployment, we can roll back the deployment to a previous version while the issue is identified and addressed. To further minimize issues found in production, we can deploy infrastructure in staging and user acceptance testing (UAT) environments, with full confidence that redeploying in production will not bring any surprises. Using IaC enables operations to be predictable, reliable, and repeatable.

From the DataOps perspective, this means that everything involved in delivering data must be embodied in code. Of course, this includes infrastructure such as hosts, networking, and storage, but, importantly, this also covers everything to do with data storage and movement, from provisioning databases, to deploying ETL servers and data-processing workflows, to setting up permissions, access control, and enforcement of data governance policy. Nothing can be done as a one-off; everything must be captured in code that is versioned, tested, and released. Only by rigorously following this policy will data operations be predictable, reliable, and repeatable.
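
The following is a hedged, tool-agnostic sketch of what “everything as code” can look like for data infrastructure: a declarative, versioned specification that a deploy step reconciles against the current state. A real team would use an IaC tool for this; the spec fields and the apply function are invented for illustration.

    # A declarative spec for data infrastructure, kept in version control.
    DATA_PLATFORM_SPEC = {
        "version": "1.4.0",
        "databases": [{"name": "customers_db", "engine": "postgres", "storage_gb": 500}],
        "workflows": [{"name": "load_customers", "schedule": "0 * * * *"}],
        "grants": [{"role": "analyst", "dataset": "customers", "access": "read"}],
    }

    def apply(spec: dict, current_state: dict) -> list[str]:
        """Return the actions needed to reconcile the environment with the spec."""
        actions = []
        for section in ("databases", "workflows", "grants"):
            desired = {str(item) for item in spec[section]}
            existing = {str(item) for item in current_state.get(section, [])}
            for missing in desired - existing:
                actions.append(f"create {section[:-1]}: {missing}")
            for extra in existing - desired:
                actions.append(f"remove {section[:-1]}: {extra}")
        return actions

    # First deployment into an empty environment: everything gets created,
    # and rolling back means applying the previous version of the spec.
    print(apply(DATA_PLATFORM_SPEC, current_state={}))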

Tenet 3: Automate everything

Many of the techniques available for keeping services available will not work if they require a human in the loop. When there is a surge in demand, service availability will drop if deploying a new server requires a human to click a button. Deploying the latest software to production will take longer if a human needs to run the deployment script. Rather, all of these processes need to be automated. This pervasive automation unlocks the original goal of making working software highly available to users. With pervasive automation, new features are automatically tested both for correctness and acceptance; the test automation infrastructure is itself tested automatically; deployment of new features to production is automated; scalability and recovery of deployed services is automated (and tested, of course); and it is all monitored, every step of the way. This is what enables a small DevOps team to effectively manage a large infrastructure, while still remaining responsive.

Automation is what enables schema changes to propagate quickly through the data ecosystem. It is what ensures that responses to compliance violations can be made in a timely, reliable, and sustainable way. It is what ensures that we can uphold data freshness guarantees. And it is what enables users to provide feedback on how the data does or could better suit their needs so that the process of rapid iteration can be supported. Automation is what enables a small DataOps team to effectively keep data available to the teams, applications, and services that depend on it.
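
As one illustration, here is a small hypothetical sketch of an automated freshness check that raises alerts with no human in the loop; the dataset names and thresholds are assumptions, not prescriptions.

    # Automated data freshness check against per-dataset guarantees.
    from datetime import datetime, timedelta, timezone

    FRESHNESS_SLO = {"customers": timedelta(hours=1), "orders": timedelta(minutes=15)}

    def check_freshness(last_updated: dict[str, datetime]) -> list[str]:
        """Return alerts for datasets whose latest load violates the freshness SLO."""
        now = datetime.now(timezone.utc)
        return [
            f"ALERT: {name} exceeds its freshness SLO by {now - ts - FRESHNESS_SLO[name]}"
            for name, ts in last_updated.items()
            if now - ts > FRESHNESS_SLO[name]
        ]

    # In production this would run on a schedule and page the on-call rotation.
    print(check_freshness({"customers": datetime.now(timezone.utc) - timedelta(hours=3)}))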

DataOps Practices

The role of the operations team is to provide the applications, services, and other infrastructure used by the engineering teams to code, build, test, package, release, configure, deploy, monitor, govern, and gather feedback on their products and services. Thus, the operations team is necessarily interdisciplinary. Despite this breadth, there are concrete practices that apply across all these domains:

Apply Agile process

Short time-to-delivery and responsiveness to change (along with everything that comes with those requirements) are mandatory for the DataOps team to effectively support any other Agile team.

Integrate with your customer

The DataOps team has the advantage that the customers, the engineering teams they support, are in-house, and therefore readily available for daily interaction. Gather feedback at least daily. If it’s possible for DataOps and data engineering to be colocated, that’s even better.

Implement everything in code

This means host configuration, network configuration, automation, gathering and publishing test results, service installation and startup, error handling, and so on. Everything needs to be code.

Apply software engineering best practices

The full value of IaC is attained when that code is developed using the decades of accumulated wisdom we have in software engineering. This means using version control with branching and merging, automated regression testing of everything, clear code design and factoring, clear comments, and so on.

Maintain multiple environments

Keep development, acceptance testing, and production environments separate. Never test in production, and never run production from development. Note that, from the DataOps team’s point of view, both the data engineers’ development environment and the data engineers’ production environment count as production; the DataOps development environment is reserved for the DataOps team itself to develop new features and capabilities.

Integrate the toolchains

The different domains of operations require different collections of tools (“toolchains”). These toolchains need to work together for the team to be able to be efficient. Your data movement engine and your version control need to work together. Your host configuration and your monitoring need to work together. You will be maintaining multiple environments, but within each environment, everything needs to work together.

Test everything

Never deploy data if it hasn’t passed quality tests. Never deploy a service if it hasn’t passed regression tests. Automated testing is what allows you to make changes quickly, having confidence that problems will be found early, long before they get to production.
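
A minimal sketch of such a quality gate, with placeholder checks and field names, might look like this:

    # Promote data to production only if every quality test passes.
    def run_quality_tests(rows: list[dict]) -> dict[str, bool]:
        return {
            "non_empty": len(rows) > 0,
            "no_null_keys": all(r.get("customer_id") is not None for r in rows),
            "emails_well_formed": all("@" in str(r.get("email", "")) for r in rows),
        }

    def deploy_if_clean(rows: list[dict]) -> None:
        results = run_quality_tests(rows)
        failures = [name for name, passed in results.items() if not passed]
        if failures:
            raise RuntimeError(f"blocking deployment; failed tests: {failures}")
        print("all quality tests passed; promoting data to production")

    deploy_if_clean([{"customer_id": 1, "email": "a@example.com"}])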

These practices enable a small operations team to integrate tightly with data engineering teams so that they can work together to deliver the timely, high-quality data that powers applications and analytics.

DataOps Challenges

DataOps teams, particularly those working with Big Data, encounter some challenges that other operations teams do not.

Application Data Interface

When integrating software packages into a single product, software engineers take advantage of application programming interfaces (APIs), which specify a functional and nonfunctional contract. Software subsystems can be written to provide or consume an API, and can be independently verified using a stubbed implementation on the other side of the API. These independently developed subsystems can then be fit together and will interoperate thanks to the contractual clarity of the API. There is no such equivalent for data. What we would like is an application data interface (ADI), which specifies a structural and semantic model of data so that data providers and data consumers can be verified independently and then fit together and trusted to interoperate thanks to the contractual clarity of the ADI. There have been multiple attempts to standardize representation of data structure and semantics, but there is no widely accepted standard. In particular, the Data Definition Language (DDL) subset of SQL specifies structure and constraints of data, but not semantics. There are other standards for representing data semantics, but none have seen broad adoption. Therefore, each organization needs to independently select and employ tools to represent and check data models and semantics.
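
In the absence of a standard, a homegrown ADI can start as simply as the following sketch: a contract that captures structure plus a note of intended semantics, against which producers and consumers can each be verified independently. The field names and checks are illustrative, not a proposed standard.

    # A minimal "application data interface" for a customer record.
    CUSTOMER_ADI = {
        "customer_id": {"type": int, "semantics": "stable surrogate key, never reused"},
        "email": {"type": str, "semantics": "well-formed address, unique per customer"},
        "ltv_usd": {"type": float, "semantics": "lifetime value in US dollars, >= 0"},
    }

    def conforms(record: dict, adi: dict) -> bool:
        """Structural check; semantic checks need additional predicates per field."""
        return all(
            field in record and isinstance(record[field], spec["type"])
            for field, spec in adi.items()
        )

    print(conforms({"customer_id": 7, "email": "a@example.com", "ltv_usd": 120.0},
                   CUSTOMER_ADI))  # True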

Data Processing Architecture

There are two fundamental modes for data: snapshots, represented in tables, and transactions, represented in streams. The two support different use cases, and, unfortunately, they differ in every respect, from structure, to semantics, to queries, to tools and infrastructure. Data consumers want both. There are well-established methods of modeling the two in the data warehousing world, but with the ascendency of data lakes, we are having to discover new methods of supporting them. Fortunately, the data warehousing lessons and implementation patterns transfer relatively cleanly to the technologies and contexts of contemporary data lakes, but because there is not yet good built-in tool support, the DataOps team will be confronted with the challenge of assembling and configuring the various technologies to deliver data in these modes.

There are now multiple implementation patterns that purport to handle both snapshot and streaming use cases while enabling a DataOps team to synchronize the two to a certain degree. Prominent examples are the Lambda Architecture and Kappa Architecture. Vendor toolchains do not yet have first-class support for such implementation patterns, so it is the task of the DataOps team to determine which architecture will meet their organization’s needs and to deploy and manage it.
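
As a much-simplified illustration of the Lambda-style pattern, the following sketch serves a batch snapshot and overlays the records that have arrived on the stream since the snapshot was taken; real implementations involve far more machinery, and the record shapes here are invented.

    # Merge a batch snapshot (tables) with newer stream updates (transactions).
    snapshot = {"cust-1": {"status": "active"}, "cust-2": {"status": "active"}}
    stream_updates = [("cust-2", {"status": "churned"}), ("cust-3", {"status": "active"})]

    def serving_view(snapshot: dict, updates: list) -> dict:
        view = dict(snapshot)          # batch layer result
        for key, record in updates:    # speed layer overlays newer data
            view[key] = record
        return view

    # cust-2 reflects the stream update; cust-3 arrived after the snapshot.
    print(serving_view(snapshot, stream_updates))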

Query Interface

Data is not usable without a query interface. A query interface is a type of API, so data consumers can be written and verified against an abstract interface and then run against any provider of that API. Unfortunately, most query interfaces are vendor or vendor/version specific, and the vendors provide only one implementation of the query interface, so much of the benefit of writing to an API is lost. SQL is an attempt to create a standard data query API, but there is enough variation between vendor implementations that only the simplest of queries are compatible across vendors, and attaining good performance always requires use of vendor-specific language extensions.

Thus, even though we want to focus on DaaS, independent of any particular vendor platform, the current reality is that the vendor and version of most query interfaces are visible to end users and become part of the published interface of the data infrastructure. This impedes upgrades and makes it nearly impossible to change vendors.

This problem is compounded by the fact that different data consumers require different kinds of query interface to meet their needs. There are three very different modes of interacting with data, and the DataOps team needs to provide interfaces for all of them:

  • A REST interface to find, fetch, and update individual or small groups of records

  • A batch query interface that supports aggregation over large collections of data

  • A streaming interface that supports real-time analytics and alerting

The infrastructure, technology, and design of systems to support each of these kinds of query interface is very different. Many vendors provide only one or two of them and leave much of the complexity of deployment up to the DataOps team. The DataOps team needs to take this into consideration when designing their overall data processing architecture.
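
One way to limit that exposure, sketched below under assumed names, is to have consumers code against a single abstract interface while the DataOps team supplies separate implementations backed by the REST service, the batch engine, and the stream.

    # An abstract query interface that hides the three backend form factors.
    from abc import ABC, abstractmethod
    from typing import Iterable

    class CustomerData(ABC):
        @abstractmethod
        def get(self, customer_id: str) -> dict: ...        # REST-style record access

        @abstractmethod
        def aggregate(self, query: str) -> list[dict]: ...  # batch analytics

        @abstractmethod
        def subscribe(self) -> Iterable[dict]: ...          # streaming consumption

    # Consumers are written against CustomerData; implementations wrapping a
    # REST service, a warehouse, and a message bus can be swapped underneath.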

Resource Intensive

Even moderate-scale data places significant demands on infrastructure, so provisioning is another DataOps challenge. DataOps needs to consider data storage, movement, query processing, provenance, and logging. Storage must be provisioned for multiple releases of data as well as for different environments. Compute must be provisioned intelligently, to keep data transfers within acceptable limits. Network must be provisioned to support the data transfers that cannot be avoided. Although provisioning to support resource-intensive loads is not unique to DataOps, the nature of data is such that DataOps teams will have very little runway relative to other kinds of teams before they begin to run into difficult challenges and trade-offs.
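
A back-of-the-envelope calculation shows how quickly the numbers grow; every figure below is an assumption chosen only for illustration.

    # Storage multiplies across retained releases, environments, and replicas.
    raw_tb = 5                  # raw data volume
    retained_releases = 3       # versions of delivered data kept online
    environments = 3            # development, UAT, production
    replication_factor = 3      # storage-level redundancy

    provisioned_tb = raw_tb * retained_releases * environments * replication_factor
    print(f"{provisioned_tb} TB provisioned for {raw_tb} TB of raw data")  # 135 TB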

Schema Change

Vendors change data with every release. Analysts require data changes for every new analytic or visualization. These modifications put schemas, and therefore ADIs, in a state of perpetual change. Each change might require adjustment to the entire depth of the associated data pipelines and applications. Managing the entire DataOps ecosystem as versioned, tested code, with clear separation between development and production environments, makes it possible to respond quickly to these changes, with confidence that problems will be caught quickly. Unfortunately, many tools still assume that schemas change slowly or not at all, and the DataOps team must implement responsiveness to schema change outside these tools. Good factoring of code to centralize schema definition is the only way to keep up with this rapid pace of change.
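
The following sketch illustrates what centralizing the schema definition can look like: the schema lives in one place, and both the table DDL and the row validation are derived from it, so a schema change propagates from a single edit. The column names are examples only.

    # One central schema definition; everything downstream derives from it.
    CUSTOMER_SCHEMA = {"customer_id": "BIGINT", "email": "TEXT", "signup_date": "DATE"}

    def create_table_ddl(table: str, schema: dict) -> str:
        cols = ", ".join(f"{name} {dtype}" for name, dtype in schema.items())
        return f"CREATE TABLE {table} ({cols})"

    def validate_row(row: dict, schema: dict) -> bool:
        return set(row) == set(schema)

    # Adding a column to CUSTOMER_SCHEMA changes the DDL and the validation
    # together, instead of in two places that can silently drift apart.
    print(create_table_ddl("customers", CUSTOMER_SCHEMA))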

Governance

Regulations from both government and industry cover data access, retention, traceability, accountability, and more. DataOps must support these regulations and provide alerting, logging, provenance, and so on throughout the data-processing infrastructure. Data governance tools are rapidly maturing, but interoperability between governance tools and other data infrastructure is still a significant challenge. The DataOps team will need to bridge the gaps between these toolchains to provide the coverage required by regulation.
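
Much of that bridging work amounts to glue such as the following hypothetical sketch, which records an auditable provenance entry for each data access; the field names and policy label are invented.

    # Record an audit entry for every data access, for traceability.
    import json, time

    def log_access(user: str, dataset: str, purpose: str, policy: str) -> str:
        entry = {
            "ts": time.time(),
            "user": user,
            "dataset": dataset,
            "purpose": purpose,
            "policy": policy,
        }
        # In practice this line would be appended to an immutable audit store.
        return json.dumps(entry)

    print(log_access("analyst_42", "customers", "churn analysis", "pii-restricted"))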

The Agile Data Organization

DataOps in conjunction with Agile data engineering builds a next-generation data engineering organization. The goal of DataOps is to extend the Agile process through the operational aspects of data delivery so that the entire organization is focused on timely delivery of working data. Analytics is a major consumer of data, and DataOps in the context of Agile analytics has received quite a bit of attention. Other consumers also substantially benefit from DataOps, including governance, operations, security, and so forth. By combining the engineering skills that are able to produce the data with the operations skills that are able to make it available, this team is able to cost-effectively deliver timely, high-quality data that meets the ever-changing needs of the data-driven enterprise.

This cross-functional team will now be able to deliver several key capabilities to the enterprise:1

Source data inventory

Data consumers need to know what raw material is available to work with. What are the datasets, and what attributes do they contain? On what schedule is the source updated? What governance policies are they subject to? Who is responsible for handling issues? All of these questions need to be answered by the source data inventory (a minimal sketch of such an inventory entry appears after this list).

Data movement and shaping

Data needs to get from the sources into the enriched, cleaned forms that are appropriate for operations. This requires connectivity, movement, and transformation. All of these operations need to be logged, and the full provenance of the resulting data needs to be recorded.

Logical models of unified data

Operations need to run on data models of entities that are tied to the business and are well understood. These models need to be concrete enough to enable practical use, while maintaining flexibility to accommodate the continuous change in the available and needed data.

Unified data hub

The hub is a central location where users can find, access, and curate data on key entities—suppliers, customers, products, and more—that powers the entire organization. The hub provides access to the most complete, curated, and up-to-date information on these entities, and also identifies the provenance, consumers, and owners of that information.

Feedback

At time of use, data quality issues become extremely transparent, so capturing feedback at point of use is critical to enabling the highest quality data. Every data consumer needs a readily accessible feedback mechanism, powered by the Unified Data Hub. This ensures that feedback can be incorporated reliably and in the timeliest manner.
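
As noted under the source data inventory above, a minimal, hypothetical sketch of the metadata behind a single inventory entry might look like this; the dataset, schedule, policies, and contacts are all invented.

    # The questions an inventory entry must answer, captured as metadata.
    SOURCE_INVENTORY_ENTRY = {
        "dataset": "crm.customers",
        "attributes": ["customer_id", "email", "signup_date"],
        "update_schedule": "daily at 02:00 UTC",
        "governance_policies": ["PII", "delete-on-request"],
        "owner": "crm-data-team@example.com",
        "issues_contact": "dataops-oncall@example.com",
    }

    def summarize(entry: dict) -> str:
        return (f"{entry['dataset']} updates {entry['update_schedule']}; "
                f"subject to {', '.join(entry['governance_policies'])}; "
                f"issues go to {entry['issues_contact']}")

    print(summarize(SOURCE_INVENTORY_ENTRY))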

Combining DataOps with your Agile data engineering organization will allow you to achieve the transformational analytic outcomes that are so often sought, but that so frequently stumble on outdated operational practices and processes. Quickly and reliably responding to the demands presented by the vast array of enterprise data sources and the vast array of consumption use cases will build your “company IQ.” DataOps is the transformational change data engineering teams have been waiting for to fulfill their aspirations of enabling their business to gain analytic advantage through the use of clean, complete, current data.

1 For more on this, see “DataOps: Building a Next Generation Data Engineering Organization” by Andy Palmer and Liam Cleary.
