Chapter 9. Operator Philosophy

We’ve noted the problems Operators aim to solve, and you’ve stepped through detailed examples of how to build Operators with the SDK. You’ve also seen how to distribute Operators in a coherent way with OLM. Let’s try to connect those tactics to the strategic ideas that underpin them to understand an existential question: what are Operators for?

The Operator concept descends from Site Reliability Engineering (SRE). Back in Chapter 1, we talked about Operators as software SREs. Let’s review some key SRE tenets to understand how Operators apply them.

SRE for Every Application

SRE began at Google in response to the challenges of running large systems with ever-increasing numbers of users and features. A key SRE objective is allowing services to grow without forcing the teams that run them to grow in direct proportion. To run systems at dramatic scale without a team of dramatic size, SREs write code to handle deployment, operations, and maintenance tasks. SREs create software that runs other software, keeps it running, and manages it over time. SRE is a wider set of management and engineering techniques with automation as a central principle. You may have heard its goal referred to by different names, like “autonomous” or “self-driving” software. In the Operator Maturity Model we introduced in Figure 4-1, we call it “Auto Pilot.”

Operators and the Operator Framework make it easier to implement this kind of automation for applications that run on Kubernetes. Kubernetes orchestrates service deployments, making some of the work of horizontal scaling or failure recovery automatic for stateless applications. It represents distributed system resources as API abstractions. Using Operators, developers can extend those practices to complex applications.

The well-known “SRE book” Site Reliability Engineering (O’Reilly), by Betsy Beyer et al. (eds.), is the authoritative guide to SRE principles. Google engineer Carla Geisser’s comments in it typify the automation element of SRE: “If a human operator needs to touch your system during normal operations, you have a bug.”1 SREs write code to fix those bugs. Operators are a logical place to program those fixes for a broad class of applications on Kubernetes. An Operator reduces human intervention bugs by automating the regular chores that keep its application running.

Toil Not, Neither Spin

SRE tries to reduce toil by creating software to perform the tasks required to operate a system. Toil is defined in this context as work that is “automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”2

Automatable: Work Your Computer Would Like

Work is automatable if a machine could do it. If a task needs the discernment of human judgment, a machine can’t do it. For example, expense reports are subjected to a variety of machine-driven boundary checking, but usually some final human review is required—of items the automated process flagged as out of bounds, if not of every receipt. The approval of reports within bounds may be automatable; the final acceptance or rejection of out-of-bounds cases may not. Work that could be automated by software should be automated by software if it is also repetitive. The cost of building software to perform a repetitive task can be amortized over a lifetime of repetitions.

Running in Place: Work of No Enduring Value

It can be uncomfortable to think of some work as having no value, but in SRE terms, work is “devoid of enduring value” if doing the work doesn’t change the service. Backing up a database server is one example. The database doesn’t go faster, serve more requests, or become inherently more reliable when you back it up. It also doesn’t stop working. Yet despite having no enduring value, backups are clearly worth doing. This kind of work often makes a good job for an Operator.

Growing Pains: Work That Expands with the System

You might design a service so that it scales in the horizontal plane, to serve more requests or run more instances of the service. But if adding a new instance requires an engineer to configure computers and wire them to a network, scaling the service is anything but automatic. In the worst cases of this kind of toil, ops work might scale linearly with your service. Every 10% of service growth—10% more users, 10% more requests per second, or a new feature that uses 10% more CPU—means 10% more custodial labor.

Manual scaling: Just like in the bad old days

Imagine running the stateless web server from Chapter 1. You deploy three instances on three VMs. To add more web server capacity, you spin up new VMs, assign them (unique) IP addresses, and assign (per-IP) ports where the web server binaries listen. Next, you inform the load balancer of the new endpoints so it can route some requests there.

As designed and provisioned, it’s true that your simple stateless web server can grow with demand. It can serve more users and add more features by spreading an increasing load over multiple instances. But the team that runs the service must grow along with it, and the effect worsens as the system gets larger: adding one VM barely increases the capacity of a fleet of a thousand, so each meaningful expansion means manually provisioning many machines at once.

Automating horizontal scaling: Kubernetes replicas

If you deploy your stateless web server on Kubernetes instead, you can scale it up and down with little more than a kubectl command. This is an example of Kubernetes as an implementation of SRE’s automation principles at the platform level. Kubernetes abstracts the infrastructure where the web servers run and the IP addresses and ports through which they serve connections. You don’t have to configure each new web server instance when scaling up, or deliberately free IPs from your range when scaling down. You don’t have to program the load balancer to deliver traffic to a new instance. Software does those chores instead.
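
For example, if the web server runs as a Kubernetes Deployment, growing it from three replicas to ten takes one command (the Deployment name staticweb here is illustrative, standing in for whatever your web server’s Deployment is called):

    $ kubectl scale deployment staticweb --replicas=10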

Operators: Kubernetes Application Reliability Engineering

Operators extend Kubernetes, applying its principle of automation to the complex, stateful applications that run on the platform. Consider an Operator that manages an application with its own notions of clustering. When the etcd Operator replaces a failed etcd cluster member, it arranges the new pod’s membership by configuring both the new pod and the existing cluster with the right endpoints and authentication.

If you are on a team responsible for managing internal services, Operators will enable you to capture expert procedures in software and expand the system’s “normal operations”: that is, the set of conditions it can handle automatically. If you’re developing a Kubernetes native application, an Operator lets you think about how users toil to run your app and save them the trouble. You can build Operators that not only run and upgrade an application, but respond to errors or slowing performance.

Control loops in Kubernetes watch resources and react when they don’t match some desired state. Operators let you customize a control loop for resources that represent your application. The first Operator concerns are usually automatic deployment and self-service provisioning of the operand. Beyond that first level of the maturity model, an Operator should know its application’s critical state and how to repair it. The Operator can then be extended to observe key application metrics and act to tune, repair, or report on them.
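
To make the shape of such a loop concrete, here is a minimal reconcile sketch using a recent controller-runtime signature. A real Operator would watch its own custom resource kind; the Deployment and the replica-count check here simply stand in for your application’s convergence logic:

    package controllers

    import (
        "context"

        appsv1 "k8s.io/api/apps/v1"
        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    type Reconciler struct {
        client.Client
    }

    func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        // Fetch the resource that triggered this pass of the loop.
        var dep appsv1.Deployment
        if err := r.Get(ctx, req.NamespacedName, &dep); err != nil {
            return ctrl.Result{}, client.IgnoreNotFound(err)
        }

        // Compare observed state to desired state, then act to converge them.
        desired := int32(3)
        if dep.Spec.Replicas == nil || *dep.Spec.Replicas != desired {
            dep.Spec.Replicas = &desired
            if err := r.Update(ctx, &dep); err != nil {
                return ctrl.Result{}, err
            }
        }
        return ctrl.Result{}, nil
    }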

Managing Application State

An application often has internal state that needs to be synchronized or maintained between replicas. Once an Operator handles installation and deployment, it can move farther along the maturity model by keeping such shared state in line among a dynamic group of pods. Any application with its own concept of a cluster, such as many databases and file servers, has this kind of shared application state. It may include authentication resources, replication arrangements, or writer/reader relationships. An Operator can configure this shared state for a new replica, allowing it to expand or restore the application’s cluster with new members. An Operator might also repair or reconfigure external resources its application requires: consider, for example, updating an external load balancer’s routing rules as replicas die and new ones replace them.
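
As a sketch of that membership step for etcd specifically, the go.etcd.io client exposes a real MemberAdd call; an Operator can register a replacement pod’s peer URL with the surviving members before the new process starts. The surrounding function is simplified, not the etcd Operator’s actual code:

    package operator

    import (
        "context"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    // addMember registers a replacement member's peer URL with the
    // surviving cluster and returns the new member's ID. The new etcd
    // pod is then started with this membership already in place.
    func addMember(ctx context.Context, endpoints []string, newPeerURL string) (uint64, error) {
        cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints})
        if err != nil {
            return 0, err
        }
        defer cli.Close()

        resp, err := cli.MemberAdd(ctx, []string{newPeerURL})
        if err != nil {
            return 0, err
        }
        return resp.Member.ID, nil
    }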

Golden Signals Sent to Software

Beyer et al. suggest monitoring the “four golden signals”3 for the clearest immediate sense of a system’s health. These characteristics of a service’s basic operation are a good place to start planning what your Operator should watch. In the SRE book that popularized them, golden signals convey something about a system’s state important enough to trigger a call to a human engineer.4 When designing Operators, you should think of anything that might result in a call to a person as a bug you can fix.

Site Reliability Engineering lists the four golden signals as latency, traffic, errors, and saturation.5 Accurate measurements of these four areas, adapted to the metrics that best represent a particular application’s condition, ensure a reasonable understanding of the application’s health. An Operator can monitor these signals and take application-specific actions when they depict a known condition, problem, or error. Let’s take a closer look:

Latency

Latency is how long it takes to do something. It is commonly understood as the elapsed time between a request and its completion. For instance, in a network, latency is measured as the time it takes to send a packet of data between two points. An Operator might measure application-specific, internal latencies like the lag time between actions in a game client and responses in the game engine.

Traffic

Traffic measures how frequently a service is requested. HTTP requests per second is the standard measurement of web service traffic. Monitoring regimes often split this measurement between static assets and those that are dynamically generated. It makes more sense to monitor something like transactions per second for a database or file server.

Errors

Errors are failed requests, like an HTTP 500-series error. In a web service, you might have an HTTP success code but see scripting exceptions or other client-side errors on the successfully delivered page. It may also be an error to exceed some latency guarantee or performance policy, like a guarantee to respond to any request within a time limit.

Saturation

Saturation is a gauge of a service’s consumption of a limited resource. These measurements focus on the most limited resources in a system, typically CPU, memory, and I/O. There are two key ideas in monitoring saturation. First, performance gets worse even before a resource is 100% utilized. For instance, some filesystems perform poorly when more than about 90% full, because the time it takes to create a file increases as available space decreases. Because of similar effects in nearly any system, saturation monitors should usually respond to a high-water mark of less than 100%. Second, measuring saturation can help you anticipate some problems before they happen. Dividing a file service’s free space by the rate at which an application writes data lets your Operator estimate the time remaining until storage is full.
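
That last estimate is simple division. A toy helper in Go, assuming the Operator can sample free space and a recent average write rate:

    package saturation

    import (
        "math"
        "time"
    )

    // timeToFull estimates how long until storage is exhausted at the
    // current write rate. Both arguments are measured in bytes.
    func timeToFull(freeBytes, writeBytesPerSec float64) time.Duration {
        if writeBytesPerSec <= 0 {
            return time.Duration(math.MaxInt64) // not filling at the current rate
        }
        return time.Duration(freeBytes / writeBytesPerSec * float64(time.Second))
    }

An Operator could re-check this estimate on each reconciliation and act, or alert, when it drops below a threshold.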

Operators can iterate toward running your service on auto pilot by measuring and reacting to golden signals that demand increasingly complex operations chores. Apply this analysis each time your application needs human help, and you have a basic plan for iterative development of an Operator.

Seven Habits of Highly Successful Operators

Operators grew out of work at CoreOS during 2015 and 2016. Experience with the Operators built there, and continuing at Red Hat and in the wider community, has helped refine seven guidelines set out as the concept of Kubernetes Operators solidified:6

  1. An Operator should run as a single Kubernetes deployment.

    You installed the etcd Operator in Chapter 2 from one manifest, without the OLM machinery introduced in Chapter 8. While you provide a CSV and other assets to make an OLM bundle for an Operator, OLM still uses that single manifest to deploy the Operator on your behalf.

    To illustrate this: although you usually need to configure RBAC and a service account first, you can add the etcd Operator to a Kubernetes cluster with a single command, because the Operator itself is just a deployment:

    $ kubectl create -f https://raw.githubusercontent.com/kubernetes-operators-book/chapters/master/ch03/etcd-operator-deployment.yaml
    
  2. Operators should define new custom resource types on the cluster.

    Think of the etcd examples back in Chapter 2. You created a CRD, and in it you defined a new kind of resource, the EtcdCluster. That kind represents instances of the operand, a running etcd cluster managed by the Operator. Users create new application instances by creating new custom resources of the application’s kind.
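
    In Go, such a kind is usually defined as a pair of structs. This sketch mirrors the size and version fields the EtcdCluster spec exposes, though the real etcd Operator’s definitions carry more detail:

    package v1beta2

    import (
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // EtcdClusterSpec holds the desired state a user declares in a CR.
    type EtcdClusterSpec struct {
        Size    int    `json:"size"`    // number of etcd members to run
        Version string `json:"version"` // etcd version to deploy
    }

    // EtcdCluster is the custom resource kind the Operator watches.
    type EtcdCluster struct {
        metav1.TypeMeta   `json:",inline"`
        metav1.ObjectMeta `json:"metadata,omitempty"`
        Spec              EtcdClusterSpec `json:"spec,omitempty"`
    }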

  3. Operators should use appropriate Kubernetes abstractions whenever possible.

    Don’t write new code when API calls can do the same thing in a more consistent and widely tested manner. Some quite useful Operators do little more than manipulate some set of standard resources in a way that suits their application.
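
    For instance, the book’s Visitors Site Operator creates standard Deployments for its components rather than scheduling pods by hand. A simplified sketch of that pattern follows; the names and image tag echo the book’s example app but are illustrative here:

    package controllers

    import (
        "context"

        appsv1 "k8s.io/api/apps/v1"
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // deployOperand leans on the standard Deployment abstraction:
    // Kubernetes handles scheduling, restarts, and scaling of the pods.
    func deployOperand(ctx context.Context, c client.Client, ns string, replicas int32) error {
        labels := map[string]string{"app": "visitors-backend"}
        dep := &appsv1.Deployment{
            ObjectMeta: metav1.ObjectMeta{Name: "visitors-backend", Namespace: ns, Labels: labels},
            Spec: appsv1.DeploymentSpec{
                Replicas: &replicas,
                Selector: &metav1.LabelSelector{MatchLabels: labels},
                Template: corev1.PodTemplateSpec{
                    ObjectMeta: metav1.ObjectMeta{Labels: labels},
                    Spec: corev1.PodSpec{Containers: []corev1.Container{{
                        Name:  "visitors-service",
                        Image: "jdob/visitors-service:1.0.0", // the book's sample backend image
                    }}},
                },
            },
        }
        return c.Create(ctx, dep)
    }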

  4. Operator termination should not affect the operand.

    When an Operator stops or is deleted from the cluster, the application it manages should continue to function. Return to your cluster and delete either the etcd or the Visitors Site Operator. While you won’t have automatic recovery from failures, you’ll still be able to use the application features of the operand in the absence of the Operator. You can visit the Visitors Site or retrieve a key-value pair from etcd even when the respective Operator isn’t running.

    Note that removing a CRD does affect the operand application. In fact, deleting a CRD will in turn delete its CR instances.

  5. An Operator should understand the resource types created by any previous versions.

    Operators should be backward compatible with the structures of their predecessors. This places a premium on designing carefully and for simplicity, because the resources you define in version 1 will necessarily live on.

  6. An Operator should coordinate application upgrades.

    Operators should coordinate upgrades of their operands, potentially including rolling upgrades across an application cluster and almost certainly including the ability to roll back to a previous version when there is a problem. Keeping software up to date is necessary toil, because only the latest software has the latest fixes for bugs and security vulnerabilities. Automating this upgrade toil is an ideal job for an Operator.
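
    The coordination logic can be simple even when the mechanics are not. A toy sketch, with hypothetical arguments rather than the book’s types: upgrade one member per reconcile pass, requeueing until every member runs the desired version:

    package controllers

    import (
        ctrl "sigs.k8s.io/controller-runtime"
    )

    // planUpgrade picks the next member to upgrade, one per reconcile
    // pass, so the operand keeps serving while it converges on the
    // desired version. It returns -1 when no member needs an upgrade.
    func planUpgrade(desiredVersion string, memberVersions []string) (int, ctrl.Result) {
        for i, v := range memberVersions {
            if v != desiredVersion {
                // Upgrade member i, then requeue to visit the next one.
                return i, ctrl.Result{Requeue: true}
            }
        }
        return -1, ctrl.Result{}
    }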

  7. Operators should be thoroughly tested, including chaos testing.

    Your application and its relationship to its infrastructure constitute a complex distributed system. You’re going to trust your Operator to manage that system. Chaos testing intentionally causes failures of system components to discover unexpected errors or degradation. It’s good practice to build a test suite that subjects your Operator to simulated errors and the outright disappearance of pods, nodes, and networking to see where failures arise or cascade between components as their dependencies collapse beneath them.
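
    Such a suite can start small. This sketch, assuming a controller-runtime client in a test harness, deletes one operand pod at random; the test then asserts that the Operator restores the desired state within a deadline:

    package e2e

    import (
        "context"
        "math/rand"

        corev1 "k8s.io/api/core/v1"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // killRandomPod deletes one operand pod chosen at random. A chaos
    // test calls this, then polls until the Operator has restored the
    // operand to its desired size, failing if a deadline passes first.
    func killRandomPod(ctx context.Context, c client.Client, ns string, selector map[string]string) error {
        var pods corev1.PodList
        if err := c.List(ctx, &pods, client.InNamespace(ns), client.MatchingLabels(selector)); err != nil {
            return err
        }
        if len(pods.Items) == 0 {
            return nil // nothing to kill
        }
        victim := pods.Items[rand.Intn(len(pods.Items))]
        return c.Delete(ctx, &victim)
    }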

Summary

Operators tend to advance through phases of maturity ranging from automatic installs, through seamless application upgrades, to a steady normal state of “auto pilot” where they react to and correct emergent issues of performance and stability in their operands. Each phase aims to end a little more human toil.

Building an Operator to distribute, deploy, and manage your application makes it easier to run on Kubernetes and lets the application leverage Kubernetes features. An Operator that follows the seven habits outlined here is readily deployed and can itself be managed through its lifecycle by OLM. Such an Operator makes its operand easier to run, manage, and potentially implement. By monitoring its application’s golden signals, an Operator can make informed decisions and free engineers from rote operations tasks.

1 Beyer et al. (eds.), Site Reliability Engineering, 119.

2 Beyer et al. (eds.), Site Reliability Engineering, 120.

3 Beyer et al. (eds.), Site Reliability Engineering, 139.

4 Beyer et al. (eds.), Site Reliability Engineering, 140.

5 Beyer et al. (eds.), Site Reliability Engineering, 139.

6 Brandon Phillips, “Introducing Operators,” CoreOS Blog, November 3, 2016, https://oreil.ly/PtGuh.
