Testing high availability

Testing high availability takes planning and a deep understanding of your system. The goal of every test is to reveal flaws in the system's design and/or implementation, and to provide good enough coverage that, if the tests pass, you'll be confident that the system behaves as expected.

In the realm of reliability and high availability, this means you need to figure out ways to break the system and watch it put itself back together.

This requires several pieces, as follows:

  • A comprehensive list of possible failures (including reasonable combinations)
  • A clear definition of how the system should respond to each failure
  • A way to induce the failure
  • A way to observe how the system reacts

None of these pieces is trivial. In my experience, the best approach is to work incrementally and come up with a relatively small number of generic failure categories and generic responses, rather than an exhaustive, ever-changing list of low-level failures.

For example, a generic failure category is node-unresponsive; the generic response could be rebooting the node. The failure can be induced by stopping the node's VM (if it is a VM), and the observation should be that, while the node is down, the system still functions properly according to standard acceptance tests, and that once the node comes back up, the system returns to normal. There may be many other things you want to verify, such as whether the problem was logged, whether relevant alerts went out to the right people, and whether various stats and reports were updated.
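
The node-unresponsive scenario above can be scripted end to end. The following is a minimal sketch, assuming the kubernetes and requests Python packages, a local kubeconfig pointing at the staging cluster, and hypothetical stop_vm/start_vm helpers that wrap your cloud provider's API; SERVICE_URL is a made-up endpoint standing in for your standard acceptance tests:

```python
import time

import requests
from kubernetes import client, config

SERVICE_URL = "http://staging.example.com/healthz"  # hypothetical acceptance endpoint


def node_is_ready(v1, name):
    """Return True if the node reports the Ready condition as True."""
    node = v1.read_node(name)
    return any(c.type == "Ready" and c.status == "True"
               for c in node.status.conditions)


def acceptance_check():
    """Stand-in for the standard acceptance tests mentioned above."""
    return requests.get(SERVICE_URL, timeout=5).status_code == 200


def wait_for_ready(v1, name, timeout, interval=15):
    """Poll the node's Ready condition until it is True or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if node_is_ready(v1, name):
            return True
        time.sleep(interval)
    return False


def test_node_unresponsive(node_name, stop_vm, start_vm):
    """Induce the node-unresponsive failure and observe the system's reaction."""
    config.load_kube_config()
    v1 = client.CoreV1Api()

    stop_vm(node_name)                       # induce the failure
    time.sleep(60)                           # give the cluster time to notice

    assert not node_is_ready(v1, node_name)  # the failure took effect
    assert acceptance_check()                # the system still serves traffic

    start_vm(node_name)                      # the generic response: bring the node back
    assert wait_for_ready(v1, node_name, timeout=600), "node did not recover in time"
    assert acceptance_check()                # the system is back to normal
```

The same skeleton applies to other generic failure categories: only the induce and observe steps change.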

Note that, sometimes, a failure can't be resolved with a single response. For example, in our unresponsive node case, if the cause is a hardware failure, then a reboot will not help. In that case, a second line of response comes into play: perhaps a new VM is started, configured, and joined to the cluster in place of the failed node. Here, you can't be too generic, and you may need to create tests for the specific types of pods/roles that ran on the node (etcd, master, worker, database, and monitoring).
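
The second line of response can be expressed as a small escalation routine. The sketch below is illustrative only and assumes hypothetical reboot_node and replace_node helpers that wrap your infrastructure API; it reboots first and falls back to replacing the node's VM if the reboot doesn't bring it back:

```python
import time

from kubernetes import client, config


def _ready(v1, name):
    """True if the node reports the Ready condition as True."""
    node = v1.read_node(name)
    return any(c.type == "Ready" and c.status == "True" for c in node.status.conditions)


def _wait_ready(v1, name, timeout, interval=15):
    """Poll the node's Ready condition until it is True or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if _ready(v1, name):
            return True
        time.sleep(interval)
    return False


def remediate_unresponsive_node(node_name, reboot_node, replace_node):
    """Escalate from the generic response (reboot) to a full node replacement."""
    config.load_kube_config()
    v1 = client.CoreV1Api()

    reboot_node(node_name)      # first line of response
    if _wait_ready(v1, node_name, timeout=300):
        return "rebooted"

    replace_node(node_name)     # second line: provision and configure a fresh VM,
                                # and join it to the cluster in place of the failed node
    if _wait_ready(v1, node_name, timeout=900):
        return "replaced"

    raise RuntimeError(f"{node_name} could not be remediated")
```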

If you have stringent quality requirements, be prepared to spend much more time setting up the proper testing environments and tests than you spend on the production environment itself.

One last, important point: try to be as nonintrusive as possible. Ideally, your production system will not have testing features that allow parts of it to be shut down, or that cause it to be configured to run at reduced capacity, for testing. Such features increase the attack surface of your system and can be triggered accidentally through configuration mistakes. Ideally, you can control your testing environment without modifying the code or configuration that will be deployed in production. With Kubernetes, it is usually easy to inject pods and containers with custom test functionality that can interact with system components in the staging environment, but that will never be deployed in production.
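
For example, test functionality can be injected on the fly through the Kubernetes API and torn down when the test run ends, so nothing test-related ever appears in the production manifests. The sketch below assumes a hypothetical ha-test-agent image and a dedicated staging namespace:

```python
from kubernetes import client, config


def deploy_test_agent(namespace="staging"):
    """Create a test-only pod in the staging namespace.

    The image name and namespace are assumptions; nothing here is part of
    the production manifests, so the production attack surface is unchanged.
    """
    config.load_kube_config()
    v1 = client.CoreV1Api()
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(
            name="ha-test-agent",
            labels={"purpose": "ha-testing"},  # makes cleanup easy: delete by label
        ),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="agent",
                    image="registry.example.com/ha-test-agent:latest",  # hypothetical image
                )
            ],
        ),
    )
    return v1.create_namespaced_pod(namespace=namespace, body=pod)
```

Deleting everything carrying the purpose=ha-testing label cleans up after the run.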

In this section, we looked at what it takes to have a reliable and highly available cluster, including etcd, the API server, the scheduler, and the controller manager. We considered best practices for protecting the cluster itself as well as your data, and paid special attention to the issue of staging environments and testing.
