The unveiling of a relatively famous OSS project by the team at Netflix called “Chaos Monkey” had a disruptive effect on the IT world. The concept that Netflix had built code that randomly kills various services in its production environment blew people’s minds. When many teams struggle just to maintain their uptime requirements, promoting self-sabotage and attacking oneself seemed absolutely crazy. Yet from the moment Chaos Monkey was born, a new movement arose: chaos engineering.
According to the Principles of Chaos Engineering website, “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
In complex systems (software systems or ecological systems), things can and will fail, but the ultimate goal is to prevent catastrophic failure of the overall system. So how do you verify that your overall system—your network of microservices—is in fact resilient? You inject a little chaos. With Istio, this is a relatively simple matter because the istio-proxy intercepts all network traffic and can therefore alter responses, including the time it takes to respond. Two interesting faults that Istio makes easy to inject are HTTP error codes and network delays.
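Before turning to Istio itself, it may help to see what percent-based fault injection means numerically. The following is a hypothetical bash simulation (not part of the tutorial, and no Istio required) in which a stand-in “proxy” aborts roughly half of all requests, mirroring the percent: 50 setting used later in this chapter:

```shell
#!/bin/bash
# Hypothetical simulation of Istio's percent-based abort fault:
# each "request" is failed with ~50% probability, mirroring what
# the istio-proxy does when a VirtualService sets percent: 50.
total=1000
aborts=0
for i in $(seq 1 $total); do
  # $RANDOM is 0-32767; modulo 100 gives a roughly uniform 0-99 draw
  if [ $((RANDOM % 100)) -lt 50 ]; then
    aborts=$((aborts + 1))
  fi
done
echo "injected 503 on $aborts of $total simulated requests"
```

Run it a few times and the abort count hovers near 500 of 1,000, which is exactly the behavior you should expect to observe when curling the customer endpoint later in this section.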
Based on exercises earlier in this book, make sure that recommendation v1 and v2 are both deployed with no code-driven misbehavior or long waits/latency. Now, you will be injecting errors via Istio instead of using Java code:
```
oc get pods -l app=recommendation -n tutorial

NAME                          READY     STATUS    RESTARTS   AGE
recommendation-v1-3719512284  2/2       Running   6          18m
recommendation-v2-2815683430  2/2       Running   0          13m
```
Also, double-check that you do not have any DestinationRules or VirtualServices:

```
oc delete virtualservices --all -n tutorial
oc delete destinationrules --all -n tutorial
```
We use the combination of Istio’s DestinationRule and VirtualService to inject faults for a percentage of requests: in this case, returning the HTTP 503 error code 50% of the time.
The DestinationRule:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: recommendation
  namespace: tutorial
spec:
  host: recommendation
  subsets:
  - labels:
      app: recommendation
    name: app-recommendation
```
The VirtualService:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: recommendation
  namespace: tutorial
spec:
  hosts:
  - recommendation
  http:
  - fault:
      abort:
        httpStatus: 503
        percent: 50
    route:
    - destination:
        host: recommendation
        subset: app-recommendation
```
And you apply the DestinationRule and VirtualService:

```
oc -n tutorial create -f istiofiles/destination-rule-recommendation.yml
oc -n tutorial create -f istiofiles/virtual-service-recommendation-503.yml
```
Testing the change is as simple as issuing a few curl commands at the customer endpoint. Make sure to test it a few times, looking for the resulting 503 approximately 50% of the time:
```
curl customer-tutorial.$(minishift ip).nip.io
customer => preference => recommendation v1 from '3719512284': 88

curl customer-tutorial.$(minishift ip).nip.io
customer => 503 preference => 503 fault filter abort
```
You can now see if the preference service is properly handling the exceptions being returned by the recommendation service.
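If you want more than a spot check, you can record one status code per line and tally them. The curl capture loop shown in the comment is one way to produce such a file against the customer endpoint; the heredoc sample below stands in for a live capture so that the tally itself is runnable end to end:

```shell
#!/bin/bash
# Sample capture; in a live run, generate codes.txt with a curl loop
# such as:
#   for i in $(seq 1 20); do
#     curl -s -o /dev/null -w '%{http_code}\n' \
#       customer-tutorial.$(minishift ip).nip.io
#   done > codes.txt
cat > codes.txt <<'EOF'
200
503
200
503
EOF
# Tally how many responses carried the injected 503
awk '/^503$/ {n++} END {printf "503s: %d of %d\n", n, NR}' codes.txt
```

Against the sample data this prints `503s: 2 of 4`; with `percent: 50` configured, a real capture should likewise hover around half.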
Clean up by removing the VirtualService but leaving the DestinationRule in place:

```
oc delete virtualservices --all -n tutorial
```
The most insidious of possible distributed computing faults is not a “dead” service but a service that is responding slowly, potentially causing a cascading failure in your network of services. More importantly, if your service has a specific service-level agreement (SLA) it must meet, how do you verify that slowness in your dependencies doesn’t cause you to fail in delivering to your awaiting customer?
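One common defense for protecting an SLA is a client-side deadline: cap how long you are willing to wait so that a slow dependency fails fast instead of consuming your entire time budget. curl offers this via its --max-time flag; the sketch below simulates the same idea locally, with coreutils `timeout` playing the impatient client and `sleep 7` standing in for a dependency that is 7 seconds slow:

```shell
#!/bin/bash
# timeout enforces a 1-second deadline; `sleep 7` stands in for a
# dependency that responds slowly (like a 7s injected delay).
# The live equivalent would be:
#   curl --max-time 1 customer-tutorial.$(minishift ip).nip.io
timeout 1 sleep 7
echo "exit code: $?"   # 124 means the deadline fired before completion
```

The caller gets a fast, deterministic failure it can handle (fallback, retry, or error out) instead of silently inheriting the dependency’s slowness.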
Much like the HTTP error injection, network delays use the VirtualService kind as well. The following manifest injects 7 seconds of delay into 50% of the responses from the recommendation service:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  creationTimestamp: null
  name: recommendation
  namespace: tutorial
spec:
  hosts:
  - recommendation
  http:
  - fault:
      delay:
        fixedDelay: 7.000s
        percent: 50
    route:
    - destination:
        host: recommendation
        subset: app-recommendation
```
Apply the new VirtualService:

```
oc -n tutorial create -f istiofiles/virtual-service-recommendation-delay.yml
```
Then, send a few requests to the customer endpoint and notice the time command at the front. It outputs the elapsed time for each curl command, allowing you to see that 7-second delay:

```bash
#!/bin/bash
while true
do
  time curl customer-tutorial.$(minishift ip).nip.io
  sleep .1
done
```
Many requests to the customer endpoint now have a delay. If you are monitoring the logs for recommendation v1 and v2, you will also see that the delay happens before the recommendation service is actually called. The delay is in the Istio proxy (Envoy), not in the actual endpoint:

```
oc logs recommendation-v2-2815683430 -f -c recommendation
```
In Chapter 4 you saw how to deal with errors bubbling up from your code, and in this chapter you played the role of self-saboteur by injecting errors and delays via Istio’s VirtualService. At this point, there should be a key question in your mind: “How do I know that these errors are happening within my application?” The answer is in Chapter 6.
Clean up:

```
oc delete virtualservice recommendation -n tutorial
oc delete destinationrule recommendation -n tutorial
```