Chapter 5. Chaos Testing

The unveiling of a relatively famous OSS project by the team at Netflix called “Chaos Monkey” had a disruptive effect on the IT world. The concept that Netflix had built code that randomly kills various services in their production environment blew people’s minds. At a time when many teams struggle just to maintain their uptime requirements, deliberately sabotaging and attacking one's own systems seemed absolutely crazy. Yet from the moment Chaos Monkey was born, a new movement arose: chaos engineering.

According to the Principles of Chaos Engineering website, “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”

In complex systems (software systems or ecological systems), things can and will fail, but the ultimate goal is to prevent catastrophic failure of the overall system. So how do you verify that your overall system—your network of microservices—is in fact resilient? You inject a little chaos. With Istio, this is relatively simple: because the istio-proxy intercepts all network traffic, it can alter responses, including the time it takes to respond. Two interesting faults that Istio makes easy to inject are HTTP error codes and network delays.

HTTP Errors

Based on exercises earlier in this book, make sure that recommendation v1 and v2 are both deployed with no code-driven misbehavior or long waits/latency. Now, you will be injecting errors via Istio instead of using Java code:

oc get pods -l app=recommendation -n tutorial
NAME                           READY   STATUS   RESTARTS  AGE
recommendation-v1-3719512284   2/2     Running  6         18m
recommendation-v2-2815683430   2/2     Running  0         13m

Also, double-check that you do not have any DestinationRules or VirtualServices:

oc delete virtualservices --all -n tutorial
oc delete destinationrules --all -n tutorial

We use the combination of Istio’s DestinationRule and VirtualService to inject a percentage of faults—in this case, returning the HTTP 503 error code 50% of the time.

The DestinationRule:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: recommendation
  namespace: tutorial
spec:
  host: recommendation
  subsets:
  - labels:
      app: recommendation
    name: app-recommendation

The VirtualService:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: recommendation
  namespace: tutorial
spec:
  hosts:
  - recommendation
  http:
  - fault:
      abort:
        httpStatus: 503
        percent: 50
    route:
    - destination:
        host: recommendation
        subset: app-recommendation

And you apply the DestinationRule and VirtualService:

oc -n tutorial create -f \
  istiofiles/destination-rule-recommendation.yml

oc -n tutorial create -f \
  istiofiles/virtual-service-recommendation-503.yml

Testing the change is as simple as issuing a few curl commands against the customer endpoint. Make sure to test it several times, looking for the resulting 503 approximately 50% of the time:

curl customer-tutorial.$(minishift ip).nip.io
customer => preference => recommendation v1 from
                                         '3719512284': 88

curl customer-tutorial.$(minishift ip).nip.io
customer => 503 preference => 503 fault filter abort

You can now see if the preference service is properly handling the exceptions being returned by the recommendation service.
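Rather than eyeballing individual responses, you can tally the failure rate over a batch of calls. The following is a minimal sketch, not part of the tutorial's istiofiles: the `rate_503` helper name is made up here, and the commented usage assumes the same minishift-based customer URL used throughout this book.

```shell
#!/bin/bash
# Tally what fraction of responses contain the injected 503.
# Reads one response per line on stdin and prints errors/total.
rate_503() {
  awk '/503/ { errors++ } { total++ }
       END { printf "%d/%d responses were 503s\n", errors, total }'
}

# Usage against the live endpoint (URL assumes the minishift setup):
#   for i in $(seq 1 20); do
#     curl -s customer-tutorial.$(minishift ip).nip.io
#   done | rate_503
```

With `percent: 50` in the VirtualService, a batch of 20 requests should report roughly half of them as 503s.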

Clean up by removing the VirtualService while leaving the DestinationRule in place:

oc delete virtualservices --all -n tutorial

Delays

The most insidious of possible distributed computing faults is not a “dead” service but a service that is responding slowly, potentially causing a cascading failure in your network of services. More importantly, if your service has a specific service-level agreement (SLA) it must meet, how do you verify that slowness in your dependencies doesn’t cause you to fail in delivering to your awaiting customer?
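One common way to protect such an SLA is a route-level timeout in the VirtualService, so that callers fail fast instead of waiting on a slow dependency. The following is a sketch, not one of the tutorial's istiofiles, reusing the host and subset from the earlier examples; the 1-second value is illustrative:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: recommendation
  namespace: tutorial
spec:
  hosts:
  - recommendation
  http:
  - route:
    - destination:
        host: recommendation
        subset: app-recommendation
    timeout: 1.000s
```

Combining a timeout like this with the delay injection that follows is a good way to verify, under controlled chaos, that slowness in a dependency is contained rather than propagated.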

Much like HTTP error injection, network delay injection also uses the VirtualService kind. The following manifest adds a 7-second delay to 50% of the requests to the recommendation service:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: recommendation
  namespace: tutorial
spec:
  hosts:
  - recommendation
  http:
  - fault:
      delay:
        fixedDelay: 7.000s
        percent: 50
    route:
    - destination:
        host: recommendation
        subset: app-recommendation

Apply the new VirtualService:

oc -n tutorial create -f \
  istiofiles/virtual-service-recommendation-delay.yml

Then, send a few requests to the customer endpoint, noting the “time” command at the front of the curl invocation. It outputs the elapsed time of each response, letting you see the 7-second delay when it occurs:

#!/bin/bash
while true
do
  time curl customer-tutorial.$(minishift ip).nip.io
  sleep .1
done

Many requests to the customer endpoint now have a delay. If you are monitoring the logs for recommendation v1 and v2, you will also see that the delay occurs before the recommendation service is actually called. The delay is in the Istio proxy (Envoy), not in the actual endpoint:

oc logs recommendation-v2-2815683430 -f -c recommendation

In Chapter 4 you saw how to deal with errors bubbling up from your code, and now in this chapter you have played the role of self-saboteur by injecting errors and delays via Istio’s VirtualService. By now a key question should be forming in your mind: “How do I know that these errors are happening within my application?” The answer is in Chapter 6.

Clean up:

oc delete virtualservice recommendation -n tutorial
oc delete destinationrule recommendation -n tutorial