Creating an alert for an elevated error rate in our application

As a reminder, our helloworld application has a very simple design:

All user traffic goes through an ELB instance. Since we use HTTP to communicate, we can easily identify when something unexpected is happening.

If you take a closer look at the Monitoring tab of one of your load balancer instances (http://amzn.to/2rsEaLY), you will see some of the top-level metrics you should care about.

While each application has its own behavior, an increase in latency or HTTP 5XXs is usually a good signal that someone needs to take a closer look at the service.

We will incorporate the monitoring of those two metrics into our template.

We will reopen our troposphere script: helloworld-ecs-alb-cf-template.py.

We will first add some new imports as follows:

from troposphere.cloudwatch import (
    Alarm,
    MetricDimension,
)

Then, we will go to the bottom of the file, where we already created the alarms CPUTooLow and CPUTooHigh.

Just before the last print statement, we will add a new Alarm resource as follows:

t.add_resource(Alarm( 
    "ELBHTTP5xxs", 
    AlarmDescription="Alarm if HTTP 5xxs too high", 

We are giving it a reference and a description. To target the proper metric, we need to specify the namespace of the ELB service and the name of the metric, as shown here:

    Namespace="AWS/ELB", 
    MetricName="HTTPCode_Backend_5XX", 

We want our alert to target the load balancer instance created with that template. For that, we'll reference the load balancer resource in the metric dimension as follows:

    Dimensions=[ 
        MetricDimension( 
            Name="LoadBalancerName", 
            Value=Ref("LoadBalancer") 
        ), 
    ], 

We want the alert to trigger if the number of HTTP 5xxs is, on average, greater than or equal to 30 for three consecutive one-minute periods. This is done using the following properties:

    Statistic="Average", 
    Period="60", 
    EvaluationPeriods="3", 
    Threshold="30", 
    ComparisonOperator="GreaterThanOrEqualToThreshold", 

The last part of the alert consists of selecting the action to perform when the alert triggers. In our case, we want to send a message to the alert-sms SNS topic. To do that, we need that topic's ARN, which we can get using the following command:

$ aws sns list-topics  
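If you prefer to look the ARN up programmatically, a small boto3 sketch along the following lines will also work. This is only an illustration; it assumes boto3 is installed, your AWS credentials are configured, and the topic is the alert-sms topic created earlier:

import boto3

# List the SNS topics in the account and keep the alert-sms one.
# Note: list_topics is paginated; a single call is enough for a handful of topics.
sns = boto3.client("sns")
topics = sns.list_topics()["Topics"]
alert_sms_arn = next(
    topic["TopicArn"]
    for topic in topics
    if topic["TopicArn"].endswith(":alert-sms")
)
print(alert_sms_arn)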

Once you have the topic's ARN, you can specify it in AlarmActions and OKActions. Additionally, we will leave InsufficientDataActions empty, as this metric is what we call a sparse metric, meaning that if no 5xxs are emitted, the service will not produce any data points at all, as opposed to creating data points with a value of 0. OKActions is also somewhat optional and more a question of taste; configured as such, CloudWatch will emit another SMS when the alert resolves:

    AlarmActions=["arn:aws:sns:us-east-1:511912822958:alert-sms"], 
    OKActions=["arn:aws:sns:us-east-1:511912822958:alert-sms"], 
    InsufficientDataActions=[], 

This concludes the creation of that alarm. We can close the open parentheses:

)) 
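For reference, here is the first alarm assembled from the snippets above into a single block (the topic ARN is the one from our account; yours will differ):

t.add_resource(Alarm(
    "ELBHTTP5xxs",
    AlarmDescription="Alarm if HTTP 5xxs too high",
    Namespace="AWS/ELB",
    MetricName="HTTPCode_Backend_5XX",
    Dimensions=[
        MetricDimension(
            Name="LoadBalancerName",
            Value=Ref("LoadBalancer")
        ),
    ],
    Statistic="Average",
    Period="60",
    EvaluationPeriods="3",
    Threshold="30",
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:511912822958:alert-sms"],
    OKActions=["arn:aws:sns:us-east-1:511912822958:alert-sms"],
    InsufficientDataActions=[],
))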

After that new alarm, we will create the alarm targeting latency. Almost everything is identical. We are going to create a new resource and give it an identifier and a description as follows:

t.add_resource(Alarm( 
    "ELBHLatency", 
    AlarmDescription="Alarm if Latency too high", 

We will use the same namespace, but with a different metric name:

    Namespace="AWS/ELB", 
    MetricName="Latency", 

The dimensions are the same as before:

    Dimensions=[ 
        MetricDimension( 
            Name="LoadBalancerName", 
            Value=Ref("LoadBalancer") 
        ), 
    ], 

For latency, we are looking at five consecutive one-minute evaluation periods and a threshold of 500 ms to trigger the alarm (the ELB Latency metric is reported in seconds, which is why the threshold is set to 0.5):

    Statistic="Average", 
    Period="60", 
    EvaluationPeriods="5", 
    Threshold="0.5", 
    ComparisonOperator="GreaterThanOrEqualToThreshold", 
    AlarmActions=["arn:aws:sns:us-east-1:511912822958:alert-sms"], 
    OKActions=["arn:aws:sns:us-east-1:511912822958:alert-sms"], 
    InsufficientDataActions=[], 
)) 
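Since the two alarms share most of their properties, you could optionally factor the common parts into a small helper function. The following is just a sketch, and the create_elb_alarm helper is hypothetical (it is not part of the original template):

from troposphere import Ref
from troposphere.cloudwatch import Alarm, MetricDimension

# Hypothetical helper: builds a CloudWatch alarm on an AWS/ELB metric
# for the load balancer defined in this template.
def create_elb_alarm(name, description, metric_name,
                     threshold, evaluation_periods,
                     topic_arn="arn:aws:sns:us-east-1:511912822958:alert-sms"):
    return Alarm(
        name,
        AlarmDescription=description,
        Namespace="AWS/ELB",
        MetricName=metric_name,
        Dimensions=[
            MetricDimension(
                Name="LoadBalancerName",
                Value=Ref("LoadBalancer")
            ),
        ],
        Statistic="Average",
        Period="60",
        EvaluationPeriods=evaluation_periods,
        Threshold=threshold,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        AlarmActions=[topic_arn],
        OKActions=[topic_arn],
        InsufficientDataActions=[],
    )

# The two alarms from this section, expressed with the helper:
t.add_resource(create_elb_alarm(
    "ELBHTTP5xxs", "Alarm if HTTP 5xxs too high",
    "HTTPCode_Backend_5XX", "30", "3"))
t.add_resource(create_elb_alarm(
    "ELBHLatency", "Alarm if Latency too high",
    "Latency", "0.5", "5"))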

This concludes the creation of our alarms. Your new template should look as follows: http://bit.ly/2v3s0dQ.

You can commit the changes, generate the new CloudFormation template, and deploy it using the usual steps:

$ git add helloworld-ecs-alb-cf-template.py 
$ git commit -m "Creating SNS alarms"
$ git push
$ python helloworld-ecs-alb-cf-template.py > helloworld-ecs-alb-cf.template
$ aws cloudformation update-stack \
    --stack-name staging-alb \
    --template-body file://helloworld-ecs-alb-cf.template
$ aws cloudformation update-stack \
    --stack-name production-alb \
    --template-body file://helloworld-ecs-alb-cf.template

Blameless post-mortems
To close our feedback loop, we need to talk about learning. When failures happen, one of the best ways to build that learning component is to create post-mortem documents that describe the incident, the timeline, the root cause, and how it was resolved. John Allspaw, one of the "founding fathers" of the DevOps movement, did extensive thinking in this area and created the concept of blameless post-mortems, an approach that emphasizes learning over finger-pointing.

One of the restrictions of CloudWatch alarms is the notion of alarm dimensions. In our last example, the ELB is a single resource, which made it easy to create the alert, as we could reference the resource by name. For more dynamic resources, such as our EC2 instances, we might want to monitor something that isn't exposed at the load balancer level.

To accomplish such things, we need to look at CloudWatch Events.
