Chapter 15
Monitoring and Troubleshooting

THE AWS CERTIFIED DEVELOPER – ASSOCIATE EXAM TOPICS COVERED IN THIS CHAPTER MAY INCLUDE, BUT ARE NOT LIMITED TO, THE FOLLOWING:

  • Domain 5: Monitoring and Troubleshooting
  • 5.1 Write code that can be monitored.

    Content may include the following:

    • Monitoring basics
    • Using Amazon CloudWatch
    • Using AWS CloudTrail
  • 5.2 Perform root cause analysis on faults found in testing or production.

    Content may include the following:

    • Using AWS X-Ray to troubleshoot application issues

Introduction to Monitoring and Troubleshooting

Monitoring the applications and services you build is vital to the success of any information technology (IT) organization. With the AWS Cloud, you can leverage monitoring resources to drive business decisions such as what resources to create, improve, optimize, and secure.

Traditional approaches to monitoring do not scale to cloud architectures: monitoring for large systems can be difficult to set up, configure, and maintain. These difficulties are compounded by the trend away from monolithic installations toward service-oriented architecture (SOA), microservices, and serverless architectures. When working on a monolithic application, you can add logging statements and troubleshoot with breakpoints. Today's applications, however, are spread across many systems and large networks, making it difficult to track the health of systems and react to issues. For example, using logging statements to monitor the execution time and error rates of AWS Lambda functions becomes impractical as your infrastructure grows and spreads across multiple AWS Regions.

The AWS Cloud provides fully managed services to help you implement monitoring solutions that are reliable, scalable, and secure. AWS offers services to help you monitor, log, and analyze your applications and infrastructure. In this section, you explore Amazon CloudWatch, AWS CloudTrail, and AWS X-Ray. Figure 15.1 shows the AWS monitoring services available.


Figure 15.1 Various monitoring services on AWS

Monitoring Basics

Before you explore these services, consider why they are essential. As a developer, you are designing systems to provide IT or business solutions to a customer. Success is measured by the effective application of software to business objectives. What are some of the metrics that you must track over time to ensure that these objectives are being met?

Choosing Metrics

AWS takes the approach of “working backward” from the customer. You can accomplish this by starting with the customer and tracing the underlying components that affect the customer’s experience. This provides a foundation for identifying which metrics to monitor, as they correlate directly to the customer experience. Frequently, the top characteristics that directly affect the customer experience are performance and cost. Changes to either have a direct impact on how customers perceive the software they use.

Deciding which metrics to monitor requires that you answer several crucial questions.

Performance and Cost

Question: Are my customers having a good experience with the services or systems that I provide?

The phrase good experience can be broken down into measurable metrics, such as request latency, time to first byte, error rates, and more. Metrics, such as instance CPU utilization or network bytes in/out, however, may not be representative of the customer experience.

It is good practice to measure any metric that directly affects customers using your software or system. The second question to ask is: “What is the overall cost of my system?” Increases in performance often correlate directly with increases in cost. With unlimited money, it would be easy to design a system that scales infinitely in response to customer usage. In reality, budgets are finite, so you must measure the performance of your system to determine what is acceptable given its usage at any point in time. This is where metrics that are not customer-facing often take precedence.

Trends

Question: How can I use monitoring to predict changes in customer demand?

With the agility and elasticity of the AWS Cloud, this can be especially useful. Monitoring and measuring customer demand over time allows you to scale your infrastructure predictively to meet changes in customer demand without having to purchase more resources than are necessary. For example, suppose that you have a web application that runs on three Amazon Elastic Compute Cloud (Amazon EC2) instances during the day. In the evenings, demand increases significantly for several hours before decreasing again late at night. On weekends, your application sees almost no traffic. With historical information obtained through monitoring, you can design your application to scale out across more Amazon EC2 instances during the evenings and scale in on the weekends when there is little demand. Predictive scaling occurs before customer demand changes, ensuring a smooth experience while new resources are created and brought online.
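
The scheduling logic described above can be sketched as a simple capacity function. The thresholds and instance counts below are illustrative assumptions, not values from the text; a real deployment would derive them from historical CloudWatch data and apply them through AWS Auto Scaling.

```python
from datetime import datetime, timezone

def desired_instances(when: datetime) -> int:
    """Return the EC2 instance count to provision ahead of expected demand."""
    if when.weekday() >= 5:      # Saturday or Sunday: almost no traffic
        return 1
    if 17 <= when.hour < 23:     # weekday evening: demand increases significantly
        return 6
    return 3                     # weekday daytime baseline from the example

# A Monday evening (UTC) calls for the scaled-out fleet.
print(desired_instances(datetime(2018, 10, 22, 19, tzinfo=timezone.utc)))  # 6
```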

Troubleshooting and Remediation

Question: Where do problems occur?

As Werner Vogels, VP and CTO of Amazon.com, once said, “Everything fails all the time.” No system is impervious to failure. By gathering potentially relevant information ahead of time, it becomes easier to determine the causes of a failure, and you can improve key operational metrics such as mean time between failures (MTBF) and mean time to resolution (MTTR).

Learning and Improvement

Question: Can you detect or prevent problems in the future?

By evaluating operational metrics over time, you can reveal patterns and common issues in your systems.

Note: When choosing metrics, align them closely to your business processes to provide a better customer experience. For example, suppose that you have an application running in AWS Elastic Beanstalk. Unknown to you, the application has a memory leak. Without tracking memory utilization over time, you will not have insight into why customers are experiencing degraded performance. If your Elastic Beanstalk environment is configured to scale out based on CPU utilization, it is possible that no new instances are launched to serve customer requests. In this case, the memory leak prevents new requests from being processed, causing a drop in CPU utilization. Without comprehensive tracking of system performance, issues such as this can go unnoticed until system-wide outages occur.

These factors impact what is referred to as the health of your systems. As a developer and contributor, you are not only responsible for the code that you develop but also for the operational health of these services. It is vital to align operational and health metrics properly with customer expectations and experiences.

Amazon CloudWatch

Amazon CloudWatch is a monitoring and metrics service that provides you with a fully managed system to collect, store, and analyze your metrics and logs. By using CloudWatch, you can create notifications on changes in your environment.

Typical use cases include the following:

  • Infrastructure monitoring and troubleshooting
  • Resource optimization
  • Application monitoring
  • Logging analytics
  • Error reporting and notification

CloudWatch enables you to collect and store monitoring and operations data from logs, metrics, and events that run on AWS and on-premises resources. To ensure that your applications run smoothly, you can use CloudWatch to perform the following tasks:

  • Set alarms
  • Visualize logs and metrics
  • Automate recovery from errors
  • Troubleshoot issues
  • Discover insights that enable you to optimize your resources

How Amazon CloudWatch Works

CloudWatch acts as a metrics repository, storing metrics and logs from various sources. These metrics can come from AWS resources using built-in or custom metrics. Figure 15.2 illustrates the role of CloudWatch in operational health.


Figure 15.2 Diagram of Amazon CloudWatch

CloudWatch can process these metrics into statistics that are made available through the CloudWatch console, AWS APIs, the AWS Command Line Interface (AWS CLI), and AWS software development kits (AWS SDKs). Using CloudWatch, you can display graphs, create alarms, or integrate with third-party solutions.

Amazon CloudWatch Metrics

To understand CloudWatch better, especially how data is collected and organized, review the following terms.

Built-In Metrics

A metric is a set of time-series data points that you publish to CloudWatch. For example, a commonly monitored metric for Amazon EC2 instances is CPU utilization. Data points can come from multiple systems, both AWS and on-premises. You can also define custom metrics based on data specific to your system. A metric is identified uniquely by a namespace, a name, and zero or more dimensions.

Namespace

A namespace is a container of related metrics. Namespaces used by AWS services all begin with AWS; for example, Amazon EC2 uses the AWS/EC2 namespace. As a developer, you can create namespaces for the different components of your applications, such as the front-end, backend, and database components.

Name

A name for a given metric defines the attribute or property that you are monitoring; for example, CPUUtilization in the AWS/EC2 namespace. The AWS/EC2 namespace contains various metrics that are important to monitoring the health of Amazon EC2 resources, such as CPU utilization, disk I/O, network I/O, and status checks. You can also create custom metrics for attributes such as request latency, HTTP 4XX/5XX response codes, and throttling.

Dimension

A dimension is a name/value pair used to define a metric uniquely. For example, for the namespace AWS/EC2 and name/metric CPUUtilization, the dimension might be InstanceId. For a fleet of Amazon EC2 instances, you can measure CPUUtilization as one metric for multiple dimensions (one for each instance). You can use the dimensions to structure and organize the data points you gather.
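
As a rough sketch, the way a namespace, a name, and dimensions combine to identify a metric can be modeled as a composite key (the instance IDs below are made up):

```python
def metric_key(namespace: str, name: str, dimensions: dict) -> tuple:
    """Model CloudWatch's identity rule: namespace + name + dimensions."""
    # Dimensions are name/value pairs; sort them so ordering does not matter.
    return (namespace, name, tuple(sorted(dimensions.items())))

# Two instances publishing CPUUtilization yield two distinct metrics,
# one per InstanceId dimension value.
a = metric_key("AWS/EC2", "CPUUtilization", {"InstanceId": "i-0123456789abcdef0"})
b = metric_key("AWS/EC2", "CPUUtilization", {"InstanceId": "i-0fedcba9876543210"})
print(a != b)  # True
```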

Tip: When you’re creating metrics, consider defining namespaces that align with your different services, and within each namespace tracking the metrics that best describe the health of that service. For example, if you have a front-end web fleet running NGINX servers, metrics such as requests per second, response time, active connections, and response codes could help you determine which configuration changes would optimize system performance.

Data Points

When data is published to CloudWatch, it is pushed in sets of data points. Each data point contains information such as the timestamp, value, and unit of measurement.

Timestamp

Timestamps are dateTime objects with the complete date and time; for example, 2016-10-31T23:59:59Z. Although not required, AWS recommends formatting times as Coordinated Universal Time (UTC).

Value

The value is the measurement for the data point.

Unit

A unit of measurement is used to label your data. This offers a better understanding of what the value represents. Example units include Bytes, Seconds, Count, and Percent. If you do not specify a unit in CloudWatch, your data point units are designated as None.

CloudWatch stores this data according to the retention period, which is the length of time to keep data points available. The retention period depends on how often the data points are published:

  • Data points with a published frequency less than 60 seconds are available for 3 hours. These data points are high-resolution custom metrics.
  • Data points with a published frequency of 60 seconds (1 minute) are available for 15 days.
  • Data points with a published frequency of 300 seconds (5 minutes) are available for 63 days.
  • Data points with a published frequency of 3,600 seconds (1 hour) are available for 455 days (15 months).
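
The retention schedule above can be expressed as a lookup, which is handy when estimating how far back a query can reach (a sketch; the helper name is ours, not an AWS API):

```python
def retention_days(frequency_seconds: int) -> float:
    """Return how long CloudWatch keeps data points published at this frequency."""
    if frequency_seconds < 60:     # high-resolution custom metrics
        return 3 / 24              # 3 hours
    if frequency_seconds < 300:    # 1-minute metrics
        return 15
    if frequency_seconds < 3600:   # 5-minute metrics
        return 63
    return 455                     # 1-hour metrics: 15 months

print(retention_days(60))  # 15
```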

From these data points, CloudWatch can calculate statistics to provide you with insight into your application, service, or environment. In the next section, you will discover how CloudWatch calculates and organizes these statistics.

Statistics

CloudWatch provides statistics based on metric data provided to the service. Statistics are aggregations of data points over specified periods of time for specified metrics. A period is the length of time, in seconds. Periods can be defined in values of 1, 5, 10, 30, or any multiple of 60 seconds (up to 86,400 seconds, or 1 day). The available statistics in CloudWatch include the following:

  • Minimum (Min), the lowest value recorded over the specified period
  • Maximum (Max), the highest value recorded over the specified period
  • Sum, the total value of the samples added together over the specified period
  • Average (Avg), the Sum divided by the SampleCount over the specified period
  • SampleCount, the number of data points used in the calculation over the specified period
  • pNN, percentile statistics for tracking metric outliers
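
These statistics are straightforward to compute by hand, which makes the definitions concrete. The sketch below uses the three PageViewCount samples (2, 4, 5) that appear in the CLI examples later in this chapter, and a nearest-rank percentile, one common convention (CloudWatch's exact percentile algorithm may differ):

```python
import math

samples = [2, 4, 5]  # three data points within one period

stats = {
    "SampleCount": len(samples),
    "Minimum": min(samples),
    "Maximum": max(samples),
    "Sum": sum(samples),
    "Average": sum(samples) / len(samples),
}

def percentile(values, p):
    """Nearest-rank pNN: the value below which p percent of samples fall."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

print(stats["Average"])         # 3.6666666666666665
print(percentile(samples, 99))  # p99 -> 5, the outlier end of the data
```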

Statistics can be used to gain insight into the health of your application and to help you determine the correct settings for various configurations. For example, you may want to implement automatic scaling on your fleet of Amazon EC2 instances in order to avoid having to launch and terminate instances manually. To do so, you must configure an Auto Scaling group. Configuration settings for an Auto Scaling group include the minimum, desired, and maximum number of instances to run in your account. By monitoring statistics over time, you can determine the minimum and maximum number of instances needed to support the average, minimum, and maximum workload.

CloudWatch statistics provide a powerful way to process large amounts of metrics at scale and present insightful data that is easy to consume. Now that you understand how CloudWatch metrics work and are organized, explore the metrics available.

Aggregations

CloudWatch aggregates metrics according to the period of time you specify when retrieving statistics. When you request a statistic, you can also have CloudWatch filter the data points based on the dimensions of the metric. For example, Amazon DynamoDB metrics can be fetched across all DynamoDB operations, or you can filter on the Operation dimension to limit the statistics to specific operations, such as GetItem requests. Note that CloudWatch does not aggregate data across Regions.

Available Metrics

Table 15.1 describes the available metrics for Elastic Load Balancing resources. To discover all of the available metrics, refer to the AWS documentation.

Table 15.1 Elastic Load Balancing Metrics

Namespace AWS/ELB, AWS/ApplicationELB, AWS/NetworkELB
Dimensions LoadBalancerName: name of the load balancer
Key metrics
  • HealthyHostCount: number of responding backend servers
  • RequestCount: number of IPv4 and IPv6 requests
  • ActiveConnectionCount: total number of concurrent active connections from clients

Table 15.2 describes the available Amazon EC2 metrics.

Table 15.2 Amazon EC2 Metrics

Namespace AWS/EC2
Dimensions

InstanceId: identifier of a particular Amazon EC2 instance

InstanceType: type of Amazon EC2 instance, such as t2.micro, m4.large

Key metrics
  • CPUUtilization: percentage of vCPU utilization on the instance
  • DiskReadOps, DiskWriteOps: number of operations per second on attached disk
  • DiskReadBytes, DiskWriteBytes: volume of bytes to transfer on attached disk
  • NetworkIn, NetworkOut: number of bytes sent or received by network interfaces
  • NetworkPacketsIn, NetworkPacketsOut: number of packets sent or received by network interfaces

Tip: Amazon EC2 does not report memory utilization to CloudWatch. This is because memory is allocated in full to an instance by the underlying host. Memory consumption is visible only to the guest operating system (OS) of the instance. However, you can report memory utilization to CloudWatch using the CloudWatch agent.

Table 15.3 describes the AWS Auto Scaling group metrics.

Table 15.3 AWS Auto Scaling Groups

Namespace AWS/AutoScaling
Dimensions AutoScalingGroupName: name of the Auto Scaling group
Key metrics
  • GroupMinSize, GroupMaxSize, GroupDesiredCapacity: minimum, maximum, and desired size of the Auto Scaling group
  • GroupInServiceInstances: number of instances up and running in the Auto Scaling group
  • GroupTotalInstances: total number of instances in the Auto Scaling group, regardless of state

Table 15.4 describes the Amazon Simple Storage Service (Amazon S3) metrics.

Table 15.4 Amazon S3 Metrics

Namespace AWS/S3
Dimensions
  • BucketName: name of a specific Amazon S3 bucket
  • StorageType: the Amazon S3 storage class (STANDARD, STANDARD_IA, and GLACIER storage classes) of the bucket
Key metrics
  • BucketSizeBytes: total size, in bytes, of data stored in an Amazon S3 bucket
  • NumberOfObjects: total number of objects stored in an Amazon S3 bucket
  • AllRequests: total number of requests made to an Amazon S3 bucket

Table 15.5 describes the Amazon DynamoDB metrics.

Table 15.5 Amazon DynamoDB Metrics

Namespace AWS/DynamoDB
Dimensions
  • TableName: name of Amazon DynamoDB table
  • Operation: limits metrics to a particular operation (PutItem, GetItem, UpdateItem, DeleteItem, Query, Scan, BatchGetItem, or BatchWriteItem)
Key metrics
  • ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits: total number of read and write capacity units consumed
  • ThrottledRequests: requests to DynamoDB that exceed the provisioned throughput limits on a resource (such as a table or an index)
  • ReadThrottleEvents: requests to DynamoDB that exceed the provisioned read capacity units for a table or a global secondary index
  • WriteThrottleEvents: requests to DynamoDB that exceed the provisioned write capacity units for a table or a global secondary index
  • ReturnedBytes: size of response returned in request
  • ReturnedItemCount: number of items returned in request

Table 15.6 describes the Amazon API Gateway metrics.

Table 15.6 Amazon API Gateway Metrics

Namespace AWS/ApiGateway
Dimensions
  • ApiName: filters out metrics for a particular API
  • ApiName, Method, Resource, Stage: filters out metrics for a particular API, method, resource, and stage
  • ApiName, Stage: filters out metrics for a particular deployed stage of an API
Key metrics
  • 4XXError: number of HTTP 400 errors
  • 5XXError: number of HTTP 500 errors
  • Latency: time between when Amazon API Gateway receives a request and when it responds to the client

Table 15.7 describes the AWS Lambda metrics.

Table 15.7 AWS Lambda Metrics

Namespace AWS/Lambda
Dimensions FunctionName: name of your AWS Lambda function
Key metrics
  • Invocations: number of executions of your AWS Lambda function
  • Errors: number of executions in which your AWS Lambda function failed
  • Duration: total time for each execution of your AWS Lambda function

Table 15.8 describes the Amazon Simple Queue Service (Amazon SQS) metrics.

Table 15.8 Amazon SQS Metrics

Namespace AWS/SQS
Dimensions QueueName: name of the Amazon SQS queue
Key metrics
  • ApproximateNumberOfMessagesVisible: number of messages currently available for retrieval
  • ApproximateNumberOfMessagesNotVisible: number of messages currently being processed, or in flight (the visibility timeout is still active)
  • NumberOfMessagesDeleted: number of messages that have been deleted

Tip: Amazon SQS does not report the total number of messages in the queue. You can find this value by adding ApproximateNumberOfMessagesVisible and ApproximateNumberOfMessagesNotVisible.

Table 15.9 describes the Amazon Simple Notification Service (Amazon SNS) metrics.

Table 15.9 Amazon SNS Metrics

Namespace AWS/SNS
Dimensions TopicName: name of the Amazon SNS topic
Key metrics
  • NumberOfMessagesPublished: number of messages sent to an SNS topic
  • NumberOfNotificationsDelivered: number of messages that were successfully delivered to subscribers
  • NumberOfNotificationsFailed: number of messages that were unsuccessfully delivered to subscribers

Custom Metrics

In addition to the built-in metrics that AWS provides, CloudWatch also supports custom metrics that you can publish from your systems. This section includes some commands that you can use to publish metrics to CloudWatch.

High-Resolution Metrics

With custom metrics, you have two options for resolution (the time interval between data points). You can use standard resolution for data points with a granularity of one minute, or high resolution for data points with a granularity of one second. By default, most metrics delivered by AWS services have standard resolution.

Publishing Metrics

CloudWatch supports multiple options when you publish metrics. You can publish single data points, statistic sets, or the value zero. Single data points are optimal for most telemetry. Statistic sets, however, are recommended for high-resolution data that you sample multiple times per minute; they are sets of pre-aggregated values, such as minimum, maximum, sum, and sample count, rather than individual data points. Publishing the value 0 is useful for applications with periods of inactivity, during which no other data is sent. The following sample scripts use the AWS CLI to publish data points.

USING THE AWS CLI TO PUBLISH SINGLE DATA POINTS

The following commands each publish a single data point under the metric name PageViewCount in the namespace MyService, with respective values and timestamps. You are not required to create the metric name or namespace in advance; CloudWatch appends the data points to the existing metric or creates a new metric if one does not exist.

aws cloudwatch put-metric-data \
      --metric-name PageViewCount \
      --namespace MyService \
      --value 2 \
      --timestamp 2018-10-20T12:00:00.000Z

aws cloudwatch put-metric-data \
      --metric-name PageViewCount \
      --namespace MyService \
      --value 4 \
      --timestamp 2018-10-20T12:00:01.000Z

aws cloudwatch put-metric-data \
      --metric-name PageViewCount \
      --namespace MyService \
      --value 5 \
      --timestamp 2018-10-20T12:00:02.000Z
USING THE AWS CLI TO PUBLISH STATISTICS SETS

The following command publishes a statistic set under the metric name PageViewCount in the namespace MyService, with the values Sum=11, Minimum=2, Maximum=5, and SampleCount=3, and the corresponding timestamp:

aws cloudwatch put-metric-data \
      --metric-name PageViewCount \
      --namespace MyService \
      --statistic-values Sum=11,Minimum=2,Maximum=5,SampleCount=3 \
      --timestamp 2018-10-14T12:00:00.000Z
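
The statistic-set values in the command above can be derived from raw samples before publishing. A sketch in plain Python, mirroring the three single data points published earlier:

```python
samples = [2, 4, 5]  # the raw values published one at a time earlier

statistic_values = {
    "Sum": sum(samples),
    "Minimum": min(samples),
    "Maximum": max(samples),
    "SampleCount": len(samples),
}

# Render in the key=value,... form the --statistic-values flag expects.
arg = ",".join(f"{key}={value}" for key, value in statistic_values.items())
print(arg)  # Sum=11,Minimum=2,Maximum=5,SampleCount=3
```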
USING THE AWS CLI TO PUBLISH THE VALUE ZERO

The following command publishes a single data point with the value 0 under the metric name PageViewCount in the namespace MyService with the corresponding timestamp:

aws cloudwatch put-metric-data \
      --metric-name PageViewCount \
      --namespace MyService \
      --value 0 \
      --timestamp 2018-10-14T12:00:00.000Z
Retrieving Statistics for a Metric

After you publish data to CloudWatch, you may want to retrieve statistics for a specified metric of a given resource.

USING THE AWS CLI TO RETRIEVE STATISTICS FOR A METRIC

This command retrieves the Sum, Maximum, Minimum, Average, and SampleCount statistics for the metric name PageViewCount in the namespace MyService, with a period of 60 seconds between the start time and end time. This means that CloudWatch aggregates data points in one-minute intervals to calculate each statistic.

aws cloudwatch get-metric-statistics \
      --namespace MyService \
      --metric-name PageViewCount \
      --statistics "Sum" "Maximum" "Minimum" "Average" "SampleCount" \
      --start-time 2018-10-20T12:00:00.000Z \
      --end-time 2018-10-20T12:05:00.000Z \
      --period 60

Example output from this command displays a single data point for the Metric PageViewCount.

{
    "Datapoints": [
        {
            "SampleCount": 3.0,
            "Timestamp": "2018-10-20T12:00:00Z",
            "Average": 3.6666666666666665,
            "Maximum": 5.0,
            "Minimum": 2.0,
            "Sum": 11.0,
            "Unit": "None"
        }
    ],
    "Label": "PageViewCount"
}

Amazon CloudWatch Logs

Most applications already produce some form of logging, but modern applications are deployed in distributed or service-oriented architectures, so collecting and processing logs becomes a challenge as a system grows and expands across multiple Regions. Centralized logging using CloudWatch Logs can overcome this challenge. With CloudWatch Logs, you can set up a central log storage location to ingest and process logs at scale.

Log Aggregation

Setting up centralized logging with CloudWatch Logs is a straightforward process. The first step is to install and configure the CloudWatch agent, which is used to collect custom logs and metrics from Amazon EC2 instances or on-premises servers. You choose which log files to ingest by pointing to their locations in a JavaScript Object Notation (JSON) configuration file. The second step is to configure AWS Identity and Access Management (IAM) roles or users to grant the agent permission to publish logs into CloudWatch. In addition to the CloudWatch agent, you can also send logs to CloudWatch using the AWS CLI, an AWS SDK, or the AWS API.

Because you are collecting logs from multiple sources, CloudWatch organizes your logs into three conceptual levels: groups, streams, and events.

Log Groups

A log group is a collection of log streams. For example, if you have a service that consists of a cluster of multiple machines, a log group would be a container for the logs from each of the individual instances.

Log Streams

A log stream is a sequence of log events such as a single log file from one of your instances.

Log Events

A log event is a record of some activity from an application, process, or service. This is analogous to a single line in a log file.

CloudWatch stores log events based on your retention settings, which are assigned at the log group. The default configuration is to store log data in Amazon CloudWatch Logs indefinitely. You are charged for any data stored in CloudWatch Logs in addition to data transferred out of the service. You can export CloudWatch Logs to Amazon S3 for long-term storage, which is valuable when regulations require long-term log retention. Long-term retention can be combined with Amazon S3 lifecycle policies to archive data to Amazon S3 Glacier for additional cost savings.

Log Searches

With centralized logging on CloudWatch Logs, you do not need to search through hundreds of individual servers to find a problem. After logs are ingested into CloudWatch Logs, you can search for logs through a central location using metric filters.

Metric Filters

A metric filter is a text pattern used to parse log data for specific events. As an example, consider the log in Table 15.10.

Table 15.10 Example Log

Line Log Event
1 [ERROR] Caught IllegalArgumentException
2 [ERROR] Unhandled Exception
3 Another message
4 Exiting with ERRORCODE: -1
5 [WARN] Some message
6 [ERROR][WARN] Some other message

To look for occurrences of the ERROR event, you use ERROR as your metric filter, as illustrated in Table 15.11. CloudWatch searches for that term across the logs. Matching is case sensitive, and a term also matches when it appears inside a longer word, so ERROR matches ERRORCODE.

Table 15.11 Example Metric Filters

Metric Filter Description
"" Matches all log events.
"ERROR"

Matches log events containing the term “ERROR.”

Based on the events in the example log in Table 15.10, this metric filter would find lines 1, 2, 4, and 6.

"ERROR" - "Exiting"

Matches log events containing the term “ERROR” but not the term “Exiting.”

Based on the events in the example log in Table 15.10, this metric filter would find lines 1, 2, and 6.

"ERROR" "Exception"

Matches log events containing both the terms “ERROR” and “Exception.” This filter is an AND function.

Based on the events in the example log in Table 15.10, this metric filter would find lines 1 and 2.

"?ERROR ?WARN"

Matches log events containing either the term “ERROR” or the term “WARN.” This filter is an OR function.

Based on the events in the example log in Table 15.10, this metric filter would find lines 1, 2, 4, 5, and 6.
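
The matching rules above can be sketched as a small matcher. This is a simplification of CloudWatch's filter syntax, not the real parser: required terms must all appear, excluded terms must not, and the ?-style alternatives match if any one appears, using case-sensitive substring tests (so ERROR also matches ERRORCODE):

```python
LOG = {
    1: "[ERROR] Caught IllegalArgumentException",
    2: "[ERROR] Unhandled Exception",
    3: "Another message",
    4: "Exiting with ERRORCODE: -1",
    5: "[WARN] Some message",
    6: "[ERROR][WARN] Some other message",
}

def matches(event, required=(), excluded=(), any_of=()):
    """Simplified metric-filter semantics: AND, NOT, and ?-style OR terms."""
    if any(term in event for term in excluded):
        return False
    if not all(term in event for term in required):
        return False
    return not any_of or any(term in event for term in any_of)

def find(**kwargs):
    return [line for line, event in LOG.items() if matches(event, **kwargs)]

print(find(required=["ERROR"]))                        # [1, 2, 4, 6]
print(find(required=["ERROR"], excluded=["Exiting"]))  # [1, 2, 6]
print(find(any_of=["ERROR", "WARN"]))                  # [1, 2, 4, 5, 6]
```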

If your logs are structured in JSON format, CloudWatch can also filter object properties. Consider the following example JSON log.

Example JSON Log Event

{
    "user": {
        "id": 1,
        "email": "[email protected]"
    },
    "users": [
        {
            "id": 2,
            "email": "[email protected]"
        },
        {
            "id": 3,
            "email": "[email protected]"
        }
    ],
    "actions": [
        "GET",
        "PUT",
        "DELETE"
    ],
    "coordinates": [
        [0, 1, 2],
        [4, 5, 6],
        [7, 8, 9]
    ]
}

You can create a metric filter that selects and compares certain properties of this event, as shown in Table 15.12.

Table 15.12 Example JSON Metric Filters

JSON Metric Filter Description
{ ($.user.id = 1) && ($.users[0].email = "John.D*") }

Checks that the property user.id equals 1 and that the first user’s email begins with John.D.

The preceding log event would be returned.

{ ($.user.id = 2 && $.users[0].email = "John.D*") || $.actions[2] = "GET" }

Checks that the property user.id equals 2 and the first user’s email begins with John.D, or that the third action ($.actions[2]) is GET.

The preceding log event would not be returned, because user.id is 1 and the third action is DELETE, not GET.
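
The second filter's logic can be checked mechanically. Below, plain Python indexing stands in for CloudWatch's $.-path selectors, evaluated against a trimmed copy of the example event (a sketch):

```python
event = {
    "user": {"id": 1},
    "actions": ["GET", "PUT", "DELETE"],
}

# $.user.id = 2        -> False: user.id is 1
clause_user = event["user"]["id"] == 2
# $.actions[2] = "GET" -> False: index 2 selects the third action, "DELETE"
clause_action = event["actions"][2] == "GET"

print(clause_user or clause_action)  # False -> the event is not returned
```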

Log Processing

Instead of having to write additional code to add monitoring to your application, CloudWatch can process logs that you already generate and provide valuable metrics. Using the example from the previous section, the same metric filter can be used to generate metrics corresponding to the number of occurrences of the term ERROR in your logs.

Amazon CloudWatch Alarms

After data points are established in CloudWatch, either as metrics or as logs (from which you generate metrics), you can set alarms to monitor your metrics and trigger actions in response to changes in state. CloudWatch alarms have three possible states: OK, ALARM, and INSUFFICIENT_DATA. Table 15.13 defines each alarm state.

Table 15.13 Alarm States

State Description
OK The metric or expression is within the defined threshold.
ALARM The metric or expression is outside of the defined threshold.
INSUFFICIENT_DATA The alarm has just started, the metric is not available, or not enough data is available for the metric to determine the alarm state.

An ALARM state does not necessarily indicate a problem; it means only that the monitored metric is outside the defined threshold. For example, suppose you have two alarms for an Auto Scaling group: one for high CPU utilization and one for low CPU utilization. During normal use, both alarms are in the OK state, indicating that you have adequate capacity for the current workload. If the workload increases, the high CPU utilization threshold may be breached, sending the corresponding alarm into the ALARM state. With an Auto Scaling group, this state change can trigger a scale-out event, adding capacity to your infrastructure.

Using Amazon CloudWatch Alarms

When you create an alarm, you specify a threshold along with three settings that determine when the alarm should change state: the period, the evaluation period, and the number of data points to alarm, as described in Table 15.14.

Table 15.14 Alarm Settings

Setting Description
Period The length of time (in seconds) to evaluate the metric or expression to create each individual data point for an alarm. If you choose one minute as the period, there is one data point every minute.
Evaluation Period The number of the most recent periods, or data points, to evaluate when determining alarm state.
Data Points to Alarm The number of data points within the evaluation period that must breach the specified threshold to cause the alarm to go to the ALARM state. These data points do not have to be consecutive.
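These three settings map directly onto parameters of the CloudWatch `PutMetricAlarm` API. A minimal sketch (the alarm name and SNS topic ARN are hypothetical; with boto3 these keyword arguments would go to `put_metric_alarm`):

```python
# Sketch: the alarm settings from Table 15.14 expressed as PutMetricAlarm
# parameters. The alarm name and topic ARN are hypothetical; with boto3:
# boto3.client("cloudwatch").put_metric_alarm(**alarm)
alarm = {
    "AlarmName": "HighCPUUtilization",
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Statistic": "Average",
    "Period": 60,                # seconds per data point
    "EvaluationPeriods": 3,      # evaluate the 3 most recent data points
    "DatapointsToAlarm": 3,      # how many of those must breach the threshold
    "Threshold": 80.0,           # percent CPU
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    "AlarmActions": ["arn:aws:sns:us-west-2:123456789012:ops-alerts"],  # hypothetical
}
```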

Figure 15.3 illustrates how an alarm works based on configuration settings.

The figure shows a screenshot illustrating how an alarm works based on configuration settings.

Figure 15.3 Alarm evaluation

The figure illustrates a threshold configured to the value 3 (in blue), an evaluation period of 3, and data points in red. Notice how these settings drive the alarm state. Although the data points first breach the threshold in the third period, the breach is not sustained for the required three periods, so the alarm stays in the OK state. Only after the fifth period does the alarm change to the ALARM state, because the upper threshold has then been breached for three periods. Between the fifth and sixth periods, the data points drop below the threshold; however, because they have not stayed below the threshold for three periods, the alarm does not return to the OK state until the eighth period. It remains in the OK state past the ninth period because three periods exceeding the threshold are necessary before the alarm state changes again.
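The evaluation behavior in Figure 15.3 can be modeled in a few lines of code. This is a simplified model of an "m out of n data points" alarm, not the CloudWatch implementation (it ignores missing data, so the INSUFFICIENT_DATA state never occurs):

```python
def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return ALARM if at least `datapoints_to_alarm` of the most recent
    `evaluation_periods` data points breach `threshold`, else OK.
    Simplified model only: missing data is not handled."""
    window = datapoints[-evaluation_periods:]
    breaches = sum(1 for value in window if value > threshold)
    return "ALARM" if breaches >= datapoints_to_alarm else "OK"

# With threshold 3 and 3-of-3 data points required, a brief spike does
# not trigger the alarm, but a sustained breach does:
print(alarm_state([1, 2, 4, 2, 1], threshold=3,
                  evaluation_periods=3, datapoints_to_alarm=3))  # OK
print(alarm_state([2, 4, 5, 4, 4], threshold=3,
                  evaluation_periods=3, datapoints_to_alarm=3))  # ALARM
```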

Alarms can trigger Amazon EC2 actions and EC2 Auto Scaling actions. CloudWatch can leverage Amazon SNS or Amazon SQS for alarm state notifications, both of which provide numerous integrations with other AWS services.

Symbol of Tip Exercise caution when creating email notifications for alarms in your environment. This can lead to many unnecessary emails to you or your team. Ultimately, these notifications get filtered as spam or result in “notification fatigue.” Evaluate your alarms and the metrics you are monitoring to determine whether notifications are necessary. If they are only status updates, set notifications sparingly.

Amazon CloudWatch Dashboards

CloudWatch offers a convenient way to observe operational metrics for all of your applications. CloudWatch dashboards are customizable pages in the CloudWatch console that you can use to monitor resources in a single view (see Figure 15.4).

The figure shows a screenshot illustrating the Amazon CloudWatch dashboard.

Figure 15.4 Amazon CloudWatch dashboard

CloudWatch dashboards provide customizable status pages in the CloudWatch console. These status pages can be used to monitor resources across multiple regions and on-premises in a consolidated view using widgets. Each widget can be customized to present information in CloudWatch in a user-friendly way so that educated decisions can be made based on the current status of your system.

AWS CloudTrail

All actions in your AWS account are composed of API calls, regardless of the origin (the AWS Management Console or programmatic/scripted actions). As you create resources in your account, API calls are being made to AWS services in different regions around the world. AWS CloudTrail is a fully managed service that continuously monitors and records API calls and stores them in Amazon S3. You can use these logs to troubleshoot and resolve operational issues, meet and verify regulatory compliance, and monitor or alarm on specific events in your account. CloudTrail supports most AWS services, making it easy for IT and security administrators to analyze activity in accounts. IT auditors can also use log files as compliance aids.

CloudTrail helps answer the following five key questions about monitoring access:

  • Who made the API call?
  • When was the API call made?
  • What was the API call?
  • Which resources were acted upon in the API call?
  • Where was the origin of the API call?
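These questions translate into a query against CloudTrail's `LookupEvents` API, which searches the default 90-day event history. A sketch of the query parameters (the user name is hypothetical; with boto3, these keyword arguments would go to `boto3.client("cloudtrail").lookup_events(**query)`):

```python
from datetime import datetime, timedelta

# Sketch: LookupEvents parameters that answer "who" (Username filter)
# and "when" (time range). The user name is hypothetical.
query = {
    "LookupAttributes": [
        {"AttributeKey": "Username", "AttributeValue": "Alice"},  # who
    ],
    "StartTime": datetime.utcnow() - timedelta(days=7),           # when (from)
    "EndTime": datetime.utcnow(),                                 # when (to)
}
```

Each returned event then answers the remaining questions: the event name (what), the resources referenced (which), and the source IP address (where).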

AWS CloudTrail Events

A CloudTrail event is any single API activity in an AWS account. This activity can be an action triggered by any of the following:

  • AWS IAM user
  • AWS IAM role
  • AWS service

CloudTrail tracks two types of events: management events and data events. Events are recorded in the region where the action occurred, except for global service events.

Management Events

Management events give insight into operations performed on AWS resources, such as the following examples:

  • Configuring security: An example is attaching a policy to an IAM role.
  • Configuring routing rules: An example is adding inbound security group rules.

Data Events

Data events give insight into operations that store data in (or extract data from) AWS resources, such as the following examples:

  • Amazon S3 object activity: Examples are GetObject and PutObject operations.
  • AWS Lambda function executions: These use the InvokeFunction operation.

By default, CloudTrail tracks the last 90 days of API history for management events. The following is example output for a CloudTrail event:

{
   "Records": [{
      "eventVersion": "1.01",
      "userIdentity": {
         "type": "IAMUser",
         "principalId": "AIDAJDPLRKLG7UEXAMPLE",
         "arn": "arn:aws:iam::123456789012:user/Alice",
         "accountId": "123456789012",
         "accessKeyId": "AKIAIOSFODNN7EXAMPLE",
         "userName": "Alice",
         "sessionContext": {
            "attributes": {
               "mfaAuthenticated": "false",
               "creationDate": "2014-03-18T14:29:23Z"
            }
         }
      },
      "eventTime": "2014-03-18T14:30:07Z",
      "eventSource": "cloudtrail.amazonaws.com",
      "eventName": "StartLogging",
      "awsRegion": "us-west-2",
      "sourceIPAddress": "198.162.198.64",
      "userAgent": "signin.amazonaws.com",
      "requestParameters": {
         "name": "Default"
      },
      "responseElements": null,
      "requestID": "cdc73f9d-aea9-11e3-9d5a-835b769c0d9c",
      "eventID": "3074414d-c626-42aa-984b-68ff152d6ab7"
   },
    ... additional entries ...
   ]
}

This event provides the following information:

  • The user who made the request, from the userIdentity field. In this example, it is the IAM user Alice.
  • When the request was made (the eventTime). In this case, it is 2014-03-18T14:30:07Z.
  • Where the request was made (the sourceIPAddress). In this case, it is 198.162.198.64.
  • The action the request is trying to perform (the eventName). In this case, it is the StartLogging operation.

As a security precaution, you can use events such as this one to configure alerts, for example, when an IAM user attempts to sign in to the AWS Management Console too many times.

Global Service Events

Some AWS services allow you to create, modify, and delete resources from any region. These are referred to as global services. Examples of global services include the following:

  • IAM
  • AWS Security Token Service (AWS STS)
  • Amazon CloudFront
  • Amazon Route 53

Global services are logged as occurring in the US East (N. Virginia) Region. Any trail created in the CloudTrail console logs global service events by default, and these events are delivered to the Amazon S3 bucket for the trail.

Trails

If you need long-term storage of events (for example, for compliance purposes), you can configure a trail of events as log files in CloudTrail. A trail is a configuration that enables delivery of CloudTrail events to an Amazon S3 bucket, Amazon CloudWatch Logs, and Amazon CloudWatch Events. When you configure a trail, you can filter the events that you want to be delivered.
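A trail configuration can be sketched as follows (the trail and bucket names are hypothetical; with boto3 these keyword arguments would go to `boto3.client("cloudtrail").create_trail(**trail)`, followed by `start_logging(Name=trail["Name"])` to begin delivery):

```python
# Sketch: a trail delivering CloudTrail events to an S3 bucket for
# long-term storage. Trail and bucket names are hypothetical.
trail = {
    "Name": "compliance-trail",
    "S3BucketName": "my-cloudtrail-archive",  # hypothetical bucket
    "IsMultiRegionTrail": True,               # capture events from all regions
    "IncludeGlobalServiceEvents": True,       # IAM, STS, CloudFront, Route 53
}
```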

AWS X-Ray

The services covered so far are centered on the concept of using logs as monitoring and troubleshooting tools. Developers often write code, test the code, and inspect the logs. If there are errors, they may add breakpoints, run the test again, and add log statements. This works well in small cases, but it becomes cumbersome as teams, software, and infrastructure grow. Traditional troubleshooting and debugging processes do not scale well across multiple services. Troubleshooting cross-service and cross-region interactions can be especially difficult when different systems use varying log formats.

AWS X-Ray is a service that collects data about requests served by your application. It provides tools you can use to view, filter, and gain insights into that data to identify issues and opportunities for optimization.

AWS X-Ray Use Cases

X-Ray helps developers build, monitor, and improve applications. Use cases include the following:

  • Identifying performance bottlenecks
  • Pinpointing specific service issues
  • Identifying errors
  • Identifying impact to users

X-Ray integrates with the AWS SDK, adding traces to track your application requests as they are generated and received from various services.

Tracking Application Requests

To understand better how X-Ray works, consider the example service shown in Figure 15.5. In this service, a front-end fleet relies on a backend API built with Amazon API Gateway, which acts as a proxy to AWS Lambda. The Lambda function then uses Amazon DynamoDB to store data.

The figure shows an example of microservice.

Figure 15.5 Microservice example

X-Ray can track a user request using a trace, segment, and subsegment.

Trace A trace is the path of a request through your application. This is the end-to-end request from the client—from its entry into your environment to the backend and back to the user. A trace ID is passed through the AWS services with the request so that X-Ray can collate related segments.

Segment A segment is data from a particular service. When a segment is reported to X-Ray, a trace ID is reported. Segments are analogous to links in a chain whereby the chain is the request generated by the user. In the example microservice, two segments correspond to two services: the front-end service and the backend API.

Subsegment A subsegment identifies the underlying API calls made from a particular service. Subsegments are collated into segments. In this scenario, the backend API sends requests to Amazon DynamoDB and Amazon SQS.
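The trace ID that ties traces, segments, and subsegments together has a published format: a version number, the request start time in epoch seconds (hexadecimal), and a random 96-bit identifier. A sketch of generating one (this illustrates the ID format only; in practice the X-Ray SDK creates and propagates trace IDs for you):

```python
import re
import secrets
import time

def new_trace_id():
    """Build an ID in the X-Ray trace ID format:
    1-<8 hex chars: epoch seconds>-<24 hex chars: random>."""
    epoch_hex = format(int(time.time()), "08x")  # start time, hex-encoded
    random_hex = secrets.token_hex(12)           # 96 random bits
    return f"1-{epoch_hex}-{random_hex}"

trace_id = new_trace_id()
print(trace_id)
assert re.fullmatch(r"1-[0-9a-f]{8}-[0-9a-f]{24}", trace_id)
```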

From these components of a request, X-Ray compiles the traces into a service graph that describes the components and their interactions needed to complete a request. A service graph is a visual representation of the services and resources that make up your application. Figure 15.6 shows an example of a service graph.

The figure shows an example of a service graph.

Figure 15.6 Example service graph for an application

The service graph provides an overview of the health of various aspects of your system, such as average latencies and request rates between your services and dependent resources. The colored circles also show the ratio of different response codes, as listed in Table 15.15.

Table 15.15 AWS X-Ray Service Graph Status Codes

Color Status Code
Purple Throttling errors or HTTP 429 codes
Orange Client-side errors or HTTP 4XX codes
Red Faults (server-side errors) or HTTP 5XX codes
Green OK or HTTP 2XX codes

X-Ray provides a convenient way for you to view system performance and to identify problems or bottlenecks in your applications. However, it does not provide auditing capabilities or the tracking of all requests to a system. X-Ray collects a statistically significant number of requests to a system so that meaningful insights can be provided. These insights enable you to focus troubleshooting on a particular service or to improve a specific component of your application.
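This sampling behavior can be sketched as a reservoir plus a fixed rate: the first request in each one-second window is always traced, and a fraction of the remainder is traced at random. This is a simplified model for illustration, not the X-Ray SDK's implementation:

```python
import random

class Sampler:
    """Simplified reservoir-plus-rate sampler: trace the first request
    in each one-second window, then trace `rate` of the rest."""
    def __init__(self, rate=0.05):
        self.rate = rate
        self.last_window = None

    def should_sample(self, now):
        window = int(now)
        if window != self.last_window:       # reservoir: first hit this second
            self.last_window = window
            return True
        return random.random() < self.rate   # fixed rate for the remainder

sampler = Sampler(rate=0.0)
print(sampler.should_sample(100.0))  # True  (reservoir not yet used)
print(sampler.should_sample(100.5))  # False (rate 0, reservoir spent)
```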

Summary

AWS provides multiple options for monitoring and troubleshooting your applications. As you have discovered, AWS services help you manage logs from various systems, either running on the cloud or on-premises, create triggers that notify you about application health and issues in your infrastructure, and build applications with modern debugging tools for distributed applications. These services overcome the difficulties of creating a centralized logging solution.

Exam Essentials

Know what Amazon CloudWatch is and why it is used. CloudWatch is the service used to aggregate, analyze, and alert on metrics generated by other AWS services. It is used to monitor the resources you create in AWS and the on-premises infrastructure. You can use CloudWatch to store logs from your applications and trigger actions in response to events.

Know what common metrics are available for Amazon Elastic Compute Cloud (Amazon EC2) in Amazon CloudWatch. Amazon EC2 metrics in CloudWatch include the following:

  • CPUUtilization
  • DiskReadOps
  • DiskReadBytes
  • DiskWriteOps
  • DiskWriteBytes
  • NetworkIn
  • NetworkOut
  • StatusCheckFailed

Amazon EC2 does not report OS-level metrics such as memory utilization.

Understand the difference between high-resolution and standard-resolution metrics. High-resolution metrics are delivered in a period of less than one minute. Standard-resolution metrics are delivered in a period greater than or equal to one minute.
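The resolution of a custom metric is set when you publish it. A sketch of the relevant parameter (the namespace and metric name are hypothetical; with boto3 these keyword arguments would go to `boto3.client("cloudwatch").put_metric_data(**data)`):

```python
# Sketch: publishing a high-resolution custom metric. Namespace and
# metric name are hypothetical.
data = {
    "Namespace": "MyApp",
    "MetricData": [{
        "MetricName": "RequestLatency",
        "Value": 42.0,
        "Unit": "Milliseconds",
        "StorageResolution": 1,  # 1 = high resolution; 60 = standard
    }],
}
```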

Know what AWS CloudTrail is and why it is used. CloudTrail is used to monitor API calls made to the AWS Cloud for various services. CloudTrail helps IT administrators, IT security administrators, DevOps engineers, and auditors to enable compliance and the monitoring of access to AWS resources within an account.

Know what AWS CloudTrail tracks automatically. By default, CloudTrail tracks the last 90 days of activity. These events are limited to management events with create, modify, and delete API calls.

Understand the difference between AWS CloudTrail management and data events. Management events are operations performed on resources in your AWS account. Data events are operations performed on data stored in AWS resources. Examples are creating or deleting objects in Amazon S3 and inserting or updating items in an Amazon DynamoDB table.

Know what AWS X-Ray is and why it is used. X-Ray is a service that collects data about your application requests, including the various subservices or systems that perform tasks to complete a request. X-Ray is commonly used to help developers find bottlenecks in distributed applications and monitor the health of various components in their services.

Know the basics of AWS X-Ray and how it helps troubleshoot applications. X-Ray records requests by initiating a trace ID with the origin of the request. This trace ID is added as a header to the request that propagates to various services. If you enable the X-Ray SDK in your applications, X-Ray submits telemetry and the request as segments for each service and subsegments for downstream services upon which you depend. Using these traces, X-Ray collates the data to view request performance metrics, such as latency and error rates. The data can then be used to create a graph of your application and its dependencies and the health of any requests your application might make.

Resources to Review

Exercises

Exercise 15.1

Create an Amazon CloudWatch Alarm on an Amazon S3 Bucket

It is common to monitor the storage usage of your Amazon S3 buckets and trigger notifications when there is a large increase in storage used. In this exercise, you will use the Amazon CloudWatch console to configure an alarm that triggers a notification when more than 1 KB of data is stored in an Amazon S3 bucket.

If you need directions while completing this exercise, see “Using Amazon CloudWatch Alarms” here:

https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html

  1. Create an Amazon S3 bucket in your AWS account. For instructions, see this page:

    https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-bucket.html

  2. Open the Amazon CloudWatch console at https://console.aws.amazon.com/cloudwatch/.
  3. Select Alarms ➢ Create Alarm.
  4. Choose Select Metric.
    1. Select the All Metrics tab.
    2. Expand AWS Namespaces.
    3. Select S3.
    4. Select Storage Metrics.
    5. Select a metric where BucketName matches the name of the Amazon S3 bucket that you created and where Metric Name is BucketSizeBytes.
  5. Choose Select Metric.
  6. Under Alarm Details:
    1. For Name, enter S3 Storage Alarm.
    2. For the comparator, select >= (greater than or equal to).
    3. Set the value to 1000 for 1 KB.
  7. Under Actions:
    1. For Whenever This Alarm, select State Is ALARM.
    2. For Send Notification To, select New List.
    3. For Name, enter My S3 Alarm List.
    4. For Email List, enter your email address.
  8. Choose Create Alarm.

The alarm is created in your account. If you already have data in your Amazon S3 bucket, the alarm switches from the INSUFFICIENT_DATA state to the ALARM state. Otherwise, try uploading several files to your bucket to observe changes in the alarm state.

To delete the alarm, follow these steps:

  1. Open the Amazon CloudWatch console at https://console.aws.amazon.com/cloudwatch/.
  2. Select Alarms.
  3. Select the alarm you want to delete.
  4. For Actions, select Delete.

In this exercise, you created an Amazon CloudWatch alarm to notify administrators when large files are uploaded to Amazon S3 buckets in your account.

Exercise 15.2

Enable an AWS CloudTrail Trail on an Amazon S3 Bucket

In this exercise, you will set up logging for an Amazon S3 bucket in your account to monitor activity.

  1. Create an Amazon S3 bucket in your AWS account.

    For instructions on how to do so, see the following:

    https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-bucket.html

  2. Open the AWS CloudTrail console at https://console.aws.amazon.com/cloudtrail/.
  3. Select Create Trail.
  4. Set Trail name to s3_logs.
  5. Under Management Events, select None.
  6. Under Data Events, select Add S3 Bucket.
  7. For S3 bucket, enter your Amazon S3 bucket name.
  8. Under Storage Location, for Create A New S3 bucket, select Yes.
  9. For Name, enter a name for your Amazon S3 bucket.
  10. Choose Create.

In this exercise, you enabled AWS CloudTrail to record data events and store corresponding logs to an Amazon S3 bucket.

Exercise 15.3

Create an Amazon CloudWatch Dashboard

In this exercise, you will create an Amazon CloudWatch dashboard to see graphed metric data.

  1. Open the Amazon CloudWatch console at https://console.aws.amazon.com/cloudwatch/.
  2. In the navigation pane, select Dashboards.
  3. Choose Create Dashboard.
  4. For Dashboard Name, enter a name for your dashboard.
  5. Select Create Dashboard.
  6. In the modal window, select the Line graph.
  7. Choose Configure.
  8. From the available metrics, select one or more metrics that you want to monitor.
  9. Choose Create Widget.
  10. To add more widgets, choose Add Widget and repeat steps 6 through 9 for other widget types.
  11. Choose Save Dashboard.

In this exercise, you created an Amazon CloudWatch dashboard to create graphs of important metric data for resources in your account.

Review Questions

  1. You are required to set up dynamic scaling using Amazon CloudWatch alarms.

    Which of the following metrics could you monitor to trigger Auto Scaling events to scale out and scale in your instances?

    1. High CPU utilization to trigger scale-in action, and low CPU utilization to trigger scale-out action
    2. High CPU utilization to trigger scale-out action, and low CPU utilization to trigger scale-in action
    3. High latency to trigger a scale-in action, and low latency to trigger a scale-out action
    4. None of the above
  2. What is the length of time that metrics are stored for a data point with a period of 300 seconds (5 minutes) in Amazon CloudWatch?

    1. The data point is stored for 3 hours.
    2. The data point is stored for 15 days.
    3. The data point is stored for 30 days.
    4. The data point is stored for 63 days.
    5. The data point is stored for 455 days (15 months).
  3. Which of the following does an AWS CloudTrail event not provide?

    1. Who made the request
    2. When the request was made
    3. What request is being made
    4. Why the request was made
    5. Which resource was acted on
  4. You must set up centralized logging for an application and create a cost-effective way to archive logs for compliance purposes.

    • Which is the best solution?

    1. Install the Amazon CloudWatch agent on your servers to ingest the logs and store them indefinitely.
    2. Configure Amazon CloudWatch to ingest logs from your application servers.
    3. Install the Amazon CloudWatch agent on your servers to ingest the logs and set a new retention period for logs with regular exports to Amazon S3 for archival.
    4. None of the above.
  5. Which of the following options allow logs and metrics to be ingested into Amazon CloudWatch? (Select THREE.)

    1. Install the Amazon CloudWatch agent and configure it to ingest logs.
    2. Execute API operations to push metrics to Amazon CloudWatch.
    3. Configure Amazon CloudWatch to pull logs from servers.
    4. Use the AWS CLI to push metrics to Amazon CloudWatch.
  6. The following are Apache HTTP access logs.

    • Which filter pattern would select events matching 404 errors?
    • 127.0.0.1 - - [24/Sep/2013:11:49:52 -0700] "GET /index.html HTTP/1.1" 404 287
    • 127.0.0.1 - - [24/Sep/2013:11:49:52 -0700] "GET /index.html HTTP/1.1" 404 287
    • 127.0.0.1 - - [24/Sep/2013:11:50:51 -0700] "GET /~test/ HTTP/1.1" 200 3
    • 127.0.0.1 - - [24/Sep/2013:11:50:51 -0700] "GET /favicon.ico HTTP/1.1" 404 308
    • 127.0.0.1 - - [24/Sep/2013:11:50:51 -0700] "GET /favicon.ico HTTP/1.1" 404 308
    • 127.0.0.1 - - [24/Sep/2013:11:51:34 -0700] "GET /~test/index.html HTTP/1.1" 200 3

    1. 4xx
    2. 400
    3. 404
    4. None of the above
  7. You build an application and enable AWS X-Ray tracing. You analyze the service graph and determine that the application requests to Amazon DynamoDB are not performing well and a majority of the issues are purple.

    What kind of problem is your application experiencing?

    1. Throttling
    2. Error
    3. Faults
    4. OK
  8. Which AWS service enables you to monitor resources and gather statistics, such as CPU utilization, from a single “pane of glass” interface?

    1. AWS CloudTrail logs
    2. Amazon CloudWatch alarms
    3. Amazon CloudWatch dashboards
    4. Amazon CloudWatch Logs
  9. By default, what is the number of days of AWS account activity that you can view, search, and download from the AWS CloudTrail event history?

    1. 30 days
    2. 60 days
    3. 75 days
    4. 90 days
  10. Which of the following is not able to access AWS CloudTrail data?

    1. AWS CLI
    2. AWS Management Console
    3. AWS CloudTrail API
    4. None of the above
  11. In AWS CloudTrail, which of the following are management events? (Select TWO.)

    1. Adding a row to an Amazon DynamoDB table
    2. Modifying an Amazon S3 bucket policy
    3. Uploading an object to an Amazon S3 bucket
    4. Creating an Amazon Relational Database Service (Amazon RDS) database instance
    5. Sending a notification to Amazon Simple Notification Service (Amazon SNS)
  12. Suppose that you have a custom web application running on an Amazon Elastic Compute Cloud (Amazon EC2) instance.

    • What steps are needed to configure this instance to send custom application logs to Amazon CloudWatch Logs? (Select THREE.)

    1. Install the Amazon CloudWatch Logs agent.
    2. Attach an Elastic IP address to your Amazon EC2 instance.
    3. Configure the agent to send specific logs.
    4. Start the agent.
    5. Install the AWS Systems Manager agent.
  13. Which of the following are not supported Amazon CloudWatch alarm actions?

    1. AWS Lambda functions
    2. Amazon Simple Notification Service (Amazon SNS) topics
    3. Amazon Elastic Compute Cloud (Amazon EC2) actions
    4. EC2 Auto Scaling actions
  14. Which of the following Amazon Elastic Compute Cloud (Amazon EC2) metrics is not directly available through Amazon CloudWatch metrics?

    1. CPU utilization
    2. Network traffic in/out
    3. Disk I/O
    4. Memory (RAM) utilization
  15. Which of the following is the correct Amazon CloudWatch metric namespace for Amazon Elastic Compute Cloud (Amazon EC2) instances?

    1. AWS/EC2
    2. Amazon/EC2
    3. AWS/EC2Instance
    4. Amazon/EC2Instance