THE AWS CERTIFIED DEVELOPER – ASSOCIATE EXAM TOPICS COVERED IN THIS CHAPTER MAY INCLUDE, BUT ARE NOT LIMITED TO, THE FOLLOWING:
Content may include the following:
Monitoring the applications and services you build is vital to the success of any information technology (IT) organization. With the AWS Cloud, you can leverage monitoring resources to drive business decisions such as what resources to create, improve, optimize, and secure.
Traditional approaches to monitoring do not scale for cloud architectures. Large systems can be difficult to set up, configure, and scale. These efforts are compounded by the trend away from monolithic installations toward service-oriented architecture (SOA), microservices, and serverless architectures. Monitoring modern IT systems is correspondingly difficult. When working on a monolithic application, you can add logging statements and troubleshoot with breakpoints. However, applications today are spread across multiple systems over large networks, which makes it difficult to track the health of systems and react to issues. For example, using logging statements to monitor execution time and error rates of AWS Lambda functions can become difficult as your infrastructure grows and spreads across multiple AWS Regions.
The AWS Cloud provides fully managed services to help you implement monitoring solutions that are reliable, scalable, and secure. AWS offers services to help you monitor, log, and analyze your applications and infrastructure. In this section, you explore Amazon CloudWatch, AWS CloudTrail, and AWS X-Ray. Figure 15.1 shows the AWS monitoring services available.
Before you explore these services, consider why they are essential. As a developer, you are designing systems to provide IT or business solutions to a customer. Success is measured by the effective application of software to business objectives. What are some of the metrics that you must track over time to ensure that these objectives are being met?
AWS takes the approach of “working backward” from the customer. You can accomplish this by starting with the customer and tracing the underlying components that affect the customer’s experience. This provides a foundation for identifying which metrics to monitor, as they correlate directly to the customer experience. Frequently, the top characteristics that directly affect the customer experience are performance and cost. Changes to either have a direct impact on how customers perceive the software they use.
Deciding which metrics to monitor requires that you answer several crucial questions.
Question: Are my customers having a good experience with the services or systems that I provide?
The phrase good experience can be broken down into measurable metrics, such as request latency, time to first byte, error rates, and more. Metrics, such as instance CPU utilization or network bytes in/out, however, may not be representative of the customer experience.
It is good practice to measure any metric that directly affects customers using your software or system.
Question: What is the overall cost of my system?
Increases in performance often correlate directly with increases in cost. With unlimited money, it would be easy to design a system that scales infinitely in response to customer usage. In reality, you must measure the performance of your system to determine what is acceptable based on usage at any point in time. This is where metrics that are not customer-facing often take precedence.
Question: How can I use monitoring to predict changes in customer demand?
With the agility and elasticity of the AWS Cloud, this can be especially useful. Monitoring and measuring customer demand over time allows you to scale your infrastructure predictively to meet changes in customer demand without having to purchase more resources than are necessary. For example, suppose that you have a web application that runs on three Amazon Elastic Compute Cloud (Amazon EC2) instances during the day. In the evenings, demand increases significantly for several hours before decreasing again late at night. On weekends, your application sees almost no traffic. With historical information obtained through monitoring, you can design your application to scale out across more Amazon EC2 instances during the evenings and scale in on the weekends when there is little demand. Predictive scaling occurs before customer demand changes, ensuring a smooth experience while new resources are created and brought online.
Question: Where do problems occur?
As Werner Vogels, VP and CTO of Amazon, once said, “Everything fails all the time.” No system is impervious to failure. By gathering potentially relevant information ahead of time, it becomes easier to determine causes of failure. By collecting this information, you can improve key operational metrics such as mean time between failures (MTBF) and mean time to resolution (MTTR).
Question: Can you detect or prevent problems in the future?
By evaluating operational metrics over time, you can reveal patterns and common issues in your systems.
When choosing metrics, align them closely to your business processes to provide a better customer experience. For example, suppose that you have an application running in AWS Elastic Beanstalk. Unknown to you, the application has a memory leak. Without tracking memory utilization over time, you will not have insight into why customers are experiencing degraded performance. If your Elastic Beanstalk environment is configured to scale out based on CPU utilization, it is possible that no new instances are launched to serve customer requests. In this case, the memory leak prevents new requests from being processed, causing a drop in CPU utilization. Without comprehensive tracking of system performance, issues such as this can go unnoticed until system-wide outages occur.
These factors impact what is referred to as the health of your systems. As a developer and contributor, you are not only responsible for the code that you develop but also for the operational health of these services. It is vital to align operational and health metrics properly with customer expectations and experiences.
Amazon CloudWatch is a monitoring and metrics service that provides you with a fully managed system to collect, store, and analyze your metrics and logs. By using CloudWatch, you can create notifications on changes in your environment.
Typical use cases include the following:
CloudWatch enables you to collect and store monitoring and operations data from logs, metrics, and events that run on AWS and on-premises resources. To ensure that your applications run smoothly, you can use CloudWatch to perform the following tasks:
CloudWatch acts as a metrics repository, storing metrics and logs from various sources. These metrics can come from AWS resources using built-in or custom metrics. Figure 15.2 illustrates the role of CloudWatch in operational health.
CloudWatch can process these metrics into statistics that are made available through the CloudWatch console, AWS APIs, the AWS Command Line Interface (AWS CLI), and AWS software development kits (AWS SDKs). Using CloudWatch, you can display graphs, create alarms, or integrate with third-party solutions.
To understand CloudWatch better, especially how data is collected and organized, review the following terms.
A metric is a set of time-series data points that you publish to CloudWatch. For example, a commonly monitored metric for Amazon EC2 instances is CPU utilization. Data points can come from multiple systems, both AWS and on-premises. You can also define custom metrics based on data specific to your system. A metric is identified uniquely by a namespace, a name, and zero or more dimensions.
A namespace is a container of related metrics; for example, the namespaces used by AWS services all begin with AWS/. Amazon EC2 uses the AWS/EC2 namespace. As a developer, you can create namespaces for different components of your applications, such as front-end, backend, and database components.
A name for a given metric defines the attribute or property that you are monitoring; for example, CPUUtilization in the AWS/EC2 namespace. The AWS/EC2 namespace contains various metrics that are important to monitoring the health of Amazon EC2 resources, such as CPU utilization, disk I/O, network I/O, and status checks. You can also create custom metrics for attributes such as request latency, HTTP 400/500 response codes, and throttling.
A dimension is a name/value pair used to define a metric uniquely. For example, for the namespace AWS/EC2 and name/metric CPUUtilization, the dimension might be InstanceId. For a fleet of Amazon EC2 instances, you can measure CPUUtilization as one metric for multiple dimensions (one for each instance). You can use the dimensions to structure and organize the data points you gather.
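As a sketch of the identity rules above (not the CloudWatch API itself), a metric can be modeled as the combination of its namespace, name, and dimension name/value pairs; the same metric name with different dimension values yields distinct metrics:

```python
# Illustrative sketch: a metric is uniquely identified by its namespace,
# name, and dimension name/value pairs.
def metric_id(namespace, name, dimensions):
    # Dimensions are order-independent, so freeze them into a set.
    return (namespace, name, frozenset(dimensions.items()))

# CPUUtilization for two different instances yields two distinct metrics.
m1 = metric_id("AWS/EC2", "CPUUtilization", {"InstanceId": "i-0abc"})
m2 = metric_id("AWS/EC2", "CPUUtilization", {"InstanceId": "i-0def"})
assert m1 != m2
```

This is why a fleet of instances reporting CPUUtilization produces one metric per instance, which CloudWatch can then aggregate across the InstanceId dimension.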
When creating metrics, consider defining namespaces that align with your different services, along with metrics that describe the health of each service. For example, if you have a front-end web fleet running NGINX servers, metrics such as requests per second, response time, active connections, and response codes could help you determine which configuration changes would optimize system performance.
When data is published to CloudWatch, it is pushed in sets of data points. Each data point contains information such as the timestamp, value, and unit of measurement.
Timestamps are dateTime objects with the complete date and time; for example, 2016-10-31T23:59:59Z. Although not required, AWS recommends formatting times as Coordinated Universal Time (UTC).
The value is the measurement for the data point.
A unit of measurement is used to label your data. This offers a better understanding of what the value represents. Example units include Bytes, Seconds, Count, and Percent. If you do not specify a unit in CloudWatch, your data point units are designated as None.
CloudWatch stores this data based on the retention period, which is the length of time to keep data points available. Data points are stored in CloudWatch based on how often the data points are published.
From these data points, CloudWatch can calculate statistics to provide you with insight into your application, service, or environment. In the next section, you will discover how CloudWatch calculates and organizes these statistics.
CloudWatch provides statistics based on the metric data provided to the service. Statistics are aggregations of data points over specified periods of time for specified metrics. A period is a length of time, in seconds. Periods can be defined in values of 1, 5, 10, 30, or any multiple of 60 seconds (up to 86,400 seconds, or 1 day). The available statistics in CloudWatch include SampleCount, Sum, Minimum, Maximum, Average, and percentiles.
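The aggregation itself is simple to picture. The following sketch (illustrative, not CloudWatch internals) computes the standard statistics for the raw values that fall within a single period:

```python
# Illustrative sketch: aggregate the raw data-point values within one
# period into CloudWatch-style statistics.
def period_statistics(values):
    return {
        "SampleCount": float(len(values)),
        "Sum": float(sum(values)),
        "Minimum": float(min(values)),
        "Maximum": float(max(values)),
        "Average": sum(values) / len(values),
    }

# Three samples published within one 60-second period.
stats = period_statistics([2, 4, 5])
assert stats["Sum"] == 11.0 and stats["SampleCount"] == 3.0
```

These are the same values that appear in the get-metric-statistics output later in this chapter.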
Statistics can be used to gain insight into the health of your application and to help you determine the correct settings for various configurations. For example, you may want to implement automatic scaling on your fleet of Amazon EC2 instances in order to avoid having to launch and terminate instances manually. To do so, you must configure an Auto Scaling group. Configuration settings for an Auto Scaling group include the minimum, desired, and maximum number of instances to run in your account. By monitoring statistics over time, you can determine the minimum and maximum number of instances needed to support the average, minimum, and maximum workload.
CloudWatch statistics provide a powerful way to process large amounts of metrics at scale and present insightful data that is easy to consume. Now that you understand how CloudWatch metrics work and are organized, explore the metrics available.
CloudWatch aggregates metrics according to the period of time that you specify when retrieving statistics. When you request statistics, you can also have CloudWatch filter the data points based on the dimensions of the metric. For example, Amazon DynamoDB metrics are aggregated across all DynamoDB operations by default. You can filter on the Operation dimension to exclude specific operations, such as GetItem requests. CloudWatch does not aggregate data across Regions.
Table 15.1 describes the available metrics for Elastic Load Balancing resources. To discover all of the available metrics, refer to the AWS documentation.
Table 15.1 Elastic Load Balancing Metrics
Namespace | AWS/ELB, AWS/ApplicationELB, AWS/NetworkELB |
Dimensions | LoadBalancerName: name of the load balancer |
Key metrics |
|
Table 15.2 describes the available Amazon EC2 metrics.
Table 15.2 Amazon EC2 Metrics
Namespace | AWS/EC2 |
Dimensions | InstanceId: identifier of a particular Amazon EC2 instance; InstanceType: type of Amazon EC2 instance, such as t2.micro or m4.large |
Key metrics |
|
Amazon EC2 does not report memory utilization to CloudWatch. This is because memory is allocated in full to an instance by the underlying host. Memory consumption is visible only to the guest operating system (OS) of the instance. However, you can report memory utilization to CloudWatch using the CloudWatch agent.
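As an illustration, a CloudWatch agent configuration fragment along the following lines collects memory utilization on a Linux instance and publishes it to CloudWatch (a sketch; verify the keys against the agent documentation for your platform and agent version):

```json
{
  "metrics": {
    "namespace": "CWAgent",
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"],
        "metrics_collection_interval": 60
      }
    }
  }
}
```

With this in place, memory utilization appears as a custom metric that you can graph and alarm on like any built-in metric.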
Table 15.3 describes the AWS Auto Scaling group metrics.
Table 15.3 AWS Auto Scaling Groups
Namespace | AWS/AutoScaling |
Dimensions | AutoScalingGroupName: name of the Auto Scaling group |
Key metrics |
|
Table 15.4 describes the Amazon Simple Storage Service (Amazon S3) metrics.
Table 15.4 Amazon S3 Metrics
Namespace | AWS/S3 |
Dimensions |
|
Key metrics |
|
Table 15.5 describes the Amazon DynamoDB metrics.
Table 15.5 Amazon DynamoDB Metrics
Namespace | AWS/DynamoDB |
Dimensions |
|
Key metrics |
|
Table 15.6 describes the Amazon API Gateway metrics.
Table 15.6 Amazon API Gateway Metrics
Namespace | AWS/ApiGateway |
Dimensions |
|
Key metrics |
|
Table 15.7 describes the AWS Lambda metrics.
Table 15.7 AWS Lambda Metrics
Namespace | AWS/Lambda |
Dimensions | FunctionName: name of your AWS Lambda function |
Key metrics |
|
Table 15.8 describes the Amazon Simple Queue Service (Amazon SQS) metrics.
Table 15.8 Amazon SQS Metrics
Namespace | AWS/SQS |
Dimensions | QueueName: name of the Amazon SQS queue |
Key metrics |
|
Amazon SQS does not report the total number of messages in the queue. You can find this value by adding ApproximateNumberOfMessagesVisible and ApproximateNumberOfMessagesNotVisible.
Table 15.9 describes the Amazon Simple Notification Service (Amazon SNS) metrics.
Table 15.9 Amazon SNS Metrics
Namespace | AWS/SNS |
Dimensions | TopicName: name of the Amazon SNS topic |
Key metrics |
|
In addition to the built-in metrics that AWS provides, CloudWatch also supports custom metrics that you can publish from your systems. This section includes some commands that you can use to publish metrics to CloudWatch.
With custom metrics, you have two options for the resolution (the time interval between data points) of your metrics. You can use standard resolution for data points that have a granularity of one minute or high resolution for data points that have a granularity of one second. By default, most metrics delivered by AWS services have standard resolution.
CloudWatch supports multiple options when you publish metrics: single data points, statistic sets, or the value 0. Single data points are optimal for most telemetry. However, statistic sets are recommended for high-resolution data that you sample multiple times per minute; they are sets of calculated values, such as minimum, maximum, average, sum, and sample count, as opposed to individual data points. Publishing the value 0 is useful for applications that have periods of inactivity, during which no data would otherwise be sent. The following sample scripts use the AWS CLI to publish data points.
The following commands each publish a single data point under the metric name PageViewCount in the namespace MyService, with respective values and timestamps. You are not required to create the metric or namespace in advance: CloudWatch appends the data points to the metric if it exists or creates a new metric if it does not.
aws cloudwatch put-metric-data \
    --metric-name PageViewCount \
    --namespace MyService \
    --value 2 \
    --timestamp 2018-10-20T12:00:00.000Z
aws cloudwatch put-metric-data \
    --metric-name PageViewCount \
    --namespace MyService \
    --value 4 \
    --timestamp 2018-10-20T12:00:01.000Z
aws cloudwatch put-metric-data \
    --metric-name PageViewCount \
    --namespace MyService \
    --value 5 \
    --timestamp 2018-10-20T12:00:02.000Z
The following command publishes a statistic set for the metric PageViewCount in the namespace MyService, with a Sum of 11, Minimum of 2, Maximum of 5, and SampleCount of 3, with the corresponding timestamp:
aws cloudwatch put-metric-data \
    --metric-name PageViewCount \
    --namespace MyService \
    --statistic-values Sum=11,Minimum=2,Maximum=5,SampleCount=3 \
    --timestamp 2018-10-14T12:00:00.000Z
The following command publishes a single data point with the value 0 for the metric PageViewCount in the namespace MyService with the corresponding timestamp:
aws cloudwatch put-metric-data \
    --metric-name PageViewCount \
    --namespace MyService \
    --value 0 \
    --timestamp 2018-10-14T12:00:00.000Z
After you publish data to CloudWatch, you may want to retrieve statistics for a specified metric of a given resource.
This command retrieves the Sum, Maximum, Minimum, Average, and SampleCount statistics for the metric PageViewCount in the namespace MyService, with a period of 60 seconds between the start time and end time. This means that CloudWatch aggregates data points in one-minute intervals to calculate each statistic.
aws cloudwatch get-metric-statistics \
    --namespace MyService \
    --metric-name PageViewCount \
    --statistics "Sum" "Maximum" "Minimum" "Average" "SampleCount" \
    --start-time 2018-10-20T12:00:00.000Z \
    --end-time 2018-10-20T12:05:00.000Z \
    --period 60
Example output from this command displays a single data point for the metric PageViewCount:
{
"Datapoints": [
{
"SampleCount": 3.0,
"Timestamp": "2018-10-20T12:00:00Z",
"Average": 3.6666666666666665,
"Maximum": 5.0,
"Minimum": 2.0,
"Sum": 11.0,
"Unit": "None"
}
],
"Label": "PageViewCount"
}
Though most standard commercial applications already produce some form of logging, most modern applications are deployed in distributed or service-oriented architectures. Collecting and processing these logs can be a challenge as a system grows and expands across multiple Regions. Centralized logging using CloudWatch Logs can overcome this challenge. With CloudWatch Logs, you can set up a central log storage location to ingest and process logs at scale.
Setting up centralized logging with CloudWatch Logs is a straightforward process. The first step is to install and configure the CloudWatch agent, which is used to collect custom logs and metrics from Amazon EC2 instances or on-premises servers. You can choose which log files you want to ingest by pointing to the locations using a JavaScript Object Notation (JSON) configuration file. The second step is to configure AWS Identity and Access Management (IAM) roles or users to grant permission for the agent to publish logs into CloudWatch. In addition to the CloudWatch agent, you can also send metrics to CloudWatch using the AWS CLI, AWS SDK, or AWS API.
Because you are collecting logs from multiple sources, CloudWatch organizes your logs into three conceptual levels: groups, streams, and events.
A log group is a collection of log streams. For example, if you have a service that consists of a cluster of multiple machines, a log group would be a container for the logs from each of the individual instances.
A log stream is a sequence of log events such as a single log file from one of your instances.
A log event is a record of some activity from an application, process, or service. This is analogous to a single line in a log file.
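The three levels nest naturally. The following sketch (illustrative names, not the CloudWatch Logs API) models a log group as a mapping of streams, each holding an ordered list of events:

```python
# Illustrative sketch of the CloudWatch Logs hierarchy: a log group
# contains log streams, and each stream is an ordered list of log events.
from collections import defaultdict

log_groups = defaultdict(lambda: defaultdict(list))  # group -> stream -> events

def put_log_event(group, stream, timestamp, message):
    log_groups[group][stream].append({"timestamp": timestamp, "message": message})

# Two instances writing to the same group produce two streams.
put_log_event("/my-service/web", "i-0abc123", 1, "[ERROR] Unhandled Exception")
put_log_event("/my-service/web", "i-0def456", 2, "[WARN] Some message")
assert len(log_groups["/my-service/web"]) == 2
```

In practice the CloudWatch agent performs this bookkeeping for you, typically using the instance ID as the stream name.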
CloudWatch stores log events based on your retention settings, which are assigned at the log group. The default configuration is to store log data in Amazon CloudWatch Logs indefinitely. You are charged for any data stored in CloudWatch Logs in addition to data transferred out of the service. You can export CloudWatch Logs to Amazon S3 for long-term storage, which is valuable when regulations require long-term log retention. Long-term retention can be combined with Amazon S3 lifecycle policies to archive data to Amazon S3 Glacier for additional cost savings.
With centralized logging on CloudWatch Logs, you do not need to search through hundreds of individual servers to find a problem. After logs are ingested into CloudWatch Logs, you can search for logs through a central location using metric filters.
A metric filter is a text pattern used to parse log data for specific events. As an example, consider the log in Table 15.10.
Table 15.10 Example Log
Line | Log Event |
1 | [ERROR] Caught IllegalArgumentException |
2 | [ERROR] Unhandled Exception |
3 | Another message |
4 | Exiting with ERRORCODE: -1 |
5 | [WARN] Some message |
6 | [ERROR][WARN] Some other message |
To look for occurrences of the ERROR event, you use ERROR as your metric filter, as illustrated in Table 15.11. CloudWatch will search for that term across the logs.
Table 15.11 Example Metric Filters
Metric Filter | Description |
"" | Matches all log events. |
"ERROR" | Matches log events containing the term “ERROR.” Based on the events in the example log in Table 15.10, this metric filter would find lines 1, 2, 4, and 6. |
"ERROR" - "Exiting" | Matches log events containing the term “ERROR” but not the term “Exiting.” Based on the events in the example log in Table 15.10, this metric filter would find lines 1, 2, and 6. |
"ERROR Exception" | Matches log events containing both the terms “ERROR” and “Exception.” This filter is an AND function. Based on the events in the example log in Table 15.10, this metric filter would find lines 1 and 2. |
"?ERROR ?WARN" | Matches log events containing either the term “ERROR” or the term “WARN.” This filter is an OR function. Based on the events in the example log in Table 15.10, this metric filter would find lines 1, 2, 4, 5, and 6. |
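The filter semantics in the table can be sketched in a few lines of code. This is a deliberately simplified matcher (plain terms are ANDed, a leading "-" excludes a term, a leading "?" ORs terms; real CloudWatch filter patterns have a richer syntax), run against the example log from Table 15.10:

```python
# Simplified sketch of metric-filter term matching, not the full
# CloudWatch filter-pattern grammar.
def matches(pattern, event):
    include, exclude, any_of = [], [], []
    for term in pattern.split():
        if term.startswith("-"):
            exclude.append(term[1:].strip('"'))
        elif term.startswith("?"):
            any_of.append(term[1:].strip('"'))
        else:
            include.append(term.strip('"'))
    if any(t in event for t in exclude):
        return False
    if any_of and not any(t in event for t in any_of):
        return False
    return all(t in event for t in include)

log = [
    "[ERROR] Caught IllegalArgumentException",
    "[ERROR] Unhandled Exception",
    "Another message",
    "Exiting with ERRORCODE: -1",
    "[WARN] Some message",
    "[ERROR][WARN] Some other message",
]
hits = [i + 1 for i, e in enumerate(log) if matches('"ERROR"', e)]
assert hits == [1, 2, 4, 6]
```

The other rows of the table follow the same way; for example, the pattern '"ERROR" -"Exiting"' drops line 4 from the matches.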
If your logs are structured in JSON format, CloudWatch can also filter object properties. Consider the following example JSON log.
{
"user": {
"id": 1,
"email": "[email protected]"
},
"users": [
{
"id": 2,
"email": "[email protected]"
},
{
"id": 3,
"email": "[email protected]"
}
],
"actions": [
"GET",
"PUT",
"DELETE"
],
"coordinates": [
[0, 1, 2],
[4, 5, 6],
[7, 8, 9]
]
}
You can create a metric filter that selects and compares certain properties of this event, as shown in Table 15.12.
Table 15.12 Example JSON Metric Filters
JSON Metric Filter | Description |
{ ($.user.id = 1) && ($.users[0].email = "John.D[email protected]") } | Check that the property user.id equals 1 and the first user’s email is John.D[email protected]. The preceding log event would be returned. |
{ ($.user.id = 2 && $.users[0].email = "John.D[email protected]") || $.actions[2] = "GET" } | Check that the property user.id equals 2 and the first user’s email is John.D[email protected], or that the third action ($.actions[2]) is GET. The preceding example would not be returned, because $.actions[2] is DELETE, not GET. |
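A JSON metric filter is just a boolean expression over properties of the parsed event. The following sketch evaluates the $.user.id and $.actions[2] selectors directly in Python against a trimmed version of the example log event:

```python
# Illustrative sketch: evaluate JSON filter selectors as plain
# property and index access on the parsed event.
import json

event = json.loads('''{
  "user": {"id": 1},
  "actions": ["GET", "PUT", "DELETE"]
}''')

# Equivalent of: { $.user.id = 1 && $.actions[2] = "GET" }
matched = event["user"]["id"] == 1 and event["actions"][2] == "GET"
assert matched is False  # $.actions[2] is "DELETE", so the event is not selected
```

Note that array selectors are zero-based, so $.actions[2] refers to the third element.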
Instead of having to write additional code to add monitoring to your application, CloudWatch can process logs that you already generate and provide valuable metrics. Using the example from the previous section, the same metric filter can be used to generate metrics corresponding to the number of occurrences of the term ERROR in your logs.
After data points are established in CloudWatch, either as metrics or as logs (from which you generate metrics), you can set alarms to monitor your metrics and trigger actions in response to changes in state. CloudWatch alarms have three possible states: OK, ALARM, and INSUFFICIENT_DATA. Table 15.13 defines each alarm state.
Table 15.13 Alarm States
State | Description |
OK | The metric or expression is within the defined threshold. |
ALARM | The metric or expression is outside of the defined threshold. |
INSUFFICIENT_DATA | The alarm has just started, the metric is not available, or not enough data is available for the metric to determine the alarm state. |
An ALARM state may not indicate a problem. It means that the given metric is outside the defined threshold. For example, you have two alarms for Auto Scaling groups: one for high CPU utilization and one for low CPU utilization. During normal use, both alarms should be OK, indicating that you have adequate capacity to handle the current workload. If your workload changes, the high CPU utilization metric threshold may be breached, sending the corresponding alarm into ALARM state. With an Auto Scaling group, the alarm’s state change triggers a scale-out event, adding capacity to your infrastructure.
When you create an alarm, you specify the threshold along with three settings that determine when the alarm should change states: the period, the evaluation period, and the number of data points to alarm, as described in Table 15.14.
Table 15.14 Alarm Settings
Setting | Description |
Period | The length of time (in seconds) to evaluate the metric or expression to create each individual data point for an alarm. If you choose one minute as the period, there is one data point every minute. |
Evaluation Period | The number of the most recent periods, or data points, to evaluate when determining alarm state. |
Data Points to Alarm | The number of data points within the evaluation period that must breach the specified threshold to cause the alarm to go to the ALARM state. These data points do not have to be consecutive. |
Figure 15.3 illustrates how an alarm works based on configuration settings.
The figure illustrates a threshold configured to the value 3 (in blue), an evaluation period of three data points, and the metric’s data points (in red). Notice how the settings drive when the alarm changes state. Although the data points breach the threshold after the third period, the breach is not sustained for the required three periods, so the alarm does not yet enter the ALARM state. Only after the fifth period does the alarm change to the ALARM state, because the upper threshold has then been breached for three consecutive periods. Between the fifth and sixth periods, the data points drop below the threshold. However, because the metric has not remained below the threshold for three periods, the alarm does not return to the OK state until the eighth period. It then remains in the OK state past the ninth period, because three consecutive periods exceeding the threshold would be required for the state to change again.
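The evaluation logic described above can be sketched as a small function. This is an illustrative simplification (it ignores missing-data handling and non-consecutive breach configurations) for the 3-of-3 case in the figure:

```python
# Illustrative sketch of alarm evaluation for a "3 data points out of 3"
# configuration: the alarm enters ALARM only when the required number of
# data points in the evaluation window breach the threshold.
def alarm_state(datapoints, threshold, evaluation_periods=3, datapoints_to_alarm=3):
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for v in window if v > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"

# A single spike above the threshold does not trigger the alarm...
assert alarm_state([1, 2, 4], threshold=3) == "OK"
# ...but three consecutive breaching data points do.
assert alarm_state([2, 4, 5, 6], threshold=3) == "ALARM"
```

Requiring multiple breaching data points keeps a brief spike from flapping the alarm between states.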
Alarms can trigger Amazon EC2 actions and EC2 Auto Scaling actions. CloudWatch can leverage Amazon SNS or Amazon SQS for alarm state notifications, both of which provide numerous integrations with other AWS services.
Exercise caution when creating email notifications for alarms in your environment. This can lead to many unnecessary emails to you or your team. Ultimately, these notifications get filtered as spam or result in “notification fatigue.” Evaluate your alarms and the metrics you are monitoring to determine whether notifications are necessary. If they are only status updates, set notifications sparingly.
CloudWatch offers a convenient way to observe operational metrics for all of your applications. CloudWatch dashboards are customizable status pages in the CloudWatch console that you can use to monitor resources across multiple Regions and on-premises environments in a single consolidated view (see Figure 15.4). Each dashboard is composed of widgets, and each widget can be customized to present CloudWatch data in a user-friendly way so that you can make informed decisions based on the current status of your system.
All actions in your AWS account are composed of API calls, regardless of the origin (the AWS Management Console or programmatic/scripted actions). As you create resources in your account, API calls are being made to AWS services in different regions around the world. AWS CloudTrail is a fully managed service that continuously monitors and records API calls and stores them in Amazon S3. You can use these logs to troubleshoot and resolve operational issues, meet and verify regulatory compliance, and monitor or alarm on specific events in your account. CloudTrail supports most AWS services, making it easy for IT and security administrators to analyze activity in accounts. IT auditors can also use log files as compliance aids.
CloudTrail helps answer the following five key questions about monitoring access:
A CloudTrail event is any single API activity in an AWS account. This activity can be an action triggered by any of the following:
CloudTrail tracks two types of events: management events and data events. Events are recorded in the region where the action occurred, except for global service events.
Management events give insight into operations performed on AWS resources, such as the following examples:
Data events give insight into operations that store data in (or extract data from) AWS resources, such as the following examples:
By default, CloudTrail tracks the last 90 days of API history for management events. The following is example output for a CloudTrail event:
{
"Records": [{
"eventVersion": "1.01",
"userIdentity": {
"type": "IAMUser",
"principalId": "AIDAJDPLRKLG7UEXAMPLE",
"arn": "arn:aws:iam::123456789012:user/Alice",
"accountId": "123456789012",
"accessKeyId": "AKIAIOSFODNN7EXAMPLE",
"userName": "Alice",
"sessionContext": {
"attributes": {
"mfaAuthenticated": "false",
"creationDate": "2014-03-18T14:29:23Z"
}
}
},
"eventTime": "2014-03-18T14:30:07Z",
"eventSource": "cloudtrail.amazonaws.com",
"eventName": "StartLogging",
"awsRegion": "us-west-2",
"sourceIPAddress": "198.162.198.64",
"userAgent": "signin.amazonaws.com",
"requestParameters": {
"name": "Default"
},
"responseElements": null,
"requestID": "cdc73f9d-aea9-11e3-9d5a-835b769c0d9c",
"eventID": "3074414d-c626-42aa-984b-68ff152d6ab7"
},
... additional entries ...
]
}
This event provides the following information:
As a security precaution, you can use events such as this example to configure alerts when an IAM user attempts to sign in to the AWS Management Console too many times.
Some AWS services allow you to create, modify, and delete resources from any region. These are referred to as global services. Examples of global services include the following:
Global services are logged as occurring in the US East (N. Virginia) Region. Trails created in the CloudTrail console log global service events by default, and these events are delivered to the Amazon S3 bucket for the trail.
If you need long-term storage of events (for example, for compliance purposes), you can configure a trail of events as log files in CloudTrail. A trail is a configuration that enables delivery of CloudTrail events to an Amazon S3 bucket, Amazon CloudWatch Logs, and Amazon CloudWatch Events. When you configure a trail, you can filter the events that you want to be delivered.
The services covered so far are centered on the concept of using logs as monitoring and troubleshooting tools. Developers often write code, test the code, and inspect the logs. If there are errors, they may add breakpoints, run the test again, and add log statements. This works well in small cases, but it becomes cumbersome as teams, software, and infrastructure grow. Traditional troubleshooting and debugging processes do not scale well across multiple services. Troubleshooting cross-service and cross-Region interactions can be especially difficult when different systems use varying log formats.
AWS X-Ray is a service that collects data about requests served by your application. It provides tools you can use to view, filter, and gain insights into that data to identify issues and opportunities for optimization.
X-Ray helps developers build, monitor, and improve applications. Use cases include the following:
X-Ray integrates with the AWS SDK, adding traces to track your application requests as they are sent to and received from various services.
To understand how X-Ray works, consider the example service shown in Figure 15.5. In this service, the front-end fleet relies on a backend API built using API Gateway, which acts as a proxy to Lambda. Lambda then uses Amazon DynamoDB to store data.
X-Ray can track a user request using a trace, segment, and subsegment.
Trace A trace is the path of a request through your application. This is the end-to-end request from the client: from its entry into your environment, through the backend, and back to the user. A trace ID is passed through the AWS services with the request so that X-Ray can collate related segments.
Segment A segment is the data from a particular service. When a segment is reported to X-Ray, its trace ID is included so that X-Ray can associate it with the correct request. Segments are analogous to links in a chain, in which the chain is the request generated by the user. In the example microservice, two segments correspond to the two services: the front-end service and the backend API.
Subsegment A subsegment identifies the underlying API calls made from a particular service. Subsegments are collated into segments. In this scenario, the backend API sends requests to Amazon DynamoDB and Amazon SQS.
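To make the trace, segment, and subsegment relationship concrete, the snippet below builds a minimal X-Ray-style segment document in memory, assuming the documented format: a trace ID of `1-<8 hex digits of the epoch time>-<24 random hex digits>`, a 16-hex-digit segment ID, and subsegments nested inside the segment. The service names are taken from the example above.

```python
# Sketch: the shape of an X-Ray segment document with one subsegment.
# Trace ID format: version 1, 8 hex digits of the epoch second,
# then 96 random bits rendered as 24 hex digits.
import os
import re
import time

def new_trace_id():
    return "1-{:08x}-{}".format(int(time.time()), os.urandom(12).hex())

start = time.time()
segment = {
    "name": "backend-api",           # the service reporting this segment
    "id": os.urandom(8).hex(),       # 16-hex-digit segment ID
    "trace_id": new_trace_id(),      # shared by every segment in the trace
    "start_time": start,
    "end_time": start + 0.05,
    "subsegments": [{
        "name": "DynamoDB",          # downstream call made by this service
        "id": os.urandom(8).hex(),
        "start_time": start,
        "end_time": start + 0.02,
        "namespace": "aws",          # "aws" marks a call to an AWS service
    }],
}
print(bool(re.fullmatch(r"1-[0-9a-f]{8}-[0-9a-f]{24}",
                        segment["trace_id"])))  # True
```

In a real application, the X-Ray SDK assembles and reports documents like this for you; the sketch only illustrates how a subsegment is collated into its parent segment under a shared trace ID.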
From these components of a request, X-Ray compiles the traces into a service graph that describes the components and their interactions needed to complete a request. A service graph is a visual representation of the services and resources that make up your application. Figure 15.6 shows an example of a service graph.
The service graph provides an overview of the health of various aspects of your system, such as average latencies and request rates between your services and dependent resources. The colored circles also show the ratio of different response codes, as listed in Table 15.15.
Table 15.15 AWS X-Ray Service Graph Status Codes
Color | Status Code |
Purple | Throttling or HTTP 429 codes |
Orange | Client-side errors or HTTP 4XX codes |
Red | Fault (application failure) or HTTP 5XX codes |
Green | OK or HTTP 2XX codes |
X-Ray provides a convenient way for you to view system performance and to identify problems or bottlenecks in your applications. However, it does not provide auditing capabilities or track every request to a system. X-Ray samples a statistically significant subset of requests so that it can provide meaningful insights. These insights enable you to focus on troubleshooting a particular service or improving a specific component of your application.
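The sampling behavior can be sketched as a reservoir-plus-rate rule: record the first request in each one-second window, then a fixed percentage of the rest. The default values here (one request per second plus 5 percent) mirror the X-Ray SDK's default rule, but the class itself is an illustrative simplification, not the SDK's implementation.

```python
# Sketch: reservoir-plus-rate sampling, in the spirit of X-Ray's
# default rule (1 request/second reservoir, then 5% of the remainder).
import random

class ReservoirSampler:
    def __init__(self, per_second=1, rate=0.05):
        self.per_second = per_second   # guaranteed samples per window
        self.rate = rate               # probability for extra requests
        self.window = None             # current one-second window
        self.taken = 0                 # samples taken in this window

    def should_sample(self, now):
        second = int(now)
        if second != self.window:      # a new one-second window begins
            self.window = second
            self.taken = 0
        if self.taken < self.per_second:
            self.taken += 1            # reservoir: always take the first N
            return True
        return random.random() < self.rate  # then sample a percentage

sampler = ReservoirSampler()
print(sampler.should_sample(100.0))  # True (first request in the window)
```

The design keeps tracing overhead bounded under load while still guaranteeing at least one sampled request per second, which is why the resulting insights remain statistically meaningful.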
AWS provides multiple options for monitoring and troubleshooting your applications. As you have discovered, AWS services help you manage logs from various systems, whether running in the cloud or on-premises, create triggers that notify you about application health and issues in your infrastructure, and build applications with modern debugging tools for distributed systems. These services overcome the difficulties of creating a centralized logging solution.
Know what Amazon CloudWatch is and why it is used. CloudWatch is the service used to aggregate, analyze, and alert on metrics generated by other AWS services. It is used to monitor the resources you create in AWS as well as on-premises infrastructure. You can use CloudWatch to store logs from your applications and to trigger actions in response to events.
Know what common metrics are available for Amazon Elastic Compute Cloud (Amazon EC2) in Amazon CloudWatch. Amazon EC2 metrics in CloudWatch include the following:
Amazon EC2 does not report OS-level metrics such as memory utilization.
Understand the difference between high-resolution and standard-resolution metrics. High-resolution metrics are delivered at a period of less than one minute, down to one second. Standard-resolution metrics are delivered at a period of one minute or greater.
Know what AWS CloudTrail is and why it is used. CloudTrail is used to monitor API calls made to the AWS Cloud for various services. CloudTrail helps IT administrators, IT security administrators, DevOps engineers, and auditors to enable compliance and the monitoring of access to AWS resources within an account.
Know what AWS CloudTrail tracks automatically. By default, CloudTrail tracks the last 90 days of activity. These events are limited to management events with create, modify, and delete API calls.
Understand the difference between AWS CloudTrail management and data events. Management events are operations performed on resources in your AWS account. Data events are operations performed on data stored in AWS resources. Examples are creating or deleting objects in Amazon S3 and inserting or updating items in an Amazon DynamoDB table.
Know what AWS X-Ray is and why it is used. X-Ray is a service that collects data about your application requests, including the various subservices or systems that perform tasks to complete a request. X-Ray is commonly used to help developers find bottlenecks in distributed applications and monitor the health of various components in their services.
Know the basics of AWS X-Ray and how it helps troubleshoot applications. X-Ray records requests by initiating a trace ID at the origin of the request. This trace ID is added as a header to the request and propagates to the various services involved. If you enable the X-Ray SDK in your applications, they submit telemetry to X-Ray as segments for each service and subsegments for the downstream services on which they depend. X-Ray collates these traces into request performance metrics, such as latency and error rates, which can then be used to build a graph of your application, its dependencies, and the health of the requests your application makes.
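The propagated header mentioned above is `X-Amzn-Trace-Id`, which carries the root trace ID, an optional parent segment ID, and a sampling decision in `key=value` pairs separated by semicolons. A minimal parser for that format, using the example trace ID from the AWS documentation:

```python
# Sketch: parse the X-Amzn-Trace-Id header that propagates a trace
# through downstream services ("Root=...;Parent=...;Sampled=1").
def parse_trace_header(header):
    return dict(part.split("=", 1) for part in header.split(";") if "=" in part)

h = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
fields = parse_trace_header(h)
print(fields["Root"])     # 1-5759e988-bd862e3fe1be46a994272793
print(fields["Sampled"])  # 1
```

A downstream service reads this header, reports its own segment with the same `Root` trace ID, and passes the header along, which is how X-Ray stitches segments from different services into one trace.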
It is common to monitor the storage usage of your Amazon S3 buckets and trigger notifications when there is a large increase in storage used. In this exercise, you will use the AWS CLI to configure an Amazon CloudWatch alarm to trigger a notification when more than 1 KB of data is uploaded to an Amazon S3 bucket.
If you need directions while completing this exercise, see “Using Amazon CloudWatch Alarms” here:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-bucket.html
The alarm is created in your account. If you already have data in your Amazon S3 bucket, the alarm switches from the Insufficient Data state to the Alarm state. Otherwise, try uploading several files to your bucket and watch the alarm state change.
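A hedged sketch of the kind of CLI command this exercise describes: the bucket name, alarm name, and SNS topic ARN are placeholders, and `BucketSizeBytes` is a daily S3 storage metric, hence the one-day period.

```shell
# Sketch: alarm when the bucket's stored data exceeds 1 KB.
# Bucket name, alarm name, and topic ARN are placeholders.
aws cloudwatch put-metric-alarm \
    --alarm-name s3-size-alarm \
    --namespace AWS/S3 \
    --metric-name BucketSizeBytes \
    --dimensions Name=BucketName,Value=my-bucket \
                 Name=StorageType,Value=StandardStorage \
    --statistic Average \
    --period 86400 \
    --threshold 1024 \
    --comparison-operator GreaterThanThreshold \
    --evaluation-periods 1 \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:my-topic
```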
To delete the alarm, follow these steps:
In this exercise, you created an Amazon CloudWatch alarm to notify administrators when large amounts of data are uploaded to an Amazon S3 bucket in your account.
For instructions on how to do so, see the following:
https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-bucket.html
In this exercise, you enabled AWS CloudTrail to record data events and store corresponding logs to an Amazon S3 bucket.
In this exercise, you will create an Amazon CloudWatch dashboard to see graphed metric data.
In this exercise, you created an Amazon CloudWatch dashboard to create graphs of important metric data for resources in your account.
You are required to set up dynamic scaling using Amazon CloudWatch alarms. Which of the following metrics could you monitor to trigger Auto Scaling events to scale out and scale in your instances?

What is the length of time that metrics are stored for a data point with a period of 300 seconds (5 minutes) in Amazon CloudWatch?
Which of the following does an AWS CloudTrail event not provide?
You must set up centralized logging for an application and create a cost-effective way to archive logs for compliance purposes.
Which of the following options allow logs and metrics to be ingested into Amazon CloudWatch? (Select THREE.)
The following are Apache HTTP access logs.
You build an application and enable AWS X-Ray tracing. You analyze the service graph and determine that the application requests to Amazon DynamoDB are not performing well and a majority of the issues are purple.
What kind of problem is your application experiencing?
Which AWS service enables you to monitor resources and gather statistics, such as CPU utilization, from a single “pane of glass” interface?
By default, what is the number of days of AWS account activity that you can view, search, and download from the AWS CloudTrail event history?
Which of the following is not able to access AWS CloudTrail data?
In AWS CloudTrail, which of the following are management events? (Select TWO.)
Suppose that you have a custom web application running on an Amazon Elastic Compute Cloud (Amazon EC2) instance.
Which of the following are not supported Amazon CloudWatch alarm actions?
Which of the following Amazon Elastic Compute Cloud (Amazon EC2) metrics is not directly available through Amazon CloudWatch metrics?
Which of the following is the correct Amazon CloudWatch metric namespace for Amazon Elastic Compute Cloud (Amazon EC2) instances?