After you've successfully trained a model, you want to make the model available for inference, don't you? ML models are often the product of a business that is ML-driven. Your customers consume the ML prediction from your model, not your training jobs or processed data. How do you provide a satisfying customer experience, starting with a good experience with your ML models?
SageMaker has several options for ML hosting and inferencing, depending on your use case. Options are welcomed in many aspects of life, but it can be difficult to find the best option. This chapter will help you understand how to host models for batch inference and for online real-time inference, how to use multi-model endpoints to save costs, and how to conduct resource optimization for your inference needs.
In this chapter, we will be covering the following topics:
For this chapter, you need to access the code at https://github.com/PacktPublishing/Getting-Started-with-Amazon-SageMaker-Studio/tree/main/chapter07. If you did not run the notebooks in Chapter 5, please run the chapter05/02-tensorflow_sentiment_analysis.ipynb file from the repository before proceeding.
ML models can primarily be consumed in the cloud in two ways, batch inference and live inference. Batch inference refers to model inference performed on data that is in batches, often large batches, and asynchronous in nature. It fits use cases that collect data infrequently, that focus on group statistics rather than individual inference, and that do not need to have inference results right away for downstream processes. Projects that are research oriented, for example, do not require model inference to be returned for a data point right away. Researchers often collect a chunk of data for testing and evaluation purposes and care about overall statistics and performance rather than individual predictions. They can conduct the inference in batches and wait for the prediction for the whole batch to complete before they move on.
Live inference, on the other hand, refers to model inference performed in real time. It is expected that the inference result for an incoming data point is returned immediately so that it can be used for subsequent decision-making processes. For example, an interactive chatbot would require a live inference capability to support such a service. No one would want to wait until the end of the conversation to get responses from the chatbot model, nor would people want to wait for more than even a couple of seconds. Companies looking to provide the best customer experience would want an inference to be made and returned to the customer instantly.
Given the different requirements, the architecture and deployment choices also differ between batch inference and live inference. Amazon SageMaker has it covered as it provides various fully managed options for your inference use cases. SageMaker batch transform is designed to perform batch inference at scale and is cost-effective as the compute infrastructure is fully managed and is de-provisioned when your inference job is complete. SageMaker real-time endpoints aim to provide a robust live hosting option for your ML use cases. Both the SageMaker hosting options are fully managed, meaning you do not have to worry much about the cloud infrastructure.
Let's first take a look at SageMaker batch transform, how it works, and when to use it.
SageMaker batch transform is designed to provide offline inference for large datasets. Depending on how you organize the data, SageMaker batch transform can split a single large text file in S3 by lines into small, manageable mini-batches that fit into memory before making inferences against the model. It can also distribute files by S3 key across compute instances for efficient computation. For example, it could send test1.csv to instance 1 and test2.csv to instance 2.
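To make the idea of line-based mini-batching concrete, here is a minimal local sketch. It is not how SageMaker implements the split; the function name split_into_mini_batches and the byte-counting rule are assumptions chosen for illustration. It groups consecutive lines so that each group stays within a payload limit.

```python
def split_into_mini_batches(lines, max_payload_bytes):
    """Group consecutive lines so each batch stays within max_payload_bytes."""
    batches, current, current_size = [], [], 0
    for line in lines:
        size = len(line.encode("utf-8")) + 1  # +1 for the newline delimiter
        if size > max_payload_bytes:
            raise ValueError("A single record exceeds the payload limit")
        # Start a new batch when adding this line would exceed the limit
        if current and current_size + size > max_payload_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(line)
        current_size += size
    if current:
        batches.append(current)
    return batches


records = ["1,2,3", "4,5,6", "7,8,9", "10,11,12"]
mini_batches = split_into_mini_batches(records, max_payload_bytes=12)
print(mini_batches)
```

With a 12-byte limit, the first two records share a batch and the remaining two each get their own.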
To demonstrate SageMaker batch transform, we can pick up from our earlier training example. In Chapter 5, Building and Training ML Models with SageMaker Studio IDE, we showed you how to train a TensorFlow model using SageMaker managed training for a movie review sentiment prediction use case in Getting-Started-with-Amazon-SageMaker-Studio/chapter05/02-tensorflow_sentiment_analysis.ipynb. We can deploy the trained model to make a batch inference using SageMaker batch transform in the following steps:
Note
SageMaker batch transform expects the input CSV files to not contain headers. That is, the first row of the CSV should be the first data point.
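If your CSV file does carry a header row, one way to remove it before uploading to S3 is sketched below using only the standard library. The helper name drop_header and the sample content are hypothetical, and the sketch assumes the first row is always a header.

```python
import csv
import io


def drop_header(csv_text):
    """Return the CSV content without its first (header) row."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the first row, assumed to be the header
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(reader)
    return out.getvalue()


sample = "review,label\n12 45 7,1\n3 99 21,0\n"  # hypothetical file content
headerless = drop_header(sample)
print(headerless)
```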
You should replace training_job_name in the following code with your own:
from sagemaker.tensorflow import TensorFlow
training_job_name='<your-training-job-name>'
estimator = TensorFlow.attach(training_job_name)
Once you have replaced training_job_name and attached it to reload estimator, you should see the history of the job printed in the output.
transformer = estimator.transformer(instance_count=1,
                                    instance_type='ml.c5.xlarge',
                                    max_payload=2,  # MB
                                    accept='application/jsonlines',
                                    output_path=s3_output_location,
                                    assemble_with='Line')
transformer.transform(test_data_s3,
                      content_type='text/csv',
                      split_type='Line',
                      job_name=jobname,
                      experiment_config=experiment_config)
The estimator.transformer() method creates a Transformer object with the compute resources desired for the inference. Here we request one ml.c5.xlarge instance for predicting 25,000 movie reviews. The max_payload argument controls the size of each mini-batch that SageMaker batch transform creates when splitting the input. The accept argument determines the output type. The SageMaker managed TensorFlow Serving container supports 'application/json' and 'application/jsonlines'. assemble_with controls how the inference results from the mini-batches are assembled. Then we provide the S3 location of the test data (test_data_s3) to transformer.transform() and indicate that the input content type is 'text/csv', as the file is in CSV format. split_type determines how SageMaker batch transform splits the input files into mini-batches. We pass in a unique job name and a SageMaker Experiments configuration so that we can track the inference alongside the associated training job in the same trial. The batch transform job takes around 5 minutes to complete. As with a training job, SageMaker provisions the instances, runs the computation, and de-provisions the instances once the job finishes.
output = transformer.output_path
output_prefix = 'imdb_data/test_output'
!mkdir -p {output_prefix}
!aws s3 cp --recursive {output} {output_prefix}
!head {output_prefix}/{csv_test_filename}.out
{ "predictions": [[0.00371244829], [1.0], [1.0], [0.400452465], [1.0], [1.0], [0.163813606], [0.10115058], [0.793149233], [1.0], [1.0], [6.37737814e-14], [2.10463966e-08], [0.400452465], [1.0], [0.0], [1.0], [0.400452465], [2.65155926e-29], [4.04420768e-11], ……]}
We then collect all 25,000 predictions into a results variable:
import json

results = []
with open(f'{output_prefix}/{csv_test_filename}.out', 'r') as f:
    lines = f.readlines()
    for line in lines:
        print(line)
        json_output = json.loads(line)
        # Flatten the nested predictions and round each value to three decimals
        result = [float('%.3f' % item) for sublist in json_output['predictions']
                  for item in sublist]
        results += result
print(results)
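The collected probabilities can then be turned into sentiment labels with a threshold of 0.5. The following is a minimal sketch; the helper names to_labels and accuracy are illustrative, the ground truth values are hypothetical, and treating a probability of exactly 0.5 as positive is an assumption.

```python
def to_labels(probabilities, threshold=0.5):
    """Map each predicted probability to 1 (positive) or 0 (negative)."""
    return [1 if p >= threshold else 0 for p in probabilities]


def accuracy(predicted, actual):
    """Fraction of predictions that match the ground truth labels."""
    matches = sum(int(p == a) for p, a in zip(predicted, actual))
    return matches / len(actual)


sample_probs = [0.004, 1.0, 1.0, 0.4, 0.793]  # rounded values in the style of the output above
sample_truth = [0, 1, 1, 1, 1]                # hypothetical ground truth for illustration
sample_labels = to_labels(sample_probs)
print(sample_labels, accuracy(sample_labels, sample_truth))
```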
That's how easy it is to make use of SageMaker batch transform to generate inferences on a large dataset. You may wonder, why can't I just use the notebook to make inferences? What's the benefit of using SageMaker batch transform? Yes, you can use the notebook for quick analysis. The advantages of SageMaker batch transform are as follows:
Next, let's see how we can host ML models in the cloud for real-time use cases.
SageMaker real-time inference is a fully managed feature for hosting your model(s) on compute instance(s) for real-time low-latency inference. The deployment process consists of the following steps:
Hosting a real-time endpoint faces one particular challenge that is common when hosting a website or a web application: it can be difficult to scale your compute instances when you have a spike in traffic to your endpoint. You may have 1,000 customers visiting your website per minute in a particular hour and then 100,000 customers per minute in the next hour. If you only deploy one instance behind your endpoint that is capable of handling 5,000 requests per minute, it would work well in the first hour and would struggle in the next. Autoscaling is a technique in the cloud that scales out instances automatically when certain criteria are met so that your application can handle the load at any time.
Let's walk through a SageMaker real-time endpoint example. Like the batch transform example, we continue the ML use case from Chapter 5, Building and Training ML Models with SageMaker Studio IDE, and chapter05/02-tensorflow_sentiment_analysis.ipynb. Please open the notebook in Getting-Started-with-Amazon-SageMaker-Studio/chapter07/02-tensorflow_sentiment_analysis_inference.ipynb and use the Python 3 (TensorFlow 2.3 Python 3.7 CPU Optimized) kernel. We will deploy a trained model to SageMaker as a real-time endpoint, make some predictions as an example, and finally apply an autoscaling policy to help scale the compute instances behind the endpoint. Please follow these steps:
predictor = estimator.deploy(
    instance_type='ml.c5.xlarge',
    initial_instance_count=1)
Here, we choose ml.c5.xlarge for the instance_type argument. The initial_instance_count argument refers to the number of ML instances behind the endpoint when we make this call. Later, we will show you how to use the autoscaling feature, which is designed to help us scale out the instance fleet when the initial settings become insufficient. The deployment process takes about 5 minutes.
prediction=predictor.predict(x_test[data_index])
print(prediction)
{'predictions': [[1.80986511e-11]]}
The next two cells retrieve the review in text and print out the ground truth sentiment and the predicted sentiment with a threshold of 0.5, just like in the batch transform example.
predictor.predict(x_test)
This line will run for a couple of seconds and eventually fail. This is because each individual request to a SageMaker endpoint is limited to a payload of 6 MB. You can request inferences for multiple data points in one call, for example, x_test[:100], but not all 25,000 at once. In contrast, batch transform does the data splitting (mini-batching) automatically and is better suited to handling large datasets.
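If you do need predictions for many data points from a real-time endpoint, one option is to send them in smaller slices. The sketch below assumes a callable that behaves like predictor.predict and returns a dictionary with a 'predictions' key, as in the output shown earlier. The names predict_in_chunks and fake_predict are hypothetical, and fake_predict is only a local stand-in so that the example runs without an endpoint.

```python
def predict_in_chunks(predict_fn, data, chunk_size=100):
    """Call predict_fn on consecutive slices of data and concatenate the predictions."""
    predictions = []
    for start in range(0, len(data), chunk_size):
        response = predict_fn(data[start:start + chunk_size])
        predictions.extend(response["predictions"])
    return predictions


def fake_predict(batch):
    """Hypothetical stand-in for predictor.predict; returns one score per row."""
    return {"predictions": [[0.5] for _ in batch]}


all_predictions = predict_in_chunks(fake_predict, list(range(250)), chunk_size=100)
print(len(all_predictions))
```

The chunk size has to be small enough that each request stays under the payload limit, which depends on the size of your individual records.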
sagemaker_client = sess.boto_session.client('sagemaker')
autoscaling_client = sess.boto_session.client('application-autoscaling')
resource_id = f'endpoint/{endpoint_name}/variant/AllTraffic'
response = autoscaling_client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=4)
Our target, the SageMaker real-time endpoint, is denoted with resource_id. We set the minimum capacity to 1 and the maximum to 4, meaning that when the load is at the lowest, there will be at least one instance running behind the endpoint. Our endpoint is capable of scaling out to four instances at the most.
response = autoscaling_client.put_scaling_policy(
    PolicyName='Invocations-ScalingPolicy',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 4000.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType':
                'SageMakerVariantInvocationsPerInstance'},
        'ScaleInCooldown': 600,
        'ScaleOutCooldown': 300})
In this example, we employ a scaling strategy called target tracking scaling. Target tracking scaling scales the instances in and out based on a specific target metric, such as instance CPU load or the number of inference requests per instance per minute. We use the latter (SageMakerVariantInvocationsPerInstance) in this configuration so that each instance handles up to around 4,000 requests per minute before another instance is added. ScaleInCooldown and ScaleOutCooldown refer to the period of time in seconds after the last scaling activity before autoscaling can scale in and out again. With our configuration, SageMaker will not scale in (remove an instance) within 600 seconds of the last scale-in activity, and will not scale out (add an instance) within 300 seconds of the last scale-out activity.
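The arithmetic behind this configuration can be sketched in a few lines. This is a simplified model rather than the actual Application Auto Scaling algorithm, which relies on CloudWatch alarms; the function names desired_instance_count and cooldown_elapsed are illustrative assumptions.

```python
import math


def desired_instance_count(invocations_per_minute, target_per_instance,
                           min_capacity, max_capacity):
    """Instances needed to keep per-instance load at or below the target, within bounds."""
    needed = math.ceil(invocations_per_minute / target_per_instance)
    return max(min_capacity, min(max_capacity, needed))


def cooldown_elapsed(seconds_since_last_activity, cooldown_seconds):
    """True when enough time has passed for another scaling activity."""
    return seconds_since_last_activity >= cooldown_seconds


print(desired_instance_count(8000, 4000, min_capacity=1, max_capacity=4))
print(desired_instance_count(30000, 4000, min_capacity=1, max_capacity=4))
print(cooldown_elapsed(200, cooldown_seconds=300))
```

Under this model, 8,000 invocations per minute against a target of 4,000 call for two instances, and any load above 16,000 per minute is capped at the maximum capacity of four.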
Note
There are two commonly used advanced scaling strategies for PolicyType: step scaling and scheduled scaling. In step scaling, you can define the number of instances to scale in/out based on the size of the alarm breaches of a certain metric. Read more about step scaling at https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-scaling-simple-step.html. In scheduled scaling, you can set up the scaling based on the schedule. This is particularly useful if the traffic is predictable or has some seasonality. Read more about scheduled scaling at https://docs.aws.amazon.com/autoscaling/ec2/userguide/schedule_time.html.
response = autoscaling_client.describe_scaling_policies(
    ServiceNamespace='sagemaker')
for i in response['ScalingPolicies']:
    print('')
    print(i['PolicyName'])
    print('')
    if 'TargetTrackingScalingPolicyConfiguration' in i:
        print(i['TargetTrackingScalingPolicyConfiguration'])
    else:
        print(i['StepScalingPolicyConfiguration'])
    print('')
Invocations-ScalingPolicy
{'TargetValue': 4000.0, 'PredefinedMetricSpecification': {'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'}, 'ScaleOutCooldown': 300, 'ScaleInCooldown': 600}
The purpose of hosting an endpoint is to serve the ML models in the cloud so that you can integrate ML as a microservice into your applications or websites. Your model has to be available at all times as long as your main product or service is available. You can imagine that there is a great opportunity and incentive for you to optimize the deployment to minimize the cost while maintaining performance. We just learned how to deploy an ML model in the cloud; we should also learn how to optimize the deployment.
Optimizing model deployment is a critical topic for businesses. No one wants to be spending a dime more than they need to. Because deployed endpoints are being used continuously, and incurring charges continuously, making sure that the deployment is optimized in terms of cost and runtime performance can save you a lot of money. SageMaker has several options to help you reduce costs while optimizing the runtime performance. In this section, we will be discussing multi-model endpoint deployment and how to choose the instance type and autoscaling policy for your use case.
A multi-model endpoint is a type of real-time endpoint in SageMaker that allows multiple models to be deployed behind the same endpoint. There are many use cases in which you would build models for each customer or for each geographic area, and depending on the characteristics of the incoming data point, you would apply the corresponding ML model. Take the telecommunications churn prediction use case that we tackled in Chapter 3, Data Preparation with SageMaker Data Wrangler, as an example. We may get more accurate ML models if we train them by state because there may be regional differences in terms of competition among local telecommunication providers. And if we do train ML models for each US state, you can also easily imagine that the utilization of each model might not be completely equal. Actually, quite the contrary.
Model utilization is inevitably proportional to the population of each state. Your New York model is going to be used more frequently than your Alaska model. In this scenario, if you host an endpoint for each state, you will have to pay for instances, even for the least utilized endpoint. With multi-model endpoints, SageMaker helps you reduce costs by reducing the number of endpoints needed for your use case. Let's take a look at how it works with the telecommunications churn prediction use case. Please open the Getting-Started-with-Amazon-SageMaker-Studio/chapter07/03-multimodel-endpoint.ipynb notebook with the Python 3 (Data Science) kernel and follow the next steps:
df[["Int'l Plan", "VMail Plan"]] = df[["Int'l Plan", "VMail Plan"]].replace(to_replace=['yes', 'no'], value=[1, 0])
df['Churn?'] = df['Churn?'].replace(to_replace=['True.', 'False.'], value=[1, 0])
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df_processed,
                                     test_size=0.1, random_state=42, shuffle=True,
                                     stratify=df_processed['State'])
def launch_training_job(state, train_data_s3, val_data_s3):
    ...
    xgb = sagemaker.estimator.Estimator(image, role,
                                        instance_count=train_instance_count,
                                        instance_type=train_instance_type,
                                        output_path=s3_output,
                                        enable_sagemaker_metrics=True,
                                        sagemaker_session=sess)
    xgb.set_hyperparameters(
        objective='binary:logistic',
        num_round=20)
    ...
    xgb.fit(inputs=data_channels,
            job_name=jobname,
            experiment_config=experiment_config,
            wait=False)
    return xgb
dict_estimator = {}
for state in df_processed.State.unique()[:5]:
    print(state)
    output_dir = f's3://{bucket}/{prefix}/{local_prefix}/by_state'
    df_state = df_train[df_train['State'] == state].drop(labels='State', axis=1)
    df_state_train, df_state_val = train_test_split(df_state, test_size=0.1, random_state=42,
                                                    shuffle=True, stratify=df_state['Churn?'])
    df_state_train.to_csv(f'{local_prefix}/churn_{state}_train.csv', index=False)
    df_state_val.to_csv(f'{local_prefix}/churn_{state}_val.csv', index=False)
    # S3Uploader.upload returns the S3 URI of the uploaded file
    out_train_csv_s3 = sagemaker.s3.S3Uploader.upload(f'{local_prefix}/churn_{state}_train.csv', output_dir)
    out_val_csv_s3 = sagemaker.s3.S3Uploader.upload(f'{local_prefix}/churn_{state}_val.csv', output_dir)
    dict_estimator[state] = launch_training_job(state, out_train_csv_s3, out_val_csv_s3)
    time.sleep(2)
Each training job should take no more than 5 minutes. We use the wait_for_training_job_to_complete() function to wait for all of them to complete before proceeding.
from sagemaker.multidatamodel import MultiDataModel

model_PA = dict_estimator['PA'].create_model(
    role=role, image_uri=image)
mme = MultiDataModel(name=model_name,
                     model_data_prefix=model_data_prefix,
                     model=model_PA,
                     sagemaker_session=sess)
MultiDataModel needs the common model configuration, such as the container image and the network configuration, to set up the endpoint configuration. We pass in the model for PA. Afterward, we deploy the model to one ml.c5.xlarge instance and configure the serializer and deserializer to take CSV as input and produce JSON as output, respectively:
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

predictor = mme.deploy(
    initial_instance_count=hosting_instance_count,
    instance_type=hosting_instance_type,
    endpoint_name=endpoint_name,
    serializer=CSVSerializer(),
    deserializer=JSONDeserializer())
for state, est in dict_estimator.items():
    artifact_path = est.latest_training_job.describe()['ModelArtifacts']['S3ModelArtifacts']
    model_name = f'{state}.tar.gz'
    mme.add_model(model_data_source=artifact_path,
                  model_data_path=model_name)
That's it. We can verify that there are five models associated with this endpoint:
list(mme.list_models())
['MO.tar.gz', 'PA.tar.gz', 'SC.tar.gz', 'VA.tar.gz', 'WY.tar.gz']
state = 'PA'
test_data = sample_test_data(state)
prediction = predictor.predict(data=test_data[0],
                               target_model=f'{state}.tar.gz')
In this cell and onwards, we also set up a timer to measure how long the models for other states take to respond, in order to illustrate the dynamic loading of models from S3 to the endpoint. When the endpoint is first created, there is no model located behind the endpoint. Calling add_model() merely uploads the models to an S3 location, model_data_prefix. When a model is first requested, SageMaker dynamically downloads the requested model from S3 to the ML instance and loads it into the inference container. Because of this, the response time is longer the first time we run a prediction for each of the state models, up to 1,000 milliseconds. But once the model is loaded into memory in the container behind the endpoint, the response time is greatly reduced, to around 20 milliseconds. When a model is loaded, it is persisted in the container until the memory of the instance is exhausted by having too many models loaded at once. Then SageMaker unloads models that are not being used anymore from memory while still keeping model.tar.gz on disk in the instance, so that the next request does not have to download it from S3.
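The loading behavior described above resembles a cache with lazy loading and eviction. The toy model below illustrates the idea. It is not SageMaker's implementation: the class name ModelCache is hypothetical, eviction by least recent use is an assumption, capacity is counted in models rather than memory, and the on-disk copy of model.tar.gz is ignored.

```python
from collections import OrderedDict


class ModelCache:
    """Toy model of lazy loading with least-recently-used eviction."""

    def __init__(self, max_loaded):
        self.max_loaded = max_loaded
        self.loaded = OrderedDict()  # model name -> placeholder for a loaded model
        self.cold_loads = 0

    def invoke(self, model_name):
        if model_name in self.loaded:
            self.loaded.move_to_end(model_name)  # mark as most recently used
            return "warm"
        if len(self.loaded) >= self.max_loaded:
            self.loaded.popitem(last=False)      # evict the least recently used model
        self.loaded[model_name] = object()       # stands in for download and load
        self.cold_loads += 1
        return "cold"


cache = ModelCache(max_loaded=2)
calls = [cache.invoke(name) for name in
         ["PA.tar.gz", "PA.tar.gz", "VA.tar.gz", "WY.tar.gz", "PA.tar.gz"]]
print(calls, cache.cold_loads)
```

The second request for PA is served warm, but after VA and WY fill the cache, PA has been evicted and the final request pays the cold-start cost again.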
In this example, we showed how to host a SageMaker multi-model endpoint that is flexible and cost-effective because it drastically reduces the number of endpoints needed for your use case. So, instead of hosting and paying for five endpoints, we would only host and pay for one endpoint. That's an easy 80% cost saving. With hosting models trained for 50 US states in 1 endpoint instead of 50, that's a 98% cost saving!
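The savings figures follow from simple arithmetic, sketched below. The function name endpoint_cost_saving is illustrative, and the calculation assumes that every endpoint would otherwise run the same instance type and instance count, and that one endpoint is enough to serve the combined traffic.

```python
def endpoint_cost_saving(num_models, endpoints_used=1):
    """Fractional saving from hosting num_models on endpoints_used endpoints instead of one each."""
    return 1 - endpoints_used / num_models


print(f"{endpoint_cost_saving(5):.0%}")
print(f"{endpoint_cost_saving(50):.0%}")
```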
With SageMaker multi-model endpoints, you can host as many models as you can store in an S3 bucket location. The number of models that can be loaded in an endpoint simultaneously depends on the memory footprint of your models and the amount of RAM on the compute instance. Multi-model endpoints are suitable for use cases where you have models that are built in the same framework (XGBoost in this example), and where some latency on less frequently used models is tolerable.
Note
If you have models built from different ML frameworks, for example, a mix of TensorFlow, PyTorch, and XGBoost models, you can use a multi-container endpoint, which allows hosting up to 15 distinct framework containers. Another benefit of multi-container endpoints is that they do not have latency penalties as all containers are running at the same time. Find out more at https://docs.aws.amazon.com/sagemaker/latest/dg/multi-container-endpoints.html.
The other optimization approach is using a technique called load testing to help us choose the instance and autoscaling policy.
Load testing is a technique that allows us to understand how our ML model hosted in an endpoint with a compute resource configuration responds to online traffic. There are factors such as model size, ML framework, number of CPUs, amount of RAM, autoscaling policy, and traffic size that affect how your ML model performs in the cloud. Understandably, it's not easy to predict how many requests can come to an endpoint over time. It is prudent to understand how your model and endpoint behave in this complex situation. Load testing creates artificial traffic and requests to your endpoint and stress tests how your model and endpoint respond in terms of model latency, instance CPU utilization, memory footprint, and so on.
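As a rough illustration of what a load test measures, the sketch below sends concurrent requests to a local stand-in function and summarizes the observed latency. It is not a substitute for a real load testing tool; fake_endpoint, timed_call, and run_load_test are hypothetical names, and the 10-millisecond sleep simply mimics model latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_endpoint(payload):
    """Hypothetical stand-in for an endpoint call; sleeps to mimic model latency."""
    time.sleep(0.01)
    return {"predictions": [[0.5]]}


def timed_call(payload):
    """Invoke the endpoint once and return the observed latency in seconds."""
    start = time.perf_counter()
    fake_endpoint(payload)
    return time.perf_counter() - start


def run_load_test(num_requests, concurrency):
    """Send num_requests using a fixed number of concurrent workers."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(num_requests)))
    return {"requests": len(latencies),
            "avg_latency_ms": 1000 * sum(latencies) / len(latencies),
            "max_latency_ms": 1000 * max(latencies)}


summary = run_load_test(num_requests=50, concurrency=10)
print(summary)
```

Replacing fake_endpoint with a real invocation and varying the concurrency gives a first impression of how latency changes with load.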
In this section, let's run some load testing against the endpoint we created in chapter07/02-tensorflow_sentiment_analysis_inference.ipynb with some scenarios. In the example, we hosted a TensorFlow-based model to an ml.c5.xlarge instance, which has 4 vCPUs and 8 GiB of memory.
First of all, we need to understand the model's latency and capacity as a function of the type of instance and the number of instances before an endpoint becomes unavailable. Then we vary the instance configuration and autoscaling configuration until the desired latency and traffic capacity has been reached.
Please open the Getting-Started-with-Amazon-SageMaker-Studio/chapter07/04-load_testing.ipynb notebook with the Python 3 (Data Science) kernel and an ml.t3.xlarge instance and follow these steps:
sagemaker_client = sess.boto_session.client('sagemaker')
autoscaling_client = sess.boto_session.client('application-autoscaling')
endpoint_name = '<endpoint-with-ml.c5-xlarge-instance>'
resource_id = f'endpoint/{endpoint_name}/variant/AllTraffic'
response = autoscaling_client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=1)
%%sh --bg
export ENDPOINT_NAME='<endpoint-with-ml.c5-xlarge-instance>'
bind_port=5557
locust -f load_testing/locustfile.py --worker --loglevel ERROR --autostart --autoquit 10 --master-port ${bind_port} &
locust -f load_testing/locustfile.py --worker --loglevel ERROR --autostart --autoquit 10 --master-port ${bind_port} &
locust -f load_testing/locustfile.py --headless -u 500 -r 10 -t 60s \
    --print-stats --only-summary --loglevel ERROR \
    --autostart --autoquit 10 --master --expect-workers 2 --master-bind-port ${bind_port}
As it is running, let's navigate to the Amazon CloudWatch console to see what's happening from the endpoint's perspective. Please copy the following URL, replace <endpoint-with-ml.c5-xlarge-instance> with your endpoint name, and replace the region if you use a region other than us-west-2: https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#metricsV2:graph=~(metrics~(~(~'AWS*2fSageMaker~'InvocationsPerInstance~'EndpointName~'<endpoint-with-ml.c5-xlarge-instance>~'VariantName~'AllTraffic)~(~'.~'ModelLatency~'.~'.~'.~'.~(stat~'Average))~(~'.~'Invocations~'.~'.~'.~'.)~(~'.~'OverheadLatency~'.~'.~'.~'.~(stat~'Average))~(~'.~'Invocation5XXErrors~'.~'.~'.~'.)~(~'.~'Invocation4XXErrors~'.~'.~'.~'.))~view~'timeSeries~stacked~false~region~'us-west-2~stat~'Sum~period~60~start~'-PT3H~end~'P0D);query=~'*7bAWS*2fSageMaker*2cEndpointName*2cVariantName*7d*20<endpoint-with-ml.c5-xlarge-instance>
You can see a dashboard in Figure 7.4. The dashboard has captured the most important metrics regarding our SageMaker endpoint's health and status. Invocations and InvocationsPerInstance show the total number of invocations and per-instance counts. Invocation5XXErrors and Invocation4XXErrors are error counts with HTTP codes 5XX and 4XX respectively. ModelLatency (in microseconds) is the time taken by a model inside the container behind a SageMaker endpoint to return a response. OverheadLatency (in microseconds) is the time taken for our SageMaker endpoint to transmit a request and a response. Total latency for a request is ModelLatency plus OverheadLatency. These metrics are emitted by our SageMaker endpoint to Amazon CloudWatch.
In the first load test (Figure 7.4), we can see that there are around 8,221 invocations per minute, 0 errors, with an average ModelLatency of 53,825 microseconds, or 53.8 milliseconds.
With these numbers in mind as a baseline, let's scale up the instance, that is, let's use a larger instance.
from sagemaker.tensorflow import TensorFlow
training_job_name='<your-training-job-name>'
estimator = TensorFlow.attach(training_job_name)
predictor_c5_2xl = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.c5.2xlarge')
The deployment process takes a couple of minutes. Then we retrieve the endpoint name with the next cell, predictor_c5_2xl.endpoint_name.
export ENDPOINT_NAME='<endpoint-with-ml.c5-2xlarge-instance>'
Similarly, the traffic that locust was able to generate is around 8,000 invocations per minute (7,783 in Figure 7.5). ModelLatency clocks in at 45,871 microseconds (45.9 milliseconds), which is about 15% faster than the result from one ml.c5.xlarge instance.
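The unit conversion and the relative improvement quoted above can be checked with a short calculation. The helper names microseconds_to_ms and relative_improvement are illustrative; the two latency values are the ones reported for the first two load tests.

```python
def microseconds_to_ms(value_us):
    """Convert a CloudWatch latency value from microseconds to milliseconds."""
    return value_us / 1000


def relative_improvement(baseline, candidate):
    """Fractional reduction of candidate relative to baseline."""
    return (baseline - candidate) / baseline


baseline_us = 53825   # ModelLatency on one ml.c5.xlarge instance
candidate_us = 45871  # ModelLatency on one ml.c5.2xlarge instance
print(microseconds_to_ms(baseline_us), microseconds_to_ms(candidate_us))
print(f"{relative_improvement(baseline_us, candidate_us):.1%}")
```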
predictor_g4dn_xl = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.g4dn.xlarge')
endpoint_name = '<endpoint-with-ml.c5-xlarge-instance>'
resource_id = f'endpoint/{endpoint_name}/variant/AllTraffic'
response = autoscaling_client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=4)
You can confirm the scaling policy attached with the next cell in the notebook.
We can see the load test result on the Amazon CloudWatch dashboard, as shown in Figure 7.7. The chart shows an interesting pattern: something clearly happened between 18:48 and 18:49. ModelLatency dropped significantly from around 50,000 microseconds (50 milliseconds) to around 33,839 microseconds (33.8 milliseconds), and InvocationsPerInstance fell to half the number of Invocations. We are seeing the effect of SageMaker's autoscaling. Instead of a single instance taking all 8,000 invocations, SageMaker determined that two instances are more appropriate to achieve the target of SageMakerVariantInvocationsPerInstance=4000 and split the traffic across two instances. A lower ModelLatency is the desired outcome of having multiple instances share the load.
After the four load testing experiments, we can conclude that at a load of around 6,000 to 8,000 invocations per minute, the following takes place:
If we consider another dimension, the cost of the instance(s), we can come to an even more interesting situation, as shown in Figure 7.8. In the table, we create a simple compound metric to measure the cost-performance efficiency of a configuration by multiplying ModelLatency by the price per hour of the instance configuration.
If we are constrained by cost, we should consider the last configuration (row d), which has the lowest monthly cost and the second-best cost-performance efficiency, at the expense of some model latency. If we need a model latency of around 40 milliseconds or lower, the third configuration (row c) gives us more bang for our buck and lower latency than the second configuration (row b) for the same monthly cost. The first configuration (row a) gives the best model latency and the best cost-performance efficiency, but it is also the most expensive option. Unless there is a strict single-digit model latency requirement, we might not want to use this option.
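The compound metric is straightforward to compute for your own configurations. In the sketch below, the function name cost_performance is illustrative, the latencies are the ones measured in the first two load tests, and the prices are placeholders in arbitrary units rather than real AWS prices. Since both latency and price are better when smaller, a lower score indicates better cost-performance efficiency.

```python
def cost_performance(model_latency_ms, price_per_hour, instance_count=1):
    """ModelLatency multiplied by the hourly price of the whole configuration; lower is better."""
    return model_latency_ms * price_per_hour * instance_count


# Prices are placeholders in arbitrary units, not real quotes; look up current prices before use.
configurations = {
    "1 x ml.c5.xlarge": cost_performance(53.8, price_per_hour=1.0),
    "1 x ml.c5.2xlarge": cost_performance(45.9, price_per_hour=2.0),
}
ranked = sorted(configurations, key=configurations.get)
print(ranked)
```

With these placeholder prices, the smaller instance ranks first because its latency penalty is smaller than its price advantage.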
To reduce cost, when you complete the examples, make sure to uncomment and run the last cells in 02-tensorflow_sentiment_analysis_inference.ipynb, 03-multimodel-endpoint.ipynb, and 04-load_testing.ipynb to delete the endpoints in order to stop incurring charges to your AWS account.
This discussion is based on the example we used, which assumes many factors, such as model framework, traffic pattern, and instance types. You should follow the best practices we introduced for your use case and test out more instance types and autoscaling policies to find the optimal solution for your use case. You can find the full list of instances, specifications, and prices per hour in the real-time inference tab at https://aws.amazon.com/sagemaker/pricing/ to come up with your own cost-performance efficiency analysis.
There are other optimization features in SageMaker that help you reduce latency, such as Amazon Elastic Inference, SageMaker Neo, and Amazon EC2 Inf1 instances. Elastic Inference (https://docs.aws.amazon.com/sagemaker/latest/dg/ei-endpoints.html) attaches fractional GPUs to a SageMaker hosted endpoint. It increases the inference throughput and decreases the model latency for your deep learning models that can benefit from GPU acceleration. SageMaker Neo (https://docs.aws.amazon.com/sagemaker/latest/dg/neo.html) optimizes an ML model for inference in the cloud and supported devices at the edge with no loss in accuracy. SageMaker Neo speeds up prediction and reduces cost with a compiled model and optimized container in SageMaker hosted endpoint. Amazon EC2 Inf1 instances (https://aws.amazon.com/ec2/instance-types/inf1/) provide high performance and low cost in the cloud with AWS Inferentia chips designed and built by AWS for ML inference purposes. You can compile supported ML models using SageMaker Neo and select Inf1 instances to deploy the compiled model in a SageMaker hosted endpoint.
In this chapter, we learned how to efficiently make ML inferences in the cloud using Amazon SageMaker. We followed up with what we trained in Chapter 5, an IMDb movie review sentiment prediction model, to demonstrate SageMaker's batch transform and real-time hosting. More importantly, we learned how to optimize for cost and model latency with load testing. We also learned about another great cost-saving opportunity by hosting multiple ML models in one single endpoint using SageMaker multi-model endpoints. Once you have selected the best inference option and instance types for your use case, SageMaker makes deploying your models straightforward. With these step-by-step instructions and this discussion, you will be able to translate what you've learned to your own ML use cases.
In the next chapter, we will take a different route to learn how we can use SageMaker's JumpStart and Autopilot to quick-start your ML journey. SageMaker JumpStart offers solutions to help you see how best practices and ML use cases are tackled. JumpStart model zoos collect numerous pre-trained deep learning models for natural language processing and computer vision use cases. SageMaker Autopilot is an autoML feature that crunches data and trains a performant model without you worrying about data, coding, or modeling. After we have learned the fundamentals of SageMaker—fully managed model training and model hosting—we can better understand how SageMaker JumpStart and Autopilot work.