In the previous chapter, you learned how Amazon SageMaker helps you build and prepare datasets. In a typical machine learning project, the next step would be to start experimenting with algorithms in order to find an early fit and get a sense of the predictive power you could expect from the model.
Whether you work with statistical machine learning or deep learning, three options are available when it comes to selecting an algorithm:
In this chapter, you will learn about Amazon SageMaker Autopilot, an AutoML capability part of Amazon SageMaker. We'll see how to use it in Amazon SageMaker Studio, without writing a single line of code, and also how to use it with the Amazon SageMaker SDK:
You will need an AWS account to run examples included in this chapter. If you haven't got one already, please point your browser at https://aws.amazon.com/getting-started/ to create it. You should also familiarize yourself with the AWS Free Tier (https://aws.amazon.com/free/), which lets you use many AWS services for free within certain usage limits.
You will need to install and configure the AWS Command-Line Interface (CLI) for your account (https://aws.amazon.com/cli/).
You will need a working Python 3.x environment. Be careful not to use Python 2.7, as it is no longer maintained. Installing the Anaconda distribution (https://www.anaconda.com/) is not mandatory, but is strongly encouraged as it includes many projects that we will need (Jupyter, pandas, numpy, and more).
Code examples included in the book are available on GitHub at https://github.com/PacktPublishing/Learn-Amazon-SageMaker. You will need to install a Git client to access them (https://git-scm.com/).
Added to Amazon SageMaker in late 2019, Amazon SageMaker Autopilot is an AutoML capability that takes care of all machine learning steps for you. You only need to upload a columnar dataset to an Amazon S3 bucket, and define the column you want the model to learn (the target attribute). Then, you simply launch an Autopilot job, with either a few clicks in the SageMaker Studio GUI, or a couple of lines of code with the SageMaker SDK.
The simplicity of SageMaker Autopilot doesn't come at the expense of transparency and control. You can see how your models are built, and you can keep experimenting to refine results. In that respect, SageMaker Autopilot should appeal to new and seasoned practitioners alike.
In this section, you'll learn about the different steps of a SageMaker Autopilot job, and how they contribute to delivering high-quality models:
Let's start by seeing how SageMaker Autopilot analyzes data.
This step is first responsible for understanding what type of machine learning problem we're trying to solve. SageMaker Autopilot currently supports linear regression, binary classification, and multi-class classification.
Note:
A frequent question is "how much data is needed to build such models?". This is a surprisingly difficult question. The answer—if there is one—depends on many factors, such as the number of features and their quality. As a basic rule of thumb, some practitioners recommend having 10-100 times more samples than features. In any case, I'd advise you to collect no fewer than hundreds of samples (for each class, if you're building a classification model). Thousands or tens of thousands are better, especially if you have more features. For statistical machine learning, there is rarely a need for millions of samples, so start with what you have, analyze the results, and iterate before going on a data collection rampage!
By analyzing the distribution of the target attribute, SageMaker Autopilot can easily figure out which one is the right one. For instance, if the target attribute has only two values (say, yes and no), it's pretty likely that you're trying to build a binary classification model.
Then, SageMaker Autopilot computes statistics on the dataset and on individual columns: the number of unique values, the mean, median, and so on. Machine learning practitioners very often do this in order to get a first feel for the data, and it's nice to see it automated. In addition, SageMaker Autopilot generates a Jupyter notebook, the data exploration notebook, to present these statistics in a user-friendly way.
Once SageMaker Autopilot has analyzed the dataset, it builds ten candidate pipelines that will be used to train candidate models. A pipeline is a combination of the following:
Next, let's see how Autopilot can be used in feature engineering.
This step is responsible for preprocessing the input dataset according to the pipelines defined during data analysis.
The ten candidate pipelines are fully documented in another auto-generated notebook, the candidate generation notebook. This notebook isn't just descriptive: you can actually run its cells, and manually reproduce the steps performed by SageMaker Autopilot. This level of transparency and control is extremely important, as it lets you understand exactly how the model was built. Thus, you're able to verify that it performs the way it should, and you're able to explain it to your stakeholders. Also, you can use the notebook as a starting point for additional optimization and tweaking if you're so inclined.
Lastly, let's take a look at model tuning in Autopilot.
This step is responsible for training and tuning models according to the pipelines defined during data analysis. For each pipeline, SageMaker Autopilot will launch an automatic model tuning job (we'll cover this topic in detail in a later chapter). In a nutshell, each tuning job will use hyperparameter optimization to train a large number of increasingly accurate models on the processed dataset. As usual, all of this happens on managed infrastructure.
Once the model tuning is complete, you can view the model information and metrics in Amazon SageMaker Studio, build visualizations, and so on. You can do the same programmatically with the Amazon SageMaker Experiments SDK.
Finally, you can deploy your model of choice just like any other SageMaker model using either the SageMaker Studio GUI or the SageMaker SDK.
Now that we understand the different steps of an Autopilot job, let's run a job in SageMaker Studio.
We will build a model using only SageMaker Studio. We won't write a line of machine learning code, so get ready for zero-code AI.
In this section, you'll learn how to do the following:
First, we need a dataset. We'll reuse the direct marketing dataset used in Chapter 2, Handling Data Preparation Techniques. This dataset describes a binary classification problem: will a customer accept a marketing offer, yes or no? It contains a little more than 41,000 labeled customer samples. Let's dive in:
%%sh
apt-get -y install unzip
wget -N https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip
unzip -o bank-additional.zip
import sagemaker
prefix = 'sagemaker/DEMO-autopilot/input'sess = sagemaker.Session()
uri = sess.upload_data(path="./bank-additional/bank-additional-full.csv", key_prefix=prefix)
print(uri)
The dataset will be available in S3 at the following location:
s3://sagemaker-us-east-2-123456789012/sagemaker/DEMO-autopilot/input/bank-additional-full.csv
We set the location of the input dataset using the path returned in step 3. As is visible in the following screenshot, we can either browse S3 buckets or enter the S3 location directly:
As shown in the following screenshot, we set the output location where job artifacts will be copied to: s3://sagemaker-us-east-2-123456789012/sagemaker/DEMO-autopilot/output/:
Once the job is launched, it goes through the three steps that we already discussed, which should take around 5 hours to complete. The new experiment is listed in the Experiments tab, and we can right-click Describe AutoML Job to describe its current status. This opens the following screen, where we can see the progress of the job:
At this point, we could simply deploy the top job, but instead, let's compare the top ten ones using the visualization tools built into SageMaker Studio.
A single SageMaker Autopilot job trains hundreds of jobs. Over time, you may end up with tens of thousands of jobs, and you may wish to compare their properties. Let's see how:
We open the Table Properties panel on the right by clicking on the icon representing a cog. In the Metrics section, we tick the ObjectiveMetric box. In the Type filter section, we only tick Training job. In the main panel, we sort jobs by descending objective metric by clicking on the arrow. We hold Shift and click the top ten jobs to select them. Then, we click on the Add chart button:
As our training jobs are very short (about a minute), there won't be enough data for Time series charts, so let's select Summary statistics instead. We're going to build a scatter plot, putting the maximum training accuracy and validation accuracy in perspective, as shown in the following screenshot. We also color data points with our trial names:
The next step is to deploy a model and start testing it.
The SageMaker Studio GUI makes it extremely easy to deploy a model. Let's see how:
ep_name = 'my-first-autopilot-endpoint'
sample = '56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0'
import boto3
sm_rt = boto3.Session().client('runtime.sagemaker')
response = sm_rt.invoke_endpoint(EndpointName=ep_name, ContentType='text/csv', Accept='text/csv', Body=sample)
response = response['Body'].read().decode("utf-8")print(response)
This sample is predicted as a no:
no
sm = boto3.Session().client('sagemaker')sm.delete_endpoint(EndpointName=ep_name)
Congratulations, you've successfully built, trained, and deployed your first machine learning model on Amazon SageMaker. That was pretty simple, wasn't it? The only code we wrote was to download the dataset and to predict with our model.
Using the SageMaker Studio GUI is a great way to quickly experiment with a new dataset, and also to let less technical users build models on their own. Now, let's see how we can use SageMaker Autopilot programmatically with the SageMaker SDK.
The Amazon SageMaker SDK includes a simple API for SageMaker Autopilot. You can find its documentation at https://sagemaker.readthedocs.io/en/stable/automl.html.
In this section, you'll learn how to use this API to train a model on the same dataset as in the previous section.
The SageMaker SDK makes it extremely easy to launch an Autopilot job – just upload your data in S3, and call a single API! Let's see how:
import sagemaker
sess = sagemaker.Session()
%%sh
wget -N https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip unzip -o bank-additional.zip
bucket = sess.default_bucket() prefix = 'sagemaker/DEMO-automl-dm'
s3_input_data = upload_data(path="./bank-additional/bank-additional-full.csv", key_prefix=prefix+'input')
from sagemaker.automl.automl import AutoML
auto_ml_job = AutoML( role = sagemaker.get_execution_role(), sagemaker_session = sess, target_attribute_name = 'y', output_path = 's3://{}/{}/output'.format(bucket,prefix), max_runtime_per_training_job_in_seconds = 600, max_candidates = 250, total_job_runtime_in_seconds = 3600 )
auto_ml_job.fit(inputs=s3_input_data, logs=False, wait=False)
The job starts right away. Now let's see how we can monitor its status.
While the job is running, we can use the describe_auto_ml_job() API to monitor its progress:
from time import sleep
job = auto_ml_job.describe_auto_ml_job()job_status = job['AutoMLJobStatus']job_sec_status = job['AutoMLJobSecondaryStatus']
if job_status not in ('Stopped', 'Failed'): while job_status in ('InProgress') and job_sec_status in ('AnalyzingData'):
sleep(30) job = auto_ml_job.describe_auto_ml_job() job_status = job['AutoMLJobStatus'] job_sec_status = job['AutoMLJobSecondaryStatus'] print (job_status, job_sec_status)
job = auto_ml_job.describe_auto_ml_job()
job_candidate_notebook = job['AutoMLJobArtifacts']['CandidateDefinitionNotebookLocation']
job_data_notebook = job['AutoMLJobArtifacts']['DataExplorationNotebookLocation']
print(job_candidate_notebook)print(job_data_notebook)
This prints out the S3 paths for the two notebooks:
s3://sagemaker-us-east-2-123456789012/sagemaker/DEMO-automl-dm/output/automl-2020-04-24-14-21-16-938/sagemaker-automl-candidates/pr-1-a99cb56acb5945d695c0e74afe8ffe3ddaebafa94f394655ac973432d1/notebooks/SageMakerAutopilotCandidateDefinitionNotebook.ipynb
s3://sagemaker-us-east-2-123456789012/sagemaker/DEMO-automl-dm/output/automl-2020-04-24-14-21-16-938/sagemaker-automl-candidates/pr-1-a99cb56acb5945d695c0e74afe8ffe3ddaebafa94f394655ac973432d1/notebooks/SageMakerAutopilotDataExplorationNotebook.ipynb
%%sh -s $job_candidate_notebook $job_data_notebook
aws s3 cp $1 .aws s3 cp $2 .
import pandas as pd
from sagemaker.analytics import ExperimentAnalytics
exp = ExperimentAnalytics( sagemaker_session=sess, experiment_name=job['AutoMLJobName'] + '-aws-auto-ml-job')
df = exp.dataframe()print("Number of jobs: ", len(df))
df = pd.concat([df['ObjectiveMetric - Max'], df.drop(['ObjectiveMetric - Max'], axis=1)], axis=1)
df.sort_values('ObjectiveMetric - Max', ascending=0)[:5]
job_best_candidate = auto_ml_job.best_candidate()
print(job_best_candidate['CandidateName'])print(job_best_candidate['FinalAutoMLJobObjectiveMetric'])
This prints out the name of the best tuning job, along with its validation accuracy:
tuning-job-1-57d7f377bfe54b40b1-030-c4f27053
{'MetricName': 'validation:accuracy', 'Value': 0.9197599935531616}
Then, we can deploy and test the model using the SageMaker SDK. We've covered a lot of ground already, so let's save that for future chapters, where we'll revisit this example.
SageMaker Autopilot creates many underlying artifacts such as dataset splits, pre-processing scripts, pre-processed datasets, models, and so on. If you'd like to clean up completely, the following code snippet will do that. Of course, you could also use the AWS CLI:
import boto3
job_outputs_prefix = '{}/output/{}'.format(prefix, job['AutoMLJobName'])
s3_bucket = boto3.resource('s3').Bucket(bucket)s3_bucket.objects.filter(Prefix=job_outputs_prefix).delete()
Now that we know how to train models using both the SageMaker Studio GUI and the SageMaker SDK, let's take a look under the hood. Engineers like to understand how things really work, right?
In this section, we're going to learn in detail how SageMaker Autopilot processes data and trains models. If this feels too advanced for now, you're welcome to skip this material. You can always revisit it later once you've gained more experience with the service.
First, let's look at the artifacts that SageMaker Autopilot produces.
Listing our S3 bucket confirms the existence of many different artifacts:
$ aws s3 ls s3://sagemaker-us-east-2-123456789012/sagemaker/DEMO-autopilot/output/my-first-autopilot-job/
We can see many new prefixes. Let's figure out what's what:
PRE data-processor-models/PRE preprocessed-data/PRE sagemaker-automl-candidates/PRE transformed-data/PRE tuning/
The preprocessed-data/tuning_data prefix contains the training and validation splits generated from the input dataset. Each split is further broken into small CSV chunks:
The following diagram summarizes the relationship between these artifacts:
In the next sections, we'll take a look at the two auto-generated notebooks, which are one of the most important features in SageMaker Autopilot.
This notebook is available in Amazon S3 once the data analysis step is complete.
The first section, seen in the following screenshot, simply displays a sample of the dataset:
Shown in the following screenshot, the second section focuses on column analysis: percentages of missing values, counts of unique values, and descriptive statistics. For instance, it appears that the pdays field has both a maximum value and a median of 999, which looks suspicious. As explained in the previous chapter, 999 is indeed a placeholder value meaning that a customer has never been contacted before:
As you can see, this notebook saves us the trouble of computing these statistics ourselves, and we can use them to quickly check that the dataset is what we expect.
Now, let's look at the second notebook. As you will see, it's extremely insightful!
This notebook contains the definition of the ten candidate pipelines, and how they're trained. This is a runnable notebook, and advanced practitioners can use it to replay the AutoML process, and keep refining their experiment. Please note that this is totally optional! It's perfectly OK to deploy the top model directly and start testing it.
Having said that, let's run one of the pipelines manually:
from sagemaker_automl import AutoMLInteractiveRunner, AutoMLLocalCandidate
automl_interactive_runner = AutoMLInteractiveRunner(AUTOML_LOCAL_RUN_CONFIG)
automl_interactive_runner.select_candidate( {"data_transformer": { "name": "dpp0", … })
If you're curious about the implementation of RobustImputer or ThresholdOneHotEncoder, hyperlinks take you to the appropriate source file in the sagemaker_sklearn_extension module (https://github.com/aws/sagemaker-scikit-learn-extension/).
This way, you can understand exactly how data has been processed. As these objects are based on scikit-learn objects, they should quickly look very familiar. For instance, we can see that RobustImputer is built on top of sklearn.impute.SimpleImputer, with added functionality. Likewise, ThresholdOneHotEncoder is an extension of sklearn.preprocessing.OneHotEncoder.
automl_interactive_runner.display_candidates()
This is visible in the result here:
This also tells us that data will be processed by the dpp0.py script, and that the model will be trained using the XGBoost algorithm.
numeric_processors = Pipeline( steps=[('robustimputer', RobustImputer(strategy='constant', fill_values=nan))])
categorical_processors = Pipeline( steps=[('thresholdonehotencoder', ThresholdOneHotEncoder(threshold=301))])
column_transformer = ColumnTransformer( transformers=[ ('numeric_processing', numeric_processors, numeric), ('categorical_processing', categorical_processors, categorical)])
return Pipeline(steps=[ ('column_transformer', column_transformer), ('robuststandardscaler', RobustStandardScaler())])
automl_interactive_runner.fit_data_transformers(parallel_jobs=7)
$ aws s3 ls s3://sagemaker-us-east-2-123456789012/sagemaker/DEMO-autopilot/output/my-first-autopilot-job/my-first-a-notebook-run-24-13-17-22/
The first job trains the dpp0 transformers on the input dataset.
The second job processes the input dataset with the resulting model. For the record, this job uses the SageMaker Batch Transform feature, which will be covered in a later chapter.
Going back to SageMaker Studio, let's find out more about these two jobs. Starting from the Experiments tab on the left (the flask icon, remember?), we select Unassigned trial components, and we see our two jobs there: my-first-a-notebook-run-24-13-17-22-dpp0-train-24-13-38-38-aws-training-job and my-first-a-notebook-run-24-13-17-22-dpp0-transform-24-13-38-38-aws-transform-job.
Once data processing is complete, the notebook proceeds with automatic model tuning and model deployment. We haven't yet discussed these topics, so let's stop there for now. I encourage you to go through the rest of the notebook once you're comfortable with them.
As you can see, Amazon SageMaker Autopilot makes it easy to build, train, and optimize machine learning models for beginners and advanced users alike.
In this chapter, you learned about the different steps of an Autopilot job, and what they mean from a machine learning perspective. You also learned how to use both the SageMaker Studio GUI and the SageMaker SDK to build a classification model with minimal coding. Then, we dived deep on the auto-generated notebooks, which give you full control and transparency over the modeling processing. In particular, you learned how to run the Candidate Generation notebook manually, in order to replay all steps involved.
In the next chapter, you will learn how to use the built-in algorithms in Amazon SageMaker to train models for a variety of machine learning problems.