In the previous chapter, you learned how Amazon SageMaker Autopilot makes it easy to build, train, and optimize models automatically, without writing a line of machine learning code.
For problem types that are not supported by SageMaker Autopilot, the next best option is to use one of the algorithms already implemented in SageMaker, and to train it on your dataset. These algorithms are referred to as built-in algorithms, and they cover many typical machine learning problems, from classification to time series to anomaly detection.
In this chapter, you will learn about built-in algorithms for supervised and unsupervised learning, what type of problems you can solve with them, and how to use them with the SageMaker SDK:
You will need an AWS account to run the examples included in this chapter. If you don't already have one, please point your browser to https://aws.amazon.com/getting-started/ to create it. You should also familiarize yourself with the AWS Free Tier (https://aws.amazon.com/free/), which lets you use many AWS services for free within certain usage limits.
You will need to install and configure the AWS command-line interface for your account (https://aws.amazon.com/cli/).
You will need a working Python 3.x environment. Be careful not to use Python 2.7, as it is no longer maintained. Installing the Anaconda distribution (https://www.anaconda.com/) is not mandatory, but strongly encouraged as it includes many projects that we will need (Jupyter, pandas, numpy, and more).
Code examples included in the book are available on GitHub at https://github.com/PacktPublishing/Learn-Amazon-SageMaker. You will need to install a Git client to access them (https://git-scm.com/).
Built-in algorithms are machine learning algorithms implemented, and in some cases invented, by Amazon (https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html). They let you quickly train and deploy your own models without writing a line of machine learning code. Indeed, since the training and prediction algorithm is readily available, you don't have to worry about implementing it, and you can focus on the machine learning problem at hand. As usual, with SageMaker, infrastructure is fully managed, saving you even more time.
In this section, you'll learn about the built-in algorithms for traditional machine learning problems. Algorithms for computer vision and natural language processing will be covered in the next two chapters.
Supervised learning focuses on problems that require a labeled dataset, such as regression, or classification:
Unsupervised learning doesn't require a labeled dataset, and includes problems such as clustering or anomaly detection:
We'll cover some of these algorithms in detail in the rest of this chapter.
Before we dive into training and deploying models with the algorithms, you may wonder why you should use them instead of their counterparts in well-known libraries such as scikit-learn and R.
First, these algorithms have been implemented and tuned by Amazon teams, who are not exactly newcomers to machine learning! A lot of effort has been put into making sure that these algorithms run as fast as possible on AWS infrastructure, no matter what type of instance you use. In addition, many of these algorithms support distributed training out of the box, letting you split model training across a cluster of fully managed instances.
Thanks to this, benchmarks indicate that these algorithms can run up to 10x faster than competing implementations. In many cases, they are also much more cost effective. You can learn more about this at the following links:
Of course, these algorithms benefit from all the features present in SageMaker, as you will find out by the end of the book.
Amazon SageMaker lets you train and deploy models in many different configurations. Although it encourages best practices, it is a modular service that lets you do things your own way.
In this section, we first look at a typical end-to-end workflow, where we use SageMaker from data upload all the way to model deployment. Then, we discuss alternative workflows, and how you can cherry pick the features that you need. Finally, we will take a look under the hood, and see what happens from an infrastructure perspective when we train and deploy.
Let's look at a typical SageMaker workflow. You'll see it again and again in our examples, as well as in the AWS notebooks available on GitHub (https://github.com/awslabs/amazon-sagemaker-examples/):
Understanding this workflow is critical to being productive with Amazon SageMaker. Fortunately, the SageMaker SDK has simple APIs that closely match these steps, so you shouldn't be confused about which one to use, and when to use it.
Before we start looking at the SDK, let's consider alternative workflows that could make sense in your business and technical environments.
Amazon SageMaker is a modular service that lets you work your way. Let's first consider a workflow where you would train on SageMaker and deploy on your own server, whatever the reasons may be.
Steps 1-3 would be the same as in the previous example, and then you would do the following:
a) For XGBoost models: Use one of the implementations available at https://xgboost.ai/ (see the sketch after this list).
b) For BlazingText models: Use the fastText implementation available at https://fasttext.cc/.
c) For all other models: Use Apache MXNet (https://mxnet.apache.org/).
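To make step (a) concrete, here's a minimal sketch of what deploying a SageMaker-trained XGBoost model on your own server could start with. The bucket, key, and the file name inside the artifact (xgboost-model) are assumptions; depending on the algorithm version, the artifact may also contain a pickled Booster rather than a native model file:

import tarfile
import boto3
import xgboost as xgb

# Hypothetical location of the training artifact produced by SageMaker
bucket = 'my-bucket'
key = 'boston-housing/output/model.tar.gz'

# Download and extract the artifact
boto3.resource('s3').Bucket(bucket).download_file(key, 'model.tar.gz')
with tarfile.open('model.tar.gz') as tar:
    tar.extractall()

# Load it with the open source XGBoost library
# (older containers store a pickled Booster instead, so adjust accordingly)
booster = xgb.Booster()
booster.load_model('xgboost-model')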
Now, let's see how you could import an existing model and deploy it on SageMaker:
This is just a quick look. We'll run full examples for both workflows in Chapter 11, Managing Models in Production.
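As a quick preview, a minimal sketch with the SDK's generic Model object could look like this; the container image, artifact location, and role below are placeholders, not the definitive procedure:

import sagemaker
from sagemaker.model import Model

# Hypothetical inference image and artifact trained elsewhere, then uploaded to S3
model = Model(
    image_uri='123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-inference-image:latest',
    model_data='s3://my-bucket/my-model/model.tar.gz',
    role=sagemaker.get_execution_role())

predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.t2.medium')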
All SageMaker jobs run on managed infrastructure. Let's take a look under the hood, and see what happens when we train and deploy models.
All SageMaker algorithms must be packaged in Docker containers. Don't worry, you don't need to know much about Docker in order to use SageMaker. If you're not familiar with it, I would recommend going through this tutorial to understand key concepts and tools: https://docs.docker.com/get-started/. It's always good to know a little more than actually required!
As you would expect, built-in algorithms are pre-packaged, and containers are readily available for training and deployment. They are hosted in Amazon Elastic Container Registry (ECR), AWS' Docker registry service (https://aws.amazon.com/ecr/). As ECR is a region-based service, you will find a collection of containers in each region where SageMaker is available.
You can find the list of built-in algorithm containers at https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html. For instance, the name of the container for the Linear Learner algorithm in the eu-west-1 region is 438346466558.dkr.ecr.eu-west-1.amazonaws.com/linear-learner:latest. These containers can only be pulled to SageMaker managed instances, so you won't be able to run them on your local machine.
Now let's look at the underlying infrastructure.
When you launch a training job, SageMaker fires up infrastructure according to your requirements (instance type and instance count).
Once a training instance is in service, it pulls the appropriate training container from ECR. Hyperparameters are applied to the algorithm, which also receives the location of your dataset. By default, the algorithm then copies the full dataset from S3, and starts training. If distributed training is configured, SageMaker automatically distributes dataset batches to the different instances in the cluster.
Once training is complete, the model is packaged in a model artifact saved in S3. Then, the training infrastructure is shut down automatically. Logs are available in Amazon CloudWatch Logs. Last but not least, you're only charged for the exact amount of training time.
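If you prefer fetching training logs programmatically rather than browsing the CloudWatch console, a short boto3 sketch along these lines should do; the job name below is a made-up example:

import boto3

logs = boto3.client('logs')
job_name = 'linear-learner-2023-01-01-12-00-00-000'   # hypothetical training job name

# SageMaker training jobs log to the /aws/sagemaker/TrainingJobs log group
streams = logs.describe_log_streams(
    logGroupName='/aws/sagemaker/TrainingJobs',
    logStreamNamePrefix=job_name)

for stream in streams['logStreams']:
    events = logs.get_log_events(
        logGroupName='/aws/sagemaker/TrainingJobs',
        logStreamName=stream['logStreamName'])
    for e in events['events']:
        print(e['message'])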
When you launch a deployment job, SageMaker once again creates infrastructure according to your requirements.
Let's focus on real-time endpoints for now, and not on batch transform.
Once an endpoint instance is in service, it pulls the appropriate prediction container from ECR, and loads your model from S3. Then, the HTTPS endpoint is provisioned, and is ready for prediction within minutes.
If you configured the endpoint with several instances, load balancing and high availability are set up automatically. If you configured Auto Scaling, this is applied as well.
As you would expect, an endpoint stays up until it's deleted explicitly, either in the AWS Console or with a SageMaker API call. In the meantime, you will be charged for the endpoint, so please make sure to delete endpoints that you don't need!
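To keep an eye on running endpoints (and on your bill), you can also list and delete them with boto3; here's a minimal sketch:

import boto3

sm = boto3.client('sagemaker')

# List existing endpoints and their status
for ep in sm.list_endpoints()['Endpoints']:
    print(ep['EndpointName'], ep['EndpointStatus'])

# Delete one explicitly (same effect as the SDK's delete_endpoint())
# sm.delete_endpoint(EndpointName='linear-learner-demo-29-08-37-25')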
Now that we understand the big picture, let's start looking at the SageMaker SDK, and how we can use it to train and deploy models.
Being familiar with the SageMaker SDK is important to making the most of SageMaker. You can find its documentation at https://sagemaker.readthedocs.io.
Walking through a simple example is the best way to get started. In this section, we'll use the Linear Learner algorithm to train a regression model on the Boston Housing dataset. We'll proceed very slowly, leaving no stone unturned. Once again, these concepts are essential, so please take your time, and make sure you understand every step fully.
Note:
Reminder: I recommend that you follow along and run the code available in the companion GitHub repository. Every effort has been made to check all code samples present in the text. However, for those of you who have an electronic version, copying and pasting may have unpredictable results: formatting issues, weird quotes, and so on.
Built-in algorithms expect the dataset to be in a certain format, such as CSV, protobuf, or libsvm. Supported formats are listed in the algorithm documentation. For instance, Linear Learner supports CSV and recordIO-wrapped protobuf (https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html#ll-input_output).
Our input dataset is already in the repository in CSV format, so let's use that. Dataset preparation will be extremely simple, and we'll run it manually:
import pandas as pd
dataset = pd.read_csv('housing.csv')
print(dataset.shape)
It contains 506 samples and 13 columns:
(506, 13)
dataset[:5]
This prints out the table visible in the following diagram. For each house, we see 12 features, and a target attribute (medv) set to the median value of the house in thousands of dollars:
# Move the target attribute (medv) to the first column, as expected for CSV training data
dataset = pd.concat([dataset['medv'], dataset.drop(['medv'], axis=1)], axis=1)
from sklearn.model_selection import train_test_split
training_dataset, validation_dataset = train_test_split(dataset, test_size=0.1)
training_dataset.to_csv('training_dataset.csv', index=False, header=False)
validation_dataset.to_csv('validation_dataset.csv', index=False, header=False)
import sagemaker
sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = 'boston-housing'
training_data_path = sess.upload_data( path='training_dataset.csv', key_prefix=prefix + '/input/training')
validation_data_path = sess.upload_data( path='validation_dataset.csv', key_prefix=prefix + '/input/validation')
print(training_data_path)
print(validation_data_path)
The two S3 paths look like this. Of course, the account number in the default bucket name will be different:
s3://sagemaker-eu-west-1-123456789012/boston-housing/input/training/training_dataset.csv
s3://sagemaker-eu-west-1-123456789012/boston-housing/input/validation/validation_dataset.csv
Now that data is ready in S3, we can configure the training job.
The Estimator object (sagemaker.estimator.Estimator) is the cornerstone of model training. It lets you select the appropriate algorithm, define your training infrastructure requirements, and more.
The SageMaker SDK also includes algorithm-specific estimators, such as sagemaker.LinearLearner or sagemaker.PCA. I generally find them less flexible than the generic estimator (no CSV support, for one thing), and I don't recommend using them. Using the Estimator object also lets you reuse your code across examples, as we will see in the next sections:
import boto3
from sagemaker import image_uris
region = boto3.Session().region_name
container = image_uris.retrieve('linear-learner', region)
from sagemaker.estimator import Estimator
ll_estimator = Estimator(
    container,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.m5.large',
    output_path='s3://{}/{}/output'.format(bucket, prefix))
SageMaker supports plenty of different instance types, with some differences across AWS regions. You can find the full list at https://docs.aws.amazon.com/sagemaker/latest/dg/instance-types-az.html.
Which one should we use here? Looking at the Linear Learner documentation (https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html#ll-instances), we see that you can train the Linear Learner algorithm on single- or multi-machine CPU and GPU instances. Here, we're working with a tiny dataset, so let's select the smallest training instance available in our region: ml.m5.large.
Checking the pricing page (https://aws.amazon.com/sagemaker/pricing/), we see that this instance costs $0.15 per hour in the eu-west-1 region (the one I'm using for this job).
Let's look at the documentation, and see which hyperparameters are mandatory (https://docs.aws.amazon.com/sagemaker/latest/dg/ll_hyperparameters.html). As it turns out, there is only one: predictor_type. It defines the type of problem that Linear Learner is training on (regression, binary classification, or multiclass classification).
Taking a deeper look, we see that the default value for mini_batch_size is 1000: this isn't going to work well with our 506-sample dataset, so let's set it to 32. We also learn that the normalize_data parameter is set to true by default, which makes it unnecessary to normalize data ourselves:
ll_estimator.set_hyperparameters( predictor_type='regressor', mini_batch_size=32)
training_data_channel = sagemaker.TrainingInput( s3_data=training_data_path, content_type='text/csv')
validation_data_channel = sagemaker.TrainingInput( s3_data=validation_data_path, content_type='text/csv')
By default, data served by a channel will be fully copied to each training instance, which is fine for small datasets. We'll study alternatives in Chapter 10, Advanced Training Techniques.
All it takes is one line of code:
ll_estimator.fit({'train': training_data_channel, 'validation': validation_data_channel})
Immediately, the training job starts:
Starting - Starting the training job.
Here are the first few lines, showing the infrastructure being provisioned, as explained earlier in the Using fully managed infrastructure section:
Starting - Starting the training job...
Starting - Launching requested ML instances......
Starting - Preparing the instances for training...
Downloading - Downloading input data...
Training - Training image download completed.
#quality_metric: host=algo-1, validation mse <loss>=13.7226685169
#quality_metric: host=algo-1, validation absolute_loss <loss>=2.86944983987
Uploading - Uploading generated training model
Completed - Training job completed
Training seconds: 49
Billable seconds: 49
We mentioned earlier that the cost for an ml.m5.large instance is $0.15 per hour. As we trained for 49 seconds, this job cost us (49/3600)*0.15 = $0.002, one fifth of a penny. Any time spent setting up infrastructure ourselves would certainly have cost more!
%%bash -s "$ll_estimator.output_path"
aws s3 ls --recursive $1
You should see the model artifact: model.tar.gz
We'll see in Chapter 11, Deploying Machine Learning Models, what's inside that artifact, and how to deploy the model outside of SageMaker. For now, let's deploy it to a real-time endpoint.
This is my favorite part in SageMaker; we only need one line of code to deploy a model to an HTTPS endpoint:
from time import strftime, gmtime
timestamp = strftime('%d-%H-%M-%S', gmtime())
endpoint_name = 'linear-learner-demo-'+timestamp
print(endpoint_name)
Here, the endpoint name is linear-learner-demo-29-08-37-25.
ll_predictor = ll_estimator.deploy( endpoint_name=endpoint_name, initial_instance_count=1, instance_type='ml.t2.medium')
While the endpoint is created, we can see it in the Endpoints section of the SageMaker console, and in the Endpoints tab of SageMaker Studio.
ll_predictor.content_type = 'text/csv'
ll_predictor.serializer = sagemaker.serializers.CSVSerializer()
ll_predictor.deserializer = sagemaker.deserializers.CSVDeserializer()
test_sample = '0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,4.98'
response = ll_predictor.predict(test_sample)
print(response)
The prediction output tells us that this house should cost $30,173:
[['{"predictions": [{"score": 30.17342185974121}]}']]
We can also predict multiple samples at a time:
test_samples = ['0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,4.98',
                '0.02731,0.00,7.070,0,0.4690,6.4210,78.90,4.9671,2,242.0,17.80,9.14']
response = ll_predictor.predict(test_samples)
print(response)
Now the prediction output is as follows:
[['{"predictions": [{"score": 30.413358688354492}', '{"score": 24.884408950805664}]}']]
When we're done working with the endpoint, we shouldn't forget to delete it to avoid unnecessary charges.
Deleting an endpoint is as simple as calling the delete_endpoint() API:
ll_predictor.delete_endpoint()
At the risk of repeating myself, the topics covered in this section are extremely important, so please make sure you're completely familiar with them, as we'll constantly use them in the rest of the book. Please spend some time reading the service and SDK documentation as well:
Now let's explore other built-in algorithms. You'll see that the workflow and the code are very similar!
In the rest of this chapter, we will run more examples with built-in algorithms, both in supervised and unsupervised mode. This will help you become very familiar with the SageMaker SDK, and learn how to solve actual machine learning problems. The following list shows some of these algorithms:
Let's train a model on the Boston Housing dataset with the XGBoost algorithm (https://github.com/dmlc/xgboost). As we will see in Chapter 7, Using Built-in Frameworks, SageMaker also supports XGBoost scripts:
import boto3
from sagemaker import image_uris
region = boto3.Session().region_name
container = image_uris.retrieve('xgboost', region, version='latest')
xgb_estimator = Estimator(
    container,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.m5.large',
    output_path='s3://{}/{}/output'.format(bucket, prefix))
xgb_estimator.set_hyperparameters( objective='reg:linear', num_round=200, early_stopping_rounds=10)
training_data_channel = sagemaker.TrainingInput( s3_data=training_data_path, content_type='text/csv')
validation_data_channel = sagemaker.TrainingInput( s3_data=validation_data_path, content_type='text/csv')
xgb_estimator.fit({'train': training_data_channel, 'validation': validation_data_channel})
[12]#011train-rmse:1.25702#011validation-rmse:2.43126
<output removed>
[22]#011train-rmse:0.722193#011validation-rmse:2.43355
from time import strftime, gmtime
timestamp = strftime('%d-%H-%M-%S', gmtime())
endpoint_name = 'xgb-demo'+'-'+timestamp
xgb_predictor = xgb_estimator.deploy( endpoint_name=endpoint_name, initial_instance_count=1, instance_type='ml.t2.medium')
test_sample = '0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,4.98'
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = sagemaker.serializers.CSVSerializer()
xgb_predictor.deserializer = sagemaker.deserializers.CSVDeserializer()
response = xgb_predictor.predict(test_sample)
print(response)
The result tells us that this house should cost $23,730:
[['23.73023223876953']]
xgb_predictor.delete_endpoint()
As you can see, the SageMaker workflow is pretty simple, and makes it easy to experiment quickly with different algorithms without having to rewrite all your code.
Let's move on to the Factorization Machines algorithm. In the process, we will learn about the highly efficient recordIO-wrapped protobuf format.
Factorization Machines is a generalization of linear models (https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf). It's well suited to high-dimensional sparse datasets, such as the user-item interaction matrices used for recommendation.
In this example, we're going to train a recommendation model based on the MovieLens dataset (https://grouplens.org/datasets/movielens/).
The dataset exists in several versions. To minimize training times, we'll use the 100k version. It contains 100,000 ratings (integer values from 1 to 5) assigned by 943 users to 1,682 movies. The dataset is already split for training and validation.
As you know by now, training and deploying with SageMaker is very simple. Most of the code will be identical to the two previous examples, which is great! This lets us focus on understanding and preparing data.
Imagine building a matrix to store this dataset. It would have 943 lines (one per user) and 1,682 columns (one per movie). Cells would store the ratings. The following diagram shows a basic example:
Hence, the matrix would have 943*1,682=1,586,126 cells. However, as only 100,000 ratings are present, 93.69% of cells would be empty. Storing our dataset this way would be extremely inefficient. It would needlessly consume RAM, storage, and network bandwidth to store and transfer lots of zero values!
In fact, things are much worse, as the algorithm expects the input dataset to look like the one in the following diagram:
Why do we need to store data this way? The answer is simple: Factorization Machines is a supervised learning algorithm, so we need to train it on labeled samples.
Looking at the preceding diagram, we see that each line represents a movie review. The matrix on the left stores its one-hot encoded features (users and movies), and the vector on the right stores its label. For instance, the last line tells us that user 4 has given movie 5 a "5" rating.
The size of this matrix is 100,000 lines by 2,625 columns (943 users plus 1,682 movies). The total number of cells is 262,500,000, of which only 0.076% are filled (200,000 / 262,500,000), since each line holds exactly two one-hot features. If we used a 32-bit value for each cell, we would need about a gigabyte of memory to store this matrix. This is horribly inefficient, but still manageable.
Just for fun, let's do the same exercise for the largest version of MovieLens, which has 25 million ratings, 62,000 movies and 162,000 users. The matrix would have 25 million lines and 224,000 columns, for a total of 5,600,000,000,000 cells. Yes, that's 5.6 trillion cells, and although they would be 99.999% empty, we would still need over 20 terabytes of RAM to store them. Ouch. If that's not bad enough, consider recommendation models with millions of users and products: the numbers are mind-boggling!
Instead of using a plain matrix, we'll use a sparse matrix, a data structure specifically designed and optimized for sparse datasets. SciPy has exactly the object we need, named lil_matrix (https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.lil_matrix.html). This will help us get rid of all these nasty zeros.
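If you haven't worked with sparse matrices before, here's a tiny, self-contained example showing how lil_matrix only stores the non-zero cells:

from scipy.sparse import lil_matrix

X = lil_matrix((3, 5), dtype='float32')   # 3x5 matrix, all zeros, almost no memory used
X[0, 1] = 1
X[2, 4] = 1

print(X.nnz)          # 2: only the non-zero cells are stored
print(X.toarray())    # expand to a dense numpy array only when needed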
So how will we pass this sparse matrix to the SageMaker algorithm? As you would expect, we're going to serialize the object, and store it in S3. We're not going to use Python serialization, however. Instead, we're going to use protobuf (https://developers.google.com/protocol-buffers/), a popular and efficient serialization mechanism.
In addition, we're going to store the protobuf-encoded data in a record format called RecordIO (https://mxnet.apache.org/api/faq/recordio/). Our dataset will be stored as a sequence of records in a single file. This has the following benefits:
Don't worry if you're not familiar with protobuf and RecordIO. The SageMaker SDK includes utility functions that hide their complexity.
We will begin building the model using the following steps:
%%sh
wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip
%cd ml-100k
!shuf ua.base -o ua.base.shuffled
!head -5 ua.base.shuffled
We see four columns: the user ID, the movie ID, the rating, and a timestamp (which we'll ignore in our model):
378  43   3  880056609
919  558  5  875372988
90   285  5  891383687
249  245  2  879571999
416  64   5  893212929
num_users = 943
num_movies = 1682
num_features = num_users+num_movies
num_ratings_train = 90570
num_ratings_test = 9430
import csv
import numpy as np
from scipy.sparse import lil_matrix
def loadDataset(filename, lines, columns):
    # Features are one-hot encoded in a sparse matrix
    X = lil_matrix((lines, columns)).astype('float32')
    # Labels are stored in a vector
    Y = []
    line = 0
    with open(filename, 'r') as f:
        samples = csv.reader(f, delimiter='\t')
        for userId, movieId, rating, timestamp in samples:
            X[line, int(userId)-1] = 1
            X[line, int(num_users)+int(movieId)-1] = 1
            Y.append(int(rating))
            line = line+1
    Y = np.array(Y).astype('float32')
    return X, Y
X_train, Y_train = loadDataset('ua.base.shuffled', num_ratings_train, num_features)
X_test, Y_test = loadDataset('ua.test', num_ratings_test, num_features)
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)
This displays the dataset shapes:
(90570, 2625)
(90570,)
(9430, 2625)
(9430,)
import io, boto3
import sagemaker.amazon.common as smac
def writeDatasetToProtobuf(X, Y, bucket, prefix, key):
    buf = io.BytesIO()
    smac.write_spmatrix_to_sparse_tensor(buf, X, Y)
    buf.seek(0)
    obj = '{}/{}'.format(prefix, key)
    boto3.resource('s3').Bucket(bucket).Object(obj).upload_fileobj(buf)
    return 's3://{}/{}'.format(bucket, obj)
Had our data been stored in a numpy array instead of a lil_matrix, we would have used the write_numpy_to_dense_tensor() function instead. It has the same effect.
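For reference, here's a quick sketch of the dense variant, using a small made-up numpy array; our own dataset stays sparse:

import io
import numpy as np
import sagemaker.amazon.common as smac

# Hypothetical dense features and labels
X_dense = np.random.rand(100, 10).astype('float32')
Y_dense = np.random.randint(0, 2, 100).astype('float32')

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, X_dense, Y_dense)
buf.seek(0)
# buf can now be uploaded to S3, exactly like in writeDatasetToProtobuf()

Now, back to our sparse dataset: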
import sagemaker
bucket = sagemaker.Session().default_bucket()
prefix = 'fm-movielens'
train_key = 'train.protobuf'
train_prefix = '{}/{}'.format(prefix, 'train')
test_key = 'test.protobuf'
test_prefix = '{}/{}'.format(prefix, 'test')
output_prefix = 's3://{}/{}/output'.format(bucket, prefix)
train_data = writeDatasetToProtobuf(X_train, Y_train, bucket, train_prefix, train_key)
test_data = writeDatasetToProtobuf(X_test, Y_test, bucket, test_prefix, test_key)
$ aws s3 ls s3://sagemaker-eu-west-1-123456789012/fm-movielens/train/train.protobuf
5796480 train.protobuf
from sagemaker import image_uris
region=boto3.Session().region_name
container=image_uris.retrieve('factorization-machines', region)
fm = sagemaker.estimator.Estimator(
    container,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.c5.xlarge',
    output_path=output_prefix)
fm.set_hyperparameters( feature_dim=num_features, predictor_type='regressor', num_factors=64, epochs=10)
Looking at the documentation (https://docs.aws.amazon.com/sagemaker/latest/dg/fact-machines-hyperparameters.html), we see that the required hyperparameters are feature_dim, predictor_type, and num_factors. The default setting for epochs is 1, which feels a little low, so we use 10 instead.
fm.fit({'train': train_data, 'test': test_data})
endpoint_name = 'fm-movielens-100k'
fm_predictor = fm.deploy( endpoint_name=endpoint_name, instance_type='ml.t2.medium', initial_instance_count=1)
import json
def fm_serializer(data):
    js = {'instances': []}
    for row in data:
        js['instances'].append({'features': row.tolist()})
    return json.dumps(js)
fm_predictor.content_type = 'application/json'
fm_predictor.serializer = fm_serializer
result = fm_predictor.predict(X_test[:3].toarray())
print(result)
The prediction looks like this:
{'predictions': [{'score': 3.3772034645080566}, {'score': 3.4299235343933105}, {'score': 3.6053106784820557}]}
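To sanity-check these scores against the true ratings, we can compare them to the corresponding labels; here's a small sketch, assuming result is the parsed dictionary shown above:

import numpy as np

predicted = np.array([p['score'] for p in result['predictions']])
actual = Y_test[:3]

# Root mean squared error on these three samples
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(predicted, actual, rmse)

When we're done with the endpoint, we delete it: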
fm_predictor.delete_endpoint()
So far, we've only used supervised learning algorithms. In the next section, we'll move on to unsupervised learning with Principal Component Analysis.
Principal Component Analysis (PCA) is a dimensionality reduction algorithm. It's often applied as a preliminary step before regression or classification. Let's use it on the protobuf dataset built in the Factorization Machines example: its 2,625 columns are a good candidate for dimensionality reduction! We will apply PCA through the following steps:
import boto3
from sagemaker import image_uris
region = boto3.Session().region_name
container = image_uris.retrieve('pca', region)
pca = sagemaker.estimator.Estimator(
    container,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.c5.xlarge',
    output_path=output_prefix)
pca.set_hyperparameters(feature_dim=num_features, num_components=64, mini_batch_size=1024)
pca.fit({'train': train_data, 'test': test_data})

pca_predictor = pca.deploy(
    endpoint_name='pca-movielens-100k',
    instance_type='ml.t2.medium',
    initial_instance_count=1)
import json
def pca_serializer(data):
    js = {'instances': []}
    for row in data:
        js['instances'].append({'features': row.tolist()})
    return json.dumps(js)
pca_predictor.content_type = 'application/json'
pca_predictor.serializer = pca_serializer
result = pca_predictor.predict(X_test[0].toarray())
print(result)
This prints out the 64 principal components of the test sample. In real life, we typically would process the dataset with this model, save the results, and use them to train a regression model:
{'projections': [{'projection': [-0.008711372502148151, 0.0019895541481673717, 0.002355781616643071, 0.012406938709318638, -0.0069608548656105995, -0.009556426666676998, <output removed>]}]}
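In that spirit, here's a hedged sketch of how we could project the whole test set in small batches and save the reduced features for later use; the batch size is arbitrary, and we assume the endpoint returns the parsed dictionary shown above:

import numpy as np

projections = []
batch_size = 100

for start in range(0, X_test.shape[0], batch_size):
    # Densify only a small slice at a time before sending it to the endpoint
    batch = X_test[start:start+batch_size].toarray()
    response = pca_predictor.predict(batch)
    for p in response['projections']:
        projections.append(p['projection'])

reduced = np.array(projections)
print(reduced.shape)        # (9430, 64): 64 components instead of 2,625 features
np.save('X_test_pca.npy', reduced)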
Don't forget to delete the endpoint when you're done. Then, let's run one more unsupervised learning example to conclude this chapter!
Random Cut Forest (RCF) is an unsupervised learning algorithm for anomaly detection (https://proceedings.mlr.press/v48/guha16.pdf). We're going to apply it to a subset of the household electric power consumption dataset (https://archive.ics.uci.edu/ml/), available in the GitHub repository for this book. The data has been aggregated hourly over a period of a little less than a year (just under 8,000 values):
import pandas as pd
df = pd.read_csv('item-demand-time.csv', dtype = object, names=['timestamp','value','client'])
df.head(3)
As shown in the following screenshot, the dataset has three columns: an hourly timestamp, the power consumption value (in kilowatt-hours), and the client ID:
import matplotlib
import matplotlib.pyplot as plt
df.value = pd.to_numeric(df.value)
df_plot = df.pivot(index='timestamp', columns='client', values='value')
df_plot.plot(figsize=(40,10))
The plot is shown in the following diagram. We see three time series corresponding to three different clients:
df = df[df['client']=='client_12']
df = df.drop(['client', 'timestamp'], axis=1)
df.value *= 100
df.value = df.value.astype('int32')
df.head()
The following diagram shows the first lines of the transformed dataset:
import boto3
import sagemaker
sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = 'electricity'
df.to_csv('electricity.csv', index=False, header=False)
training_data_path = sess.upload_data( path='electricity.csv', key_prefix=prefix + '/input/training')
First, the content type must state that the data is not labeled. The reason is that RCF accepts an optional test channel where anomalies are labeled (label_size=1). Even though the training channel never has labels, we still need to tell RCF. Second, the only distribution policy that RCF supports is ShardedByS3Key. This policy splits the dataset across the different instances in the training cluster, instead of sending each one a full copy. We won't run distributed training here, but we need to set that policy nonetheless:
training_data_channel = sagemaker.TrainingInput( s3_data=training_data_path, content_type='text/csv;label_size=0', distribution='ShardedByS3Key')
rcf_data = {'train': training_data_channel}
from sagemaker.estimator import Estimator
from sagemaker import image_uris
role = sagemaker.get_execution_role()
region = boto3.Session().region_name
container = image_uris.retrieve('randomcutforest', region)
rcf_estimator = Estimator(
    container,
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path='s3://{}/{}/output'.format(bucket, prefix))
rcf_estimator.set_hyperparameters(feature_dim=1)
rcf_estimator.fit(rcf_data)
endpoint_name = 'rcf-demo'

rcf_predictor = rcf_estimator.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type='ml.t2.medium')
rcf_predictor.content_type = 'text/csv'
rcf_predictor.serializer = sagemaker.serializers.CSVSerializer()
rcf_predictor.deserializer = sagemaker.deserializers.JSONDeserializer()
values = df['value'].astype('str').tolist()
response = rcf_predictor.predict(values)
print(response)
The response contains the anomaly score for each value in the time series. It looks like this:
{'scores': [{'score': 1.0868037776}, {'score': 1.5307718138}, {'score': 1.4208102841} …
from statistics import mean,stdev
scores = []
for s in response['scores']:
    scores.append(s['score'])
score_mean = mean(scores)
score_std = stdev(scores)
Let's plot an interval of the time series where electricity consumption drops unexpectedly, and then plot the corresponding anomaly scores, with a red line marking three standard deviations above the mean score:

df[2000:2500].plot(figsize=(40,10))
plt.figure(figsize=(40,10))
plt.plot(scores[2000:2500])
plt.autoscale(tight=True)
plt.axhline(y=score_mean+3*score_std, color='red')
plt.show()
The drop is clearly visible in the following plot:
As you can see in the following score plot, its score is sky-high! Beyond a doubt, this value is an anomaly:
Exploring other intervals of the time series, we could certainly find more. Who said machine learning wasn't fun?
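If you'd like to automate that exploration, a small sketch like the following flags every point whose score exceeds our three-standard-deviation threshold; the indices line up with the rows of df, since that's where we built the request from:

threshold = score_mean + 3*score_std

anomalies = [(i, score) for i, score in enumerate(scores) if score > threshold]
print(len(anomalies), 'anomalous points')
for i, score in anomalies[:10]:
    print('index', i, 'score', score, 'value', df['value'].iloc[i])

As always, let's delete the endpoint once we're finished: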
rcf_predictor.delete_endpoint()
Having gone through five complete examples, you should now be familiar with built-in algorithms, the SageMaker workflow, and the SDK. To fully master these topics, I would recommend experimenting with your datasets, and running the additional examples available at https://github.com/awslabs/amazon-sagemaker-examples/tree/master/introduction_to_amazon_algorithms.
As you can see, built-in algorithms are a great way to quickly train and deploy models without having to write any machine learning code.
In this chapter, you learned about the SageMaker workflow, and how to implement it with a handful of APIs from the SageMaker SDK, without ever worrying about infrastructure.
You learned how to work with data in CSV and RecordIO-wrapped protobuf format, the latter being the preferred format for large-scale training on bulky datasets. You also learned how to build models with important algorithms for supervised and unsupervised learning: Linear Learner, XGBoost, Factorization Machines, PCA, and Random Cut Forest.
In the next chapter, you will learn how to use additional built-in algorithms to build computer vision models.