An example – AWS and Docker

The rest of this chapter is dedicated to running the matplotlib USGS/EROS image generation task on AWS using EC2, S3, and Docker. We need to perform two stages of preparation: the work that needs to be done locally and the setup that needs to happen in the Cloud. With these complete, we will be ready to execute our prepared task.

Getting set up locally

Your local setup will include an installation of Docker (and boot2docker if you are using Mac or Windows). You will create or download Dockerfiles, generate images from these files, extend the base images as necessary, and start up a Docker container to ensure that everything is in working order.

Requirements

Here's what you will need for the remainder of this chapter:

  • Docker
  • boot2docker (for easily using Docker from Windows or Mac)

If you're running Linux, you can skip the rest of this section. If you haven't run boot2docker before, you'll need to run the following command first:

$ boot2docker init

If you have previously initialized boot2docker, you can just run the following command:

$ boot2docker up

At this point, you will see output that looks like the following:

Waiting for VM and Docker daemon to start......ooo
Started.
Writing ~/.boot2docker/certs/boot2docker-vm/ca.pem
Writing ~/.boot2docker/certs/boot2docker-vm/cert.pem
Writing ~/.boot2docker/certs/boot2docker-vm/key.pem
To connect the Docker client to the Docker daemon, please set:
export DOCKER_CERT_PATH=~/.boot2docker/certs/boot2docker-vm
export DOCKER_TLS_VERIFY=1
export DOCKER_HOST=tcp://192.168.59.103:2376

You can either export the environment variables manually or run the following command from your shell prompt:

$ eval "$(boot2docker shellinit)"

The preceding command will set the appropriate variables in your shell environment automatically. At this point, Docker is ready for use.

Dockerfiles and the Docker images

The heart of configuration management is the Dockerfile. This will be used to generate the Docker image that you need in order to run Docker containers, which are where your matplotlib tasks will actually happen. If you are unfamiliar with Docker, here's a quick summary of how to think about the components that we have just mentioned:

  • Dockerfile: This is the specification that is used to build extendible images
  • The Docker image: This is a read-only template, which is somewhat like a filesystem
  • The Docker container: This is an isolated and secure application platform; this is what actually gets run

One of the features of Docker is that, through its underlying use of a union file system, it lets you layer images, starting with a base image and adding increasingly more specific images until the desired configuration state is achieved. This is exactly what we will do. The company that you work for, as stated earlier, is a Python 3 shop, so it has built a Docker image that has all the basic goodies for Python 3 on Ubuntu 15.04. Furthermore, since the research and computation groups make heavy use of NumPy, SciPy, Pandas, and matplotlib, a second image has been created by using the Python 3 image as a base.

Here's what the Python 3 Dockerfile looks like:

In [11]: cat ../docker/python/Dockerfile
         FROM ubuntu:vivid
         MAINTAINER Duncan McGreggor <[email protected]>
         ENV DEBIAN_FRONTEND noninteractive
         RUN apt-get update
         RUN apt-get upgrade -y
         RUN apt-get install -y -q apt-utils
         RUN apt-get install -y -q \
             ca-certificates git build-essential
         RUN apt-get install -y -q \
             libssl-dev libcurl4-openssl-dev
         RUN apt-get install -y -q curl
         RUN apt-get install -y -q \
             cython3 libpython3.4-dev python3.4-dev \
             python3-setuptools python3-pip
         CMD python3

Note that this Dockerfile has not been created from scratch. Rather, it is based on another Docker image: the official ubuntu:vivid image. It declares a maintainer and sets an environment variable, which will be available to each of the RUN and CMD directives as well as when the Docker image is running (with or without an interactive session). Each of the RUN commands is executed when building the Docker image. The CMD command is what will be run by default when executing docker run on the command line.

This Dockerfile has been used to generate an image, which has been published to Docker Hub with the masteringmatplotlib/python tag. As such, you will not need to build this yourself.

The next Dockerfile that we will look at is the one that your group uses for the majority of its scientific computing tasks. Here it is:

In [12]: cat ../docker/scipy/Dockerfile
         FROM masteringmatplotlib/python
         MAINTAINER Duncan McGreggor <[email protected]>
         ENV DEBIAN_FRONTEND noninteractive
         RUN apt-get install -y -q \
             libatlas3-base libblas-dev libblas3 \
             libatlas-base-dev libatlas-dev \
             liblapack-dev gfortran
         ENV LAPACK /usr/lib/liblapack.so
         ENV ATLAS /usr/lib/libatlas.so
         ENV BLAS /usr/lib/libblas.so
         RUN apt-get install -y -q \
             python3-six python3-flake8 \
             python3-dateutil python3-pyparsing \
             python3-numpy python3-scipy \
             python3-matplotlib python3-pandas
         RUN pip3 install seaborn
         CMD python3

In this case, the Dockerfile is based on the Python 3 image. It extends that image with the additional libraries that are commonly needed for scientific computing in Python. The Dockerfile has been used to create an image that is pushed to Docker Hub under the masteringmatplotlib/scipy tag. This is the one that we will be extending for our task.

Extending a Docker image

The preceding scipy Docker image has almost everything we need. It's just missing a few dependencies, which are available in this chapter's Git repository. These dependencies include the following:

  • PIL
  • The scikit-image library
  • Custom code to work with the USGS EROS/NASA Landsat 8 data

So, how can we customize the scipy image to include the preceding dependencies? There are two ways to do this:

  • Make changes to the image and commit these changes
  • Create a Dockerfile that is based on the image

We will use the second option so that we can easily track changes to the Dockerfile in source control. We've provided the following file in the notebook repository:

In [13]: cat ../docker/simple/Dockerfile
         FROM masteringmatplotlib/scipy
         MAINTAINER Py3 Hacker <[email protected]>
         ENV HOME /root
         ENV REPO cloud-deploy
         RUN cd $HOME && \
             git clone \
             https://github.com/masteringmatplotlib/${REPO}.git
         RUN cd $HOME/$REPO && \
             make docker-setup
         CMD PYTHONPATH=$HOME/$REPO/lib:$PYTHONPATH \
             python3

Tip

Points to note:

  • The preceding Dockerfile extends the masteringmatplotlib/scipy Docker image.
  • Being able to use the standard development workflows that we are used to, such as cloning the required code, is incredibly powerful and quite easy to accomplish thanks to Docker's simple design.
  • For ease of demonstration, we're going to simply use the notebook repository and add it to PYTHONPATH. In most situations, you would create a setup.py file for your Python library and install it with pip in the Dockerfile build steps, so that you don't have to mess with PYTHONPATH when running commands in the container; a minimal sketch of this approach follows this list.
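
For reference, here is roughly what that longer-term approach might look like. This is a minimal sketch only: the package name, module layout, and dependency list are assumptions for illustration, not the chapter's actual packaging.

from setuptools import setup, find_packages

setup(
    name="eros",                      # hypothetical package name
    version="0.1.0",
    description="Landsat 8 scene image generation helpers",
    package_dir={"": "lib"},          # assumes the code lives under lib/
    packages=find_packages("lib"),
    install_requires=["matplotlib", "scikit-image"],
)

With a file like this in place, the make docker-setup step could simply run something like pip3 install . in the cloned repository, and the CMD directive would no longer need to touch PYTHONPATH at all.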

Building a new image

Let's build a new image! First, run the following code:

$ docker build -t yourname/eros ./docker/simple/

The -t parameter instructs docker to tag the image with the provided name once it's built. The prefix before / should match the name used on Docker Hub if you're going to publish the image there. This can be a username or an organization.

Once you execute the preceding command, you will see the following output:

Sending build context to Docker daemon  2.56 kB
Sending build context to Docker daemon
Step 0 : FROM masteringmatplotlib/scipy
 ---> 113395173d25
Step 1 : MAINTAINER Py3 Hacker <[email protected]>
 ---> Using cache
 ---> fd520c92b33b

[snip]

Removing intermediate container 90983e9fdd54
Step 6 : CMD PYTHONPATH=$HOME/$REPO/lib:$PYTHONPATH python3
 ---> Running in b7a022f2ac29
 ---> abde2bb0eeaa
Removing intermediate container b7a022f2ac29
Successfully built abde2bb0eeaa

Let's make sure that the library is present in our new image by using the -i option for docker run to indicate that we will need an interactive session with the container (this keeps STDIN open):

$ docker run -t -i yourname/eros python3
>>> import eros
>>> ^D
$

Looks like our simple image built on top of masteringmatplotlib/scipy worked like a charm. Now, let's make some changes to it.

Preparing for deployment

We need to make a couple of changes to the simple case so that it fulfills the following conditions:

  • Our code will know that it's being called from Docker (this is used to set the backend to something that doesn't require the DISPLAY environment variable)
  • We can execute a dispatch function, which will generate the desired type of satellite image

Both of the preceding conditions can be fulfilled simply by changing the Docker CMD directive in the following way:

In [14]: cat ../docker/eros/Dockerfile
FROM masteringmatplotlib/scipy
MAINTAINER Py3 Hacker <[email protected]>
ENV HOME /root
ENV REPO cloud-deploy
RUN cd $HOME && \
    git clone https://github.com/masteringmatplotlib/${REPO}.git
RUN cd $HOME/$REPO && \
    make docker-setup
CMD DOCKER_CONTAINER=true \
    PYTHONPATH=${HOME}/${REPO}/lib:$PYTHONPATH \
    python3 -c "import eros;eros.s3_generate_image();"

The s3_generate_image function is the dispatcher, and depending upon the environment variables that are set when running Docker, it will take different actions. We will discuss this more in a later section.

Getting the setup on AWS

Having prepared the local machine to create the Docker images that we will use in the Cloud, we now need to set up the other end—getting the Cloud ready for our images. In the following sections, we will copy the Landsat image data to a remote storage service, create a virtual machine in the Cloud that will be the host OS for the Docker images, and finally ensure that we can read and write data in our images to and from the storage service.

Pushing the source data to S3

The Landsat 8 data files that we are working with are sizable, with each file ranging from about 150 MB to 600 MB. As such, we want to be selective with regard to what we'll be pushing to S3. For your project, the following Landsat bands are needed:

  • Coastal/aerosol (band 1)
  • Red, green, and blue (bands 4, 3, and 2)
  • SWIR, 2100-2300 nm (band 7)
  • NIR, 845-885 nm (band 5)

All the files for a particular scene weigh in at over 2 GB, so we'll push only the files for the bands noted in the preceding list. Start by defining the following shell variables:

$ SCENE_PATH="/EROSData/L8_OLI_TIRS"
$ SCENE=LC82260102014232LGN00

The files that we need to upload can be identified with the help of the following code:

$ find $SCENE_PATH/$SCENE \
    -name "*_B[1-5,7].TIF" \
    -exec basename {} \;
LC82260102014232LGN00_B1.TIF
LC82260102014232LGN00_B2.TIF
LC82260102014232LGN00_B3.TIF
LC82260102014232LGN00_B4.TIF
LC82260102014232LGN00_B5.TIF
LC82260102014232LGN00_B7.TIF

Before running the following commands, you need to make sure that the user associated with the access and secret keys has the appropriate S3 permissions (for example, the ability to upload files). This is done on the IAM screen of the AWS Management Console, using whichever combination of users, groups, roles, and policies you prefer.

Let's start by setting some AWS shell variables in a terminal window on your local machine by using the following code:

$ export AWS_ACCESS_KEY_ID=YOURACCESSKEY
$ export AWS_SECRET_ACCESS_KEY=YOURSECRETKEY

These will be used by the aws command-line utility, which was installed when you ran the make command in the IPython Notebook repository at the beginning of the chapter. Let's also set a bucket name variable by using the following code:

$ S3_BUCKET=scoresbysund

Note that Amazon S3 bucket names are globally unique, much like DNS names. As such, this bucket may already exist, so be ready with an alternate name.

Now we can create the S3 bucket in the following way:

$ aws s3 mb s3://$S3_BUCKET

With the new bucket in place, we can now upload the selected Landsat 8 scene files:

$ for FILE in "$SCENE_PATH"/$SCENE/*_B[2-5,7].TIF
    do
    aws s3 cp "$FILE" s3://$S3_BUCKET
    done

That's a total of about 888 MB. So, depending on the upload speed of your Internet connection, you may be in for a wait.
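
If you would rather script the upload from Python than use the shell loop, the equivalent is only a few lines. The following is a minimal sketch that assumes the boto3 library is installed and that the AWS credential variables exported earlier are still set in your environment:

import glob
import os

import boto3  # assumed to be installed; it reads the AWS_* variables from the environment

SCENE_PATH = "/EROSData/L8_OLI_TIRS"
SCENE = "LC82260102014232LGN00"
S3_BUCKET = "scoresbysund"

s3 = boto3.client("s3")
# Bands 2-5 and 7, mirroring the shell glob used above.
pattern = os.path.join(SCENE_PATH, SCENE, "*_B[2-57].TIF")
for path in sorted(glob.glob(pattern)):
    key = os.path.basename(path)
    print("Uploading {} ...".format(key))
    s3.upload_file(path, S3_BUCKET, key)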

The files that the satellite image processing task will need have been uploaded. The next step is to set up a server on which the Docker container tasks will run.

Creating a host server on EC2

In previous testing, you discovered that an m3.xlarge EC2 instance, with its 15 GB of RAM, is required due to the memory-intensive nature of the image-processing task; the next instance size down, with 7.5 GB of RAM, generated out-of-memory errors, indicating that it did not have enough RAM for the job.
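
To see why so much memory is needed, a rough back-of-the-envelope calculation helps. The figures below are illustrative assumptions derived from the file sizes mentioned earlier, not measurements from the actual task:

# Illustrative arithmetic only; actual scene dimensions and sizes vary.
band_file_mb = 600        # upper end of the per-band file sizes noted earlier
float64_expansion = 4     # uint16 pixels (2 bytes) promoted to float64 (8 bytes)
bands_in_composite = 3    # for example, the red, green, and blue bands

bands_only_gb = band_file_mb * float64_expansion * bands_in_composite / 1024
print("Band arrays alone: ~{:.1f} GB".format(bands_only_gb))  # roughly 7 GB

# The intermediate arrays created while rescaling and stacking the composite
# add to this, which is why a 7.5 GB instance runs out of memory.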

To create an EC2 instance on AWS, perform the following steps:

  1. Log in to the AWS console and click on the Launch Instance button.
  2. Select your preferred operating system image (for example, Red Hat, SUSE, Ubuntu, and so on). We will use a 64-bit Ubuntu Amazon Machine Image (AMI) on an instance type with 4 virtual CPUs and 15 GB of RAM.
  3. Select or create the security group that will allow an in-bound Secure Shell (SSH) access (port 22) to the EC2 instance from your workstation.
  4. Launch the EC2 instance.

The following screenshot shows the Review Instance Launch step:

Once the instance is up and running, get its IP address from the AWS Management Console and use it to SSH into the instance:

$ ssh -i /path/to/your-ec2-key-pair.pem \
    ubuntu@instance-ip-address

Once you have SSHed into the running EC2 instance, prep the instance by installing Docker and saving your AWS credentials to the filesystem. You will need access to these credentials when you start up the Docker containers so that the Python script in the container can read from and write to S3:

ubuntu@ip-address:~$ sudo apt-get install -y docker.io
ubuntu@ip-address:~$ sudo mkdir /etc/aws
ubuntu@ip-address:~$ sudo vi /etc/aws/access
ubuntu@ip-address:~$ sudo vi /etc/aws/secret
ubuntu@ip-address:~$ sudo chmod 600 /etc/aws/*
ubuntu@ip-address:~$ sudo chown ubuntu /etc/aws/*

Using Docker on EC2

Now, you need to pull down the Docker image that we've created for this task and then run a container from this image in interactive mode, with Python as the shell. You can use either the Docker image that you created or the one that we created (masteringmatplotlib/eros):

ubuntu@ip-address:~$ sudo docker run -i \
    -t masteringmatplotlib/eros python3
Python 3.4.3 (default, Feb 27 2015, 02:35:54)
[GCC 4.9.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

This command attempts to run the masteringmatplotlib/eros Docker image, which won't be present on your EC2 instance, so Docker will download it from Docker Hub (along with all the images upon which it is built, including masteringmatplotlib/scipy). Once this finishes, do a quick test to make sure that everything is in place:

>>> import eros
>>>

You should get no errors. This will indicate that you are all set for the next step!

Reading and writing with S3

In order to read your scene data from the files that you uploaded to S3, you'll need to do the following:

  1. Update your bucket permissions with a policy that allows your EC2 instance to access it.
  2. Obtain the HTTP URL for your bucket on the S3 screen in the AWS Console.
  3. Keep the IP address of your newly started EC2 instance handy.

The easiest way to read data from S3 on EC2 is to open the HTTP URL for the file in question. To allow this, you need to do the following:

  1. Go to the S3 section in the AWS Console.
  2. Click on the bucket that you will be using.
  3. In the new page that loads, click on Properties.
  4. In the Properties section, click on Edit bucket policy.
  5. In the form that appears, paste the following, substituting your EC2 IP address:
    {
        "Version": "2012-10-17",
        "Id": "S3ScoresbySundGetPolicy",
        "Statement": [
            {
                "Sid": "IPAllow",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": "arn:aws:s3:::scoresbysund/*",
                "Condition" : {
                    "IpAddress" : {
                        "aws:SourceIp": "YOUREC2IPADDRESS/32" 
                    }
                } 
            } 
        ]
    }

    The following screenshot shows the code pasted in the Bucket Policy Editor:

With this change, your storage dependencies are now complete and your scripts on the EC2 instance, which use the appropriate AWS access credentials, will be able to read from and write to S3.
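
With the policy in place, a quick way to confirm that the instance can actually read the data is to fetch the start of one band file over HTTP from Python. This is a minimal sketch that assumes the path-style us-west-2 URL used later in this chapter:

from urllib.request import urlopen

S3_PATH = "https://s3-us-west-2.amazonaws.com/scoresbysund"
SCENE = "LC82260102014232LGN00"

# Fetch the first few bytes of the red band to confirm that the bucket
# policy allows reads from this EC2 instance.
url = "{}/{}_B4.TIF".format(S3_PATH, SCENE)
with urlopen(url) as response:
    magic = response.read(4)

# A TIFF file starts with b"II*\x00" (little-endian) or b"MM\x00*" (big-endian).
print(url, "->", magic)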

Running the task

As is evident from the work we have done so far, processing in remote environments requires a great deal of preparation and attention to detail. The way is now clear for our matplotlib satellite image generation task. We are ready to pass parameters to Docker, which will let us flexibly handle many tasks with one image. We will first tweak the Python code to handle these parameters and then, finally, execute our task.

Environment variables and Docker

When the EC2 instance starts up a Docker container that has to build images from the Landsat 8 data files, the Docker container will need to know a few things, which are as follows:

  • The Landsat 8 scene ID
  • The AWS access key that is used to access S3
  • The AWS secret key that is used to access S3

We can pass this information to the Docker container by using the -e flag, which will set the environment variables in the container once it starts. Before we try using this in a script, let's make sure that the feature behaves according to our expectations by starting up a Docker container in the EC2 instance in the following way:

ubuntu@ip-address:~$ sudo docker run -i \
  -e "PYTHONPATH=/root/cloud-deploy" \
  -e "EROS_SCENE_ID=LC82260102014232LGN00" \
  -e "AWS_ACCESS_KEY_ID=`cat /etc/aws/access`" \
  -e "AWS_SECRET_ACCESS_KEY=`cat /etc/aws/secret`" \
  -t masteringmatplotlib/eros \
  python3

This will drop us into a Python prompt in the container, where we can check out the environment variables:

>>> import os
>>> os.getenv("EROS_SCENE_ID")
'LC82260102014232LGN00'
>>> os.getenv("AWS_ACCESS_KEY_ID")
'YOURACCESSKEY'
>>> os.getenv("AWS_SECRET_ACCESS_KEY")
'YOURSECRETKEY'

Everything worked perfectly, just as one might have expected.

Changes to the Python module

Now that you've confirmed that you can get the data that you need into the Docker containers, you can update your code to check for some specific data that you will set when running a container to generate satellite images. For instance, towards the beginning of the lib/ec2s3eros.py module, we have the following:

bucket_name = os.environ.get("S3_BUCKET_NAME")
scene_id = os.environ.get("EROS_SCENE_ID")
s3_path = os.environ.get("S3_PATH")
s3_title = os.environ.get("S3_IMAGE_TITLE")
s3_filename = os.environ.get("S3_IMAGE_FILENAME")
s3_image_type = os.environ.get("S3_IMAGE_TYPE", "").lower()
access_key = os.environ.get("AWS_ACCESS_KEY_ID")
secret_key = os.environ.get("AWS_SECRET_ACCESS_KEY")

The preceding values are what the code uses to create the appropriate image and save it to the right place under the expected name. You can see this clearly if you scroll towards the end of the file. Here's an example of one of these variables being used to dispatch the appropriate image-generating function:

def s3_generate_image():
    if s3_image_type == "rgb":
        s3_image_rgb()
    elif s3_image_type == "swir2nirg":
        s3_image_swir2nirg()

There's another important change that we had to make. In order for matplotlib to run successfully on EC2, we need to set an explicit backend. The matplotlib module is only smart enough to choose a backend based on the operating system. As it has been designed for use with GUIs, it assumes not only that you have a DISPLAY environment variable set but, more importantly, that there is an actual display to which this variable points.

On EC2 and in other Cloud environments, this will almost never be the case. If you look at the top of the lib/ec2s3eros.py module, you will see the following:

import matplotlib as mpl
if os.environ.get("DOCKER_CONTAINER") == "true":
    mpl.use("Agg")

The environment variable that you see in the preceding code is the one that is set in the CMD directive of the Dockerfile:

CMD DOCKER_CONTAINER=true \
    PYTHONPATH=${HOME}/${REPO}/lib:$PYTHONPATH \
    python3 -c "import eros;eros.s3_generate_image();"

As you can see in the module, we use this environment variable to determine whether the code is running in a Docker container (outside Docker, the variable is not set). If it is, we explicitly set the backend to one that will not throw errors when there is no display.

In the preceding example, we have done this in the code, since the file already existed and it was just a two-line change. However, we could also provide a custom matplotlibrc file that sets the default backend. In the long term, this is probably the better approach for the following reasons:

  • The new file will only need to be created once in the Dockerfile that installs matplotlib (for us, this was the one that generated the masteringmatplotlib/scipy Docker image)
  • The images that extend that one will then benefit from the presence of the matplotlibrc file, and you will not need to make any code changes to run in virtualized environments.

Subsequently, the developer and user experience for these Docker images will be greatly improved. The administrators who are responsible for the creation of new images with these as the basis will have less work to do and the users will have one less error to face when getting started.
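
If you do take the matplotlibrc route, it is easy to check from inside a container which configuration file matplotlib picked up and which backend it settled on. Here is a quick sketch, assuming nothing beyond a stock matplotlib installation:

import matplotlib

# Report the matplotlibrc file that was loaded and the active backend.
# With an rc file baked into the image that sets "backend : Agg", the
# second line should report the Agg backend without any code changes.
print(matplotlib.matplotlib_fname())
print(matplotlib.get_backend())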

Back on your workstation, having made the necessary changes to your custom Dockerfile, you can now create an updated version of your Docker image with the help of the following code:

$ docker build -t yourname/eros ./docker/eros/

Next, you'll need to publish the image to Docker Hub so that you can pull it down on your EC2 instance:

$ docker push yourname/eros

On your EC2 instance, get the latest version of yourname/eros that you just published:

ubuntu@ip-address:~$ sudo docker pull yourname/eros

With the last step, everything is now in place and your jobs are ready to be executed.

Execution

At this point, you can run a Docker container from your latest Docker image to generate a file for the RGB satellite image data by using the following code:

ubuntu@ip-address:~$ export S3_BUCKET=scoresbysund
ubuntu@ip-address:~$ export SCENE=LC82260102014232LGN00
ubuntu@ip-address:~$ export IMGTYPE=rgb
ubuntu@ip-address:~$ sudo docker run \
  -e "S3_IMAGE_TITLE=RGB Image: Scene $SCENE" \
  -e "S3_IMAGE_TYPE=$IMGTYPE" \
  -e "S3_IMAGE_FILENAME=$SCENE-$IMGTYPE-`date "+%Y%m%d%H%M%S"`.png" \
  -e "S3_BUCKET_NAME=$S3_BUCKET" \
  -e "S3_PATH=https://s3-us-west-2.amazonaws.com/$S3_BUCKET" \
  -e "EROS_SCENE_ID=$SCENE" \
  -e "AWS_ACCESS_KEY_ID=`cat /etc/aws/access`" \
  -e "AWS_SECRET_ACCESS_KEY=`cat /etc/aws/secret`" \
  -t yourname/eros

As the task runs, you will see the following output:

Generating scene image ...
Saving image to S3 ...
0.0/100
27.499622350156933/100
54.99924470031387/100
82.4988670504708/100
100.0/100

Remember that on a relatively modern iMac, this job took about 7 to 8 minutes. Executing it on EC2 just now only took about 15 seconds.

For the false-color short-wave and the IR image, you can run a similar command, as follows:

ubuntu@ip-address:~$ export IMGTYPE=swir2nirg
ubuntu@ip-address:~$ sudo docker run \
  -e "S3_IMAGE_TITLE=False-Color Image: Scene $SCENE" \
  -e "S3_IMAGE_TYPE=$IMGTYPE" \
  -e "S3_IMAGE_FILENAME=$SCENE-$IMGTYPE-`date "+%Y%m%d%H%M%S"`.png" \
  -e "S3_BUCKET_NAME=$S3_BUCKET" \
  -e "S3_PATH=https://s3-us-west-2.amazonaws.com/$S3_BUCKET" \
  -e "EROS_SCENE_ID=$SCENE" \
  -e "AWS_ACCESS_KEY_ID=`cat /etc/aws/access`" \
  -e "AWS_SECRET_ACCESS_KEY=`cat /etc/aws/secret`" \
  -t yourname/eros
Generating scene image ...
Saving image to S3 ...
0.0/100
26.05542715379233/100
52.11085430758466/100
78.16628146137698/100
100.0/100

You can confirm that both of the images have been saved to your bucket by refreshing the S3 screen in your AWS Console.

Though it may seem awkward to parameterize the Docker container with so many environment variables, this allows you to easily change the data that you pass without having to regenerate the Docker image. Your Docker image produces containers that are generally useful for the task at hand (and potentially many other tasks), allowing you to process any scene without any code changes.
