The rest of this chapter is dedicated to running the matplotlib USGS/EROS image generation task in AWS using EC2, S3, and Docker. This requires two stages of preparation: work that needs to be done locally, and setup that needs to happen in the Cloud. With these complete, we will be ready to execute our prepared task.
Your local setup will include an installation of Docker (and boot2docker if you are using Mac or Windows). You will create or download Dockerfiles, generate images from these files, extend the base images as necessary, and start up a Docker container to ensure that everything is in working order.
Here's what you will need for the remainder of this chapter:
boot2docker (for easily using Docker from Windows or Mac)
If you're running Linux, you can skip the rest of this section. If you haven't run boot2docker before, you'll need to run the following command first:
$ boot2docker init
If you have previously initialized boot2docker, you can just run the following command:
$ boot2docker up
At this point, you will see an output that looks like the following:
Waiting for VM and Docker daemon to start......ooo
Started.
Writing ~/.boot2docker/certs/boot2docker-vm/ca.pem
Writing ~/.boot2docker/certs/boot2docker-vm/cert.pem
Writing ~/.boot2docker/certs/boot2docker-vm/key.pem
To connect the Docker client to the Docker daemon, please set:
    export DOCKER_CERT_PATH=~/.boot2docker/certs/boot2docker-vm
    export DOCKER_TLS_VERIFY=1
    export DOCKER_HOST=tcp://192.168.59.103:2376
You can either manually export the environment variables, or run the following code from your shell prompt:
$ $(boot2docker shellinit)
The preceding code will set the appropriate variables in your shell environment for you automatically. At this point, Docker is ready for use.
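If you ever need the same variables inside a Python script rather than in your shell, the export lines can be parsed directly. The following is only a minimal sketch: the parse_exports helper is a hypothetical name introduced here for illustration, and the sample text is copied from the output shown earlier.

```python
import re

# Hypothetical helper (not part of boot2docker or Docker): parses
# `export NAME=value` lines, such as those printed by `boot2docker up`.
EXPORT_LINE = re.compile(r"^\s*export\s+([A-Za-z_][A-Za-z0-9_]*)=(.*)$")


def parse_exports(text):
    """Return a dict of the variables named in `export NAME=value` lines."""
    variables = {}
    for line in text.splitlines():
        match = EXPORT_LINE.match(line)
        if match:
            name, value = match.groups()
            variables[name] = value.strip()
    return variables


sample_output = """\
To connect the Docker client to the Docker daemon, please set:
    export DOCKER_CERT_PATH=~/.boot2docker/certs/boot2docker-vm
    export DOCKER_TLS_VERIFY=1
    export DOCKER_HOST=tcp://192.168.59.103:2376
"""

docker_env = parse_exports(sample_output)
for name in sorted(docker_env):
    print(name, "=", docker_env[name])
```

Note that the sketch leaves `~` unexpanded; a script that uses the certificate path would need to pass it through os.path.expanduser first.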
The heart of configuration management is the Dockerfile. This will be used to generate the Docker image that you need to run the Docker containers, which is where your matplotlib tasks will actually happen. If you are unfamiliar with Docker, here's a quick summary of how to think about the components that we have just mentioned:
One of the features of Docker is that, through its underlying use of a union file system, you can layer images, starting with a base image and adding increasingly specific images until the desired configuration state is achieved. This is exactly what we will do. The company that you work for, as stated earlier, is a Python 3 shop. So, they've built a Docker image that has all the basic goodies for Python 3 on Ubuntu 15.04. Furthermore, since the research and computation groups make heavy use of NumPy, SciPy, Pandas, and matplotlib, a second image has been created by using the Python 3 image as a base.
Here's what the Python 3 Dockerfile looks like:
In [11]: cat ../docker/python/Dockerfile
FROM ubuntu:vivid
MAINTAINER Duncan McGreggor <[email protected]>
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update
RUN apt-get upgrade -y
RUN apt-get install -y -q apt-utils
RUN apt-get install -y -q ca-certificates git build-essential
RUN apt-get install -y -q libssl-dev libcurl4-openssl-dev
RUN apt-get install -y -q curl
RUN apt-get install -y -q cython3 libpython3.4-dev python3.4-dev python3-setuptools python3-pip
CMD python3
Note that this Dockerfile has not been created from scratch. Rather, it is based on another Docker image: the official ubuntu:vivid image. It names a maintainer and sets an environment variable, which will be available to each of the RUN and CMD directives as well as when a container is running (with or without an interactive session). Each of the RUN commands is executed when building the Docker image. The CMD command is what will be run by default when executing docker run on the command line.
This Dockerfile has been used to generate an image, which has been published to Docker Hub with the masteringmatplotlib/python tag. As such, you will not need to build this yourself.
The next Dockerfile that we will look at is the one that your group uses for the majority of its scientific computing tasks. Here is that Dockerfile:
In [12]: cat ../docker/scipy/Dockerfile
FROM masteringmatplotlib/python
MAINTAINER Duncan McGreggor <[email protected]>
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get install -y -q libatlas3-base libblas-dev libblas3 libatlas-base-dev libatlas-dev liblapack-dev gfortran
ENV LAPACK /usr/lib/liblapack.so
ENV ATLAS /usr/lib/libatlas.so
ENV BLAS /usr/lib/libblas.so
RUN apt-get install -y -q python3-six python3-flake8 python3-dateutil python3-pyparsing python3-numpy python3-scipy python3-matplotlib python3-pandas
RUN pip3 install seaborn
CMD python3
In this case, the Dockerfile is based on the Python 3 image. It extends that image by installing the libraries that are commonly needed for scientific computing in Python. The Dockerfile is used to create an image, which has been pushed to Docker Hub using the masteringmatplotlib/scipy tag. This is the one that we will be extending for our task.
The preceding scipy Docker image has almost everything we need. It's just missing a few dependencies, which are available in this chapter's Git repository. These dependencies include the following:
scikit-image library
So, how can we customize the scipy image to include the preceding dependencies? There are two ways to do this:
A Dockerfile that is based on the image
We will use the second option so that we are able to easily track changes in the source code of the Dockerfile. We've provided the following file in the notebook repository:
In [13]: cat ../docker/simple/Dockerfile
FROM masteringmatplotlib/scipy
MAINTAINER Py3 Hacker <[email protected]>
ENV HOME /root
ENV REPO cloud-deploy
RUN cd $HOME && git clone https://github.com/masteringmatplotlib/${REPO}.git
RUN cd $HOME/$REPO && make docker-setup
CMD PYTHONPATH=$HOME/$REPO/lib:$PYTHONPATH python3
Points to note:
This Dockerfile extends the masteringmatplotlib/scipy Docker image.
The CMD directive sets PYTHONPATH explicitly. In most situations, you would instead create a setup.py file for your Python library and install it with pip in the Dockerfile build steps, so that you don't have to mess with PYTHONPATH when running your commands in the Dockerfile.
Let's build a new image! First, run the following command:
$ docker build -t yourname/eros ./docker/simple/
The -t parameter instructs docker to tag the image with the provided name once it's built. The prefix before / should match the name used on Docker Hub if you're going to publish the image there. This can be a username or an organization.
Once you execute the preceding command, you will see output similar to the following:
Sending build context to Docker daemon 2.56 kB
Sending build context to Docker daemon
Step 0 : FROM ipython/scipystack
 ---> 113395173d25
Step 1 : MAINTAINER Py Hacker <[email protected]>
 ---> Using cache
 ---> fd520c92b33b
[snip]
Removing intermediate container 90983e9fdd54
Step 6 : CMD PYTHONPATH=./cloud-deploy/lib:$PYTHONPATH python3
 ---> Running in b7a022f2ac29
 ---> abde2bb0eeaa
Removing intermediate container b7a022f2ac29
Successfully built abde2bb0eeaa
Let's make sure that the library is present in our new image by using the -i option for docker run to indicate that we will need an interactive session with the container (this keeps STDIN open), together with -t to allocate a terminal:
$ docker run -t -i yourname/eros python3
>>> import eros
>>> ^D
$
Looks like our simple image that was built on top of masteringmatplotlib/scipy worked like a charm. Now, let's make some changes to it.
We need to make a couple of changes to the simple case so that it fulfills the following conditions:
It runs matplotlib without a display (a container has no DISPLAY environment)
It calls the dispatch function, which will generate the desired type of satellite image
Both of the preceding conditions can be fulfilled simply by changing the Docker CMD directive in the following way:
In [14]: cat ../docker/eros/Dockerfile
FROM masteringmatplotlib/scipy
MAINTAINER Py3 Hacker <[email protected]>
ENV HOME /root
ENV REPO cloud-deploy
RUN cd $HOME && git clone https://github.com/masteringmatplotlib/${REPO}.git
RUN cd $HOME/$REPO && make docker-setup
CMD DOCKER_CONTAINER=true PYTHONPATH=${HOME}/${REPO}/lib:$PYTHONPATH python3 -c "import eros;eros.s3_generate_image();"
The s3_generate_image function is the dispatcher, and depending upon the environment variables that are set when running Docker, it will take different actions. We will discuss this more in a later section.
Having prepared the local machine to create the Docker images that we will use in the Cloud, we now need to set up the other end—getting the Cloud ready for our images. In the following sections, we will copy the Landsat image data to a remote storage service, create a virtual machine in the Cloud that will be the host OS for the Docker images, and finally ensure that we can read and write data in our images to and from the storage service.
The Landsat 8 data files that we are working with are sizable, with each file ranging from about 150 MB to 600 MB. As such, we want to be selective with regard to what we'll be pushing to S3. For your project, the following Landsat bands are needed:
All the files for a particular scene weigh over 2 GB, so we'll just want to push the files for the bands that were noted in the preceding section. First, define the following shell variables:
$ SCENE_PATH="/EROSData/L8_OLI_TIRS"
$ SCENE=LC82260102014232LGN00
With these variables set, the files that we need to upload can be identified with the help of the following code:
$ find $SCENE_PATH/$SCENE -name "*_B[1-5,7].TIF" -exec basename {} \;
LC82260102014232LGN00_B1.TIF
LC82260102014232LGN00_B2.TIF
LC82260102014232LGN00_B3.TIF
LC82260102014232LGN00_B4.TIF
LC82260102014232LGN00_B5.TIF
LC82260102014232LGN00_B7.TIF
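The same selection can be expressed in Python, which may be handy if you later script the upload. This is a minimal sketch, not code from the chapter's repository: select_band_files is a hypothetical helper, the band set simply mirrors the find pattern above, and the directory listing is invented for illustration.

```python
import re

# Band numbers mirror the find pattern above; adjust them to your needs.
WANTED_BANDS = {1, 2, 3, 4, 5, 7}

# Band files are assumed to be named <scene id>_B<band number>.TIF
BAND_FILE = re.compile(r"^(?P<scene>[A-Z0-9]+)_B(?P<band>\d+)\.TIF$")


def select_band_files(filenames, scene_id, bands=WANTED_BANDS):
    """Return the file names for `scene_id` whose band number is in `bands`."""
    selected = []
    for name in filenames:
        match = BAND_FILE.match(name)
        if not match:
            continue  # not a numbered band file (for example, metadata)
        if match.group("scene") == scene_id and int(match.group("band")) in bands:
            selected.append(name)
    return sorted(selected)


scene = "LC82260102014232LGN00"
# Hypothetical directory listing, used only for illustration.
listing = [scene + "_B%d.TIF" % n for n in range(1, 12)]
listing += [scene + "_BQA.TIF", scene + "_MTL.txt"]

for name in select_band_files(listing, scene):
    print(name)
```

Matching on the parsed band number rather than on a glob avoids accidentally picking up two-digit bands such as B10 or B11.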
Before running the following commands, you need to make sure that the user associated with the access and secret keys has the appropriate S3 permissions (for example, the ability to upload the files). This is done in the AWS Management Console in the IAM screen through various means (your preference with regard to the combination of users, groups, roles, and policies).
Let's start by setting some AWS shell variables in a terminal window on your local machine by using the following code:
$ export AWS_ACCESS_KEY_ID=YOURACCESSKEY
$ export AWS_SECRET_ACCESS_KEY=YOURSECRETKEY
These will be used by the aws command-line utility, which was installed when you ran the make command in the IPython Notebook repository at the beginning of the chapter. Let's also set a bucket name variable by using the following code:
$ S3_BUCKET=scoresbysund
Note that the Amazon S3 bucket names are global like DNS. As such, this bucket may already exist. So, be ready with an alternate name.
Now we can create the S3 bucket in the following way:
$ aws s3 mb s3://$S3_BUCKET
With the new bucket in place, we can now upload the selected Landsat 8 scene files:
$ for FILE in "$SCENE_PATH"/$SCENE/*_B[2-5,7].TIF
  do
    aws s3 cp "$FILE" s3://$S3_BUCKET
  done
That's a total of about 888 MB. So, depending on the upload speed of your Internet connection, you may be in for a wait.
The files that the satellite image processing task will need have been uploaded. The next step is to set up a server on which the Docker container tasks will run.
In previous testing, you discovered that an m3.xlarge EC2 instance, with its 15 GB of RAM, is required because of the memory-intensive nature of the image processing task. The next size down, an instance with 7.5 GB of RAM, generated out-of-memory errors, indicating that its RAM was insufficient.
To create an EC2 instance on AWS, perform the following steps:
The following screenshot shows the Review Instance Launch step:
Once the instance is up and running, get the IP address from the AWS Management Console from where you launched it and use it to SSH into it:
$ ssh -i /path/to/your-ec2-key-pair.pem ubuntu@instance-ip-address
Once you have connected to the running EC2 instance over SSH, prep the instance by installing Docker and saving your AWS credentials on the filesystem. You will need access to these credentials when you start up the Docker containers so that the Python script in the container can read from and write to S3:
ubuntu@ip-address:~$ sudo apt-get install -y docker.io
ubuntu@ip-address:~$ sudo mkdir /etc/aws
ubuntu@ip-address:~$ sudo vi /etc/aws/access
ubuntu@ip-address:~$ sudo vi /etc/aws/secret
ubuntu@ip-address:~$ sudo chmod 600 /etc/aws/*
ubuntu@ip-address:~$ sudo chown ubuntu /etc/aws/*
Now, you need to pull down the Docker image that we've created for this task and then run a container from this image in interactive mode, with Python as the shell. You can use either the Docker image that you created or the one that we did (masteringmatplotlib/eros):
ubuntu@ip-address:~$ sudo docker run -i -t masteringmatplotlib/eros python3
Python 3.4.3 (default, Feb 27 2015, 02:35:54)
[GCC 4.9.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
This will attempt to run the masteringmatplotlib/eros Docker image, which won't be present on your EC2 instance. So, Docker will download it from Docker Hub (along with all the images upon which it is built). Once this finishes, do a quick test to make sure that everything is in place:
>>> import eros
>>>
You should get no errors. This will indicate that you are all set for the next step!
In order to read your scene data from the files that you uploaded to S3, you'll need to do the following:
The easiest way to read data from S3 in EC2 is to open the HTTP URL for the file in question. To do this, you need to do the following:
{
  "Version": "2012-10-17",
  "Id": "S3ScoresbySundGetPolicy",
  "Statement": [
    {
      "Sid": "IPAllow",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::scoresbysund/*",
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": "YOUREC2IPADDRESS/32"
        }
      }
    }
  ]
}
The following screenshot shows the code pasted in the Bucket Policy Editor:
With this change, your storage dependencies are now complete and your scripts on the EC2 instance, which use the appropriate AWS access credentials, will be able to read from and write to S3.
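If you manage more than one bucket or instance, you may prefer to generate the policy document rather than edit it by hand. The following is a minimal sketch that reproduces the structure of the policy shown earlier; bucket_policy is a hypothetical helper, and the IP address is a documentation placeholder rather than a real instance address.

```python
import json


def bucket_policy(bucket, ip_address, policy_id="S3GetPolicy"):
    """Build a bucket policy that allows S3 actions from a single IP address."""
    return {
        "Version": "2012-10-17",
        "Id": policy_id,
        "Statement": [
            {
                "Sid": "IPAllow",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:*",
                # Applies to every object in the bucket.
                "Resource": "arn:aws:s3:::%s/*" % bucket,
                # /32 restricts the rule to exactly one address.
                "Condition": {
                    "IpAddress": {"aws:SourceIp": "%s/32" % ip_address}
                },
            }
        ],
    }


# Placeholder values; substitute your own bucket name and instance address.
policy = bucket_policy("scoresbysund", "203.0.113.10")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the Bucket Policy Editor. Since the policy allows every S3 action, you may want to narrow the Action value if your task only needs to read and write objects.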
As is evident from the work that we have done so far, processing in remote environments requires a great deal of preparation and attention to detail. We have now cleared the way for our matplotlib satellite image generation task. We are ready to pass parameters to Docker, which will let us flexibly handle many tasks with one image. We will first tweak the Python code to handle these parameters and then execute our task.
When the EC2 instance starts up a Docker container that has to build images from the Landsat 8 data files, the Docker container will need to know a few things, which are as follows:
We can pass this information to the Docker container by using the -e flag, which will set the environment variables in the container once it starts. Before we try using this in a script, let's make sure that the feature behaves according to our expectations by starting up a Docker container in the EC2 instance in the following way:
ubuntu@ip-address:~$ sudo docker run -i \
    -e "PYTHONPATH=/root/cloud-deploy" \
    -e "EROS_SCENE_ID=LC82260102014232LGN00" \
    -e "AWS_ACCESS_KEY_ID=`cat /etc/aws/access`" \
    -e "AWS_SECRET_ACCESS_KEY=`cat /etc/aws/secret`" \
    -t masteringmatplotlib/eros python3
This will drop us into a Python prompt in the container, where we can check out the environment variables:
>>> import os
>>> os.getenv("EROS_SCENE_ID")
'LC82260102014232LGN00'
>>> os.getenv("AWS_ACCESS_KEY_ID")
'YOURACCESSKEY'
>>> os.getenv("AWS_SECRET_ACCESS_KEY")
'YOURSECRETKEY'
Everything worked perfectly, just as one might have expected.
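When the list of variables grows, it can be easier to assemble the docker run arguments programmatically. This is a minimal sketch under stated assumptions: docker_run_command is a hypothetical helper, it only builds and prints the command rather than executing it, and the credentials are deliberately left out of the example dictionary.

```python
import shlex


def docker_run_command(image, env, command=None, interactive=False):
    """Build the argument list for `docker run` with one -e flag per variable."""
    args = ["docker", "run"]
    if interactive:
        args += ["-i", "-t"]
    for name in sorted(env):
        args += ["-e", "%s=%s" % (name, env[name])]
    args.append(image)
    if command:
        args += list(command)
    return args


env = {
    "EROS_SCENE_ID": "LC82260102014232LGN00",
    "PYTHONPATH": "/root/cloud-deploy",
}
argv = docker_run_command(
    "masteringmatplotlib/eros", env, ["python3"], interactive=True)

# Quote each argument so that the printed line is safe to paste into a shell.
print(" ".join(shlex.quote(arg) for arg in argv))
```

Keeping the arguments as a list means that the same value could later be handed to subprocess without any shell quoting concerns.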
Now that you've confirmed that you can get the data that you need into the Docker containers, you can update your code to check for some specific data that you will set when running a container to generate satellite images. For instance, towards the beginning of the lib/ec2s3eros.py module, we have the following:
bucket_name = os.environ.get("S3_BUCKET_NAME")
scene_id = os.environ.get("EROS_SCENE_ID")
s3_path = os.environ.get("S3_PATH")
s3_title = os.environ.get("S3_IMAGE_TITLE")
s3_filename = os.environ.get("S3_IMAGE_FILENAME")
s3_image_type = os.environ.get("S3_IMAGE_TYPE", "").lower()
access_key = os.environ.get("AWS_ACCESS_KEY_ID")
secret_key = os.environ.get("AWS_SECRET_ACCESS_KEY")
The preceding variables are what the code will use to create the appropriate image and save it to the right place with the expected name. You can see this clearly if you scroll towards the end of the file. Here's an example of one of these variables being used to dispatch the appropriate image-generating function:
def s3_generate_image():
    if s3_image_type == "rgb":
        s3_image_rgb()
    elif s3_image_type == "swir2nirg":
        s3_image_swir2nirg()
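Note that the dispatcher above silently does nothing when the image type is unset or misspelled. One possible alternative, shown here purely as a sketch and not as the module's actual code, is a lookup table that raises an error for unknown types. The render_rgb and render_swir2nirg functions are hypothetical stand-ins for the real image generators.

```python
import os


def render_rgb():
    """Stand-in for the real RGB image generator (hypothetical)."""
    return "rgb image"


def render_swir2nirg():
    """Stand-in for the real false-color image generator (hypothetical)."""
    return "swir2nirg image"


# Maps each supported S3_IMAGE_TYPE value to the function that handles it.
GENERATORS = {
    "rgb": render_rgb,
    "swir2nirg": render_swir2nirg,
}


def generate_image(environ=os.environ):
    """Look up and call the generator named by S3_IMAGE_TYPE."""
    image_type = environ.get("S3_IMAGE_TYPE", "").lower()
    try:
        generator = GENERATORS[image_type]
    except KeyError:
        supported = ", ".join(sorted(GENERATORS))
        raise ValueError(
            "Unsupported S3_IMAGE_TYPE %r; expected one of: %s"
            % (image_type, supported)
        )
    return generator()


print(generate_image({"S3_IMAGE_TYPE": "RGB"}))
```

Failing loudly is useful in a container, where a job that quietly produces no image is harder to diagnose than one that exits with a clear message.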
There's another important change that we had to make. In order for matplotlib to run successfully on EC2, we need to set an explicit backend. The matplotlib module is only smart enough to choose a backend based on the operating system. As it has been designed for use with GUIs, it makes an assumption that you not only have a DISPLAY environment variable set, but more importantly, there is an actual display to which this variable points.
On EC2 and other Cloud environments, this will almost never be the case. If you look at the top of the lib/ec2s3eros.py module, you will see the following:
import os
import matplotlib as mpl
if os.environ.get("DOCKER_CONTAINER") == "true":
    mpl.use("Agg")
The environment variable that you see in the preceding code is the one that is set in the CMD directive of the Dockerfile:
CMD DOCKER_CONTAINER=true PYTHONPATH=${HOME}/${REPO}/lib:$PYTHONPATH python3 -c "import eros;eros.s3_generate_image();"
As you can see in the module, we use the environment variable to determine whether the module is being used in a Docker container (where no display is available). If it is, we explicitly set the backend to one that will not throw errors when there is no display.
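The check can be factored into a small function so that it is easy to test without a container. The sketch below is an illustration only: choose_backend is a hypothetical helper, and treating a missing DISPLAY variable on Linux as headless is a heuristic assumed here, not something the chapter's module does.

```python
import os
import sys


def choose_backend(environ, platform):
    """Return "Agg" for headless environments, or None to keep the default."""
    # Explicit marker set by the Dockerfile's CMD directive.
    if environ.get("DOCKER_CONTAINER") == "true":
        return "Agg"
    # Heuristic (an assumption): on Linux, no DISPLAY usually means no GUI.
    if platform.startswith("linux") and not environ.get("DISPLAY"):
        return "Agg"
    return None


backend = choose_backend(os.environ, sys.platform)
if backend is not None:
    try:
        import matplotlib
        # Call this before importing matplotlib.pyplot.
        matplotlib.use(backend)
    except ImportError:
        pass  # matplotlib is not installed where this sketch is being run
print("backend:", backend or "matplotlib default")
```

Passing the environment and platform in as arguments keeps the decision logic independent of the machine on which the code happens to run.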
In the preceding example, we have done this in the code, since the file already existed and it was just a two-line change. However, we can also provide a custom matplotlibrc file, which will set the default backend. For the long term, this is probably the better approach because of the following reasons:
The backend can be set once, in the Dockerfile that installs matplotlib (for us, this was the one that generated the masteringmatplotlib/scipy Docker image)
Every image built on that base inherits the matplotlibrc file, and you will not need to make any code changes to run in virtualized environments.
Consequently, the developer and user experience for these Docker images will be greatly improved. The administrators who are responsible for the creation of new images with these as the basis will have less work to do and the users will have one less error to face when getting started.
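To make the matplotlibrc approach concrete, the following sketch writes a one-line configuration file that selects the Agg backend. It is an illustration under assumptions: write_matplotlibrc is a hypothetical helper, and a temporary directory stands in for wherever your base image would actually keep the file.

```python
import os
import tempfile

# A one-line matplotlibrc that makes the non-interactive Agg backend the default.
MATPLOTLIBRC_CONTENT = "backend: Agg\n"


def write_matplotlibrc(directory):
    """Write a minimal matplotlibrc into `directory` and return its path."""
    path = os.path.join(directory, "matplotlibrc")
    with open(path, "w") as rc_file:
        rc_file.write(MATPLOTLIBRC_CONTENT)
    return path


# A temporary directory is used here only so that the sketch is safe to run.
config_dir = tempfile.mkdtemp()
rc_path = write_matplotlibrc(config_dir)

with open(rc_path) as rc_file:
    print(rc_file.read().strip())
```

In a Dockerfile, the equivalent would be to place such a file in the image and point the MATPLOTLIBRC environment variable at the directory that contains it, so that matplotlib picks it up on import.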
Back on your workstation, having made the necessary changes to your custom Dockerfile, you can now create an updated version of your Docker image with the help of the following code:
$ docker build -t yourname/eros ./docker/eros/
Next, you'll need to publish the image to Docker Hub so that you can pull it down on your EC2 instance:
$ docker push yourname/eros
On your EC2 instance, get the latest version of yourname/eros that you just published:
ubuntu@ip-address:~$ sudo docker pull yourname/eros
With the last step, everything is now in place and your jobs are ready to be executed.
At this point, you can run a Docker container from your latest Docker image to generate a file for the RGB satellite image data by using the following code:
ubuntu@ip-address:~$ export S3_BUCKET=scoresbysund
ubuntu@ip-address:~$ export SCENE=LC82260102014232LGN00
ubuntu@ip-address:~$ export IMGTYPE=rgb
ubuntu@ip-address:~$ sudo docker run \
    -e "S3_IMAGE_TITLE=RGB Image: Scene $SCENE" \
    -e "S3_IMAGE_TYPE=$IMGTYPE" \
    -e "S3_IMAGE_FILENAME=$SCENE-$IMGTYPE-`date "+%Y%m%d%H%M%S"`.png" \
    -e "S3_BUCKET_NAME=$S3_BUCKET" \
    -e "S3_PATH=https://s3-us-west-2.amazonaws.com/$S3_BUCKET" \
    -e "EROS_SCENE_ID=$SCENE" \
    -e "AWS_ACCESS_KEY_ID=`cat /etc/aws/access`" \
    -e "AWS_SECRET_ACCESS_KEY=`cat /etc/aws/secret`" \
    -t yourname/eros
As the task runs, you will see the following output:
Generating scene image ...
Saving image to S3 ...
0.0/100
27.499622350156933/100
54.99924470031387/100
82.4988670504708/100
100.0/100
Remember that on a relatively modern iMac, this job took about 7 to 8 minutes. Executing it on EC2 just now only took about 15 seconds.
For the false-color short-wave infrared image, you can run a similar command, as follows:
ubuntu@ip-address:~$ export IMGTYPE=swir2nirg
ubuntu@ip-address:~$ sudo docker run \
    -e "S3_IMAGE_TITLE=False-Color Image: Scene $SCENE" \
    -e "S3_IMAGE_TYPE=$IMGTYPE" \
    -e "S3_IMAGE_FILENAME=$SCENE-$IMGTYPE-`date "+%Y%m%d%H%M%S"`.png" \
    -e "S3_BUCKET_NAME=$S3_BUCKET" \
    -e "S3_PATH=https://s3-us-west-2.amazonaws.com/$S3_BUCKET" \
    -e "EROS_SCENE_ID=$SCENE" \
    -e "AWS_ACCESS_KEY_ID=`cat /etc/aws/access`" \
    -e "AWS_SECRET_ACCESS_KEY=`cat /etc/aws/secret`" \
    -t yourname/eros
Generating scene image ...
Saving image to S3 ...
0.0/100
26.05542715379233/100
52.11085430758466/100
78.16628146137698/100
100.0/100
You can confirm that both of the images have been saved to your bucket by refreshing the S3 screen in your AWS Console.
Though it may seem awkward to parameterize the Docker container with so many environment variables, this allows you to easily change the data that you pass without having to regenerate the Docker image. Your Docker image produces containers that are generally useful for the task at hand—and potentially many other tasks—allowing you to process any scene without any code changes.
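Because every job differs only in a handful of values, the environment for a run can be computed from the scene, the image type, and the bucket. The following is a minimal sketch under assumptions: image_job_env is a hypothetical helper, the variable names and the S3 host are taken from the commands shown earlier, and the credentials are intentionally omitted.

```python
from datetime import datetime


def image_job_env(scene, image_type, bucket, title, now=None,
                  s3_host="s3-us-west-2.amazonaws.com"):
    """Return the environment variables for one image generation job."""
    now = now or datetime.now()
    # Same timestamp layout as `date "+%Y%m%d%H%M%S"` in the shell commands.
    timestamp = now.strftime("%Y%m%d%H%M%S")
    return {
        "S3_IMAGE_TITLE": "%s: Scene %s" % (title, scene),
        "S3_IMAGE_TYPE": image_type,
        "S3_IMAGE_FILENAME": "%s-%s-%s.png" % (scene, image_type, timestamp),
        "S3_BUCKET_NAME": bucket,
        "S3_PATH": "https://%s/%s" % (s3_host, bucket),
        "EROS_SCENE_ID": scene,
    }


job_env = image_job_env(
    "LC82260102014232LGN00", "rgb", "scoresbysund", "RGB Image")
for name in sorted(job_env):
    print("%s=%s" % (name, job_env[name]))
```

Processing a different scene or image type then only requires changing the arguments, which is the same flexibility that the environment variables give the Docker image itself.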