Chapter 7. Deploying matplotlib in Cloud Environments

With this chapter, we move into topics that focus on computationally intensive matplotlib tasks. Heavy computation is not usually associated with matplotlib directly, but rather with libraries such as NumPy, Pandas, or scikit-learn, which are often brought to bear on large number-crunching jobs. However, there are a number of situations in which organizations or individual researchers need to generate a large number of plots. In the remainder of the book, our exploration of matplotlib in advanced usage scenarios will rely on free or low-cost modern techniques that are available to the public. In the early 1960s, the computer scientist John McCarthy predicted a day when computational resources would be available in the same way as public utilities such as electricity and water. This has indeed come to pass, and we will now turn our focus to these kinds of environments.

We will cover the following topics in this chapter:

  • Making a use case for matplotlib in the Cloud
    • Preparing a well-defined workflow
    • Choosing technologies
  • AWS and Docker
    • Local setup
    • Using Docker
    • Thinking about deployment
    • Working with AWS
    • Running matplotlib tasks

To follow along with this chapter's code, clone the notebook's repository and start up IPython by using the following code:

$ git clone https://github.com/masteringmatplotlib/cloud-deploy.git
$ cd cloud-deploy
$ make

Making a use case for matplotlib in the Cloud

At first blush, it may seem odd to contemplate the distributed use of a library that has historically focused on desktop-type environments. However, if we pause to consider it, the value becomes clear. You have probably noticed that with large data sets or complex plots, matplotlib runs more slowly than we might like. What should we do when we need to generate a handful of plots for very large data sets, or thousands of plots from diverse sources? If this sounds far-fetched, keep in mind that there are companies that run massive PDF-generating server farms for exactly such activities.

This chapter will deal with a similar use case. You are a researcher working for a small company, tracking climatic patterns and possible changes at both poles. Your team is focused on the Arctic, and your assignment is to process the satellite imagery for the east coast of Greenland, which includes not only the new images as they arrive (every 16 days), but also the previously captured satellite data. For the newer material (2013 onwards), you will be utilizing Landsat 8 data, made available through the combined efforts of the United States Geological Survey (USGS), the NASA Landsat 8 project, and the USGS EROS data archival services.

The data source

For your project, you will be acquiring data from the EROS archives by using the USGS EarthExplorer site (downloads require a registered user account, which is free). You will use their map to locate scenes, specific geographic areas of satellite imagery that can be downloaded using the EarthExplorer Bulk Download Application (BDA). Your initial focus will be data from Scoresby Sund (Scoresbysund), the largest and longest fjord system in the world. Your first set of data will come from the LC82260102014232LGN00 scene ID, a capture taken in August 2014, as shown in the following screenshot:

[Screenshot: the LC82260102014232LGN00 scene in the EarthExplorer map view]

Once you have marked the area in EarthExplorer, click on the Data Sets view, expand Landsat Archive, and then select L8 OLI/TIRS. After clicking on Results, you will be presented with a list of scenes, each with a preview; click on a thumbnail to check whether you have the right scene. Once you have located the scene, click on the small package icon (light brown/tan in color). If you haven't logged in, you will be prompted to do so. Add the scene to your cart, and then go to your cart to complete the free order.

Next, you will need to open the BDA and download your order from there (when the BDA opens, it will show your pending orders and present you with the option of downloading them). The BDA will download a tarball (a tar archive) to a given directory. From that directory, you can create a scene directory and unpack the files, as sketched below.
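
Unpacking can also be scripted. The following is a minimal sketch using Python's standard tarfile module; the download filename and the destination path are assumptions based on the layout used later in this chapter:

import os
import tarfile

# Assumed locations; adjust to wherever the BDA saved your order.
download = os.path.expanduser("~/Downloads/LC82260102014232LGN00.tar.gz")
scene_dir = "/EROSData/L8_OLI_TIRS/LC82260102014232LGN00"

os.makedirs(scene_dir, exist_ok=True)
with tarfile.open(download) as archive:
    # Unpack the per-band GeoTIFF files and metadata into the scene directory.
    archive.extractall(scene_dir)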

Defining a workflow

Before creating a Cloud workflow, we need to step through the process manually to identify all the steps and indicate those that may be automated. We will be using a data set from a specific point in time, but what we define here should be usable with any Landsat 8 data, and some of it with older satellite remote-sensing data as well.

We will start by organizing the data. The BDA saves its downloads to a specific location (different for each operating system). Let's move all the data that you've downloaded with the EarthExplorer BDA to a location that we can easily reference in our IPython Notebook: /EROSData/L8_OLI_TIRS. Ensure that your scene data is in the LC82260102014232LGN00 directory.

Next, let's perform the necessary imports and the variable definitions, as follows:

In [1]: import matplotlib
        matplotlib.use('nbagg')
        %matplotlib inline
In [2]: from IPython.display import Image 
        import sys
        sys.path.append("../lib")
        import eros

With the last line of code, we bring in the custom code created for this chapter's task (based on demonstration code by Milos Miljkovic, delivered as part of a talk at the PyData 2014 conference). Here are the variables that we will be using:

In [3]: path = "/EROSData/L8_OLI_TIRS"
        scene_id = "LC82260102014232LGN00"

With this in place, we're ready to read some Landsat band data and assemble it into an RGB image in the following way:

In [4]: rgb_image = eros.extract_rgb(path, scene_id)

If you examine the source for this function call, you will see that it identifies the files associated with the Landsat band data, extracts the data from the source files, and then creates a data structure representing the red, green, and blue channels needed for digital color images (a rough sketch of this idea appears after the band list below). The Landsat 8 bands are as follows:

  • Band 1: Deep blue and violet. This band is useful for tracking coastal waters and fine particles in the air.
  • Band 2: Visible blue light.
  • Band 3: Visible green light.
  • Band 4: Visible red light.
  • Band 5: Near-Infrared (NIR) light. This band is useful for viewing healthy plants.
  • Bands 6 and 7: Short-Wavelength Infrared (SWIR) light. These bands are useful for distinguishing wet earth from dry earth and for showing the contrast between rocks and soil.
  • Band 8: A panchromatic band that, like black and white film, combines all the visible colors. Due to its sharp contrast and high resolution, it is useful for zooming in on details.
  • Band 9: A narrow slice of wavelengths used by few space-based instruments. It is useful for examining cloud cover and very bright objects.
  • Bands 10 and 11: Thermal infrared light. These bands are useful for obtaining the temperature of the air.
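
The chapter repository contains the actual implementation of eros.extract_rgb; as a rough sketch of the idea (not the module's exact code), a function along these lines could read the three visible-light bands and stack them into a single array. The Landsat 8 band-file naming pattern is standard, while the use of scikit-image for reading the GeoTIFFs is an assumption:

import os
import numpy as np
from skimage import io

def extract_rgb_sketch(path, scene_id):
    """Stack Landsat 8 bands 4, 3, and 2 (red, green, and blue)
    into a single (rows, columns, 3) image array."""
    channels = []
    for band in (4, 3, 2):
        # Band files follow the <scene_id>_B<n>.TIF naming pattern.
        band_file = os.path.join(
            path, scene_id, "{0}_B{1}.TIF".format(scene_id, band))
        channels.append(io.imread(band_file))
    return np.dstack(channels)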

In your task, you will use bands 1 through 4 for water and RGB, 5 for vegetation, and 7 to pick out the rocks. Let's take a look at the true-color RGB image for the bands that we just extracted by using the following code:

In [5]: eros.show_image(
    rgb_image, "RGB image, data set " + scene_id,
    figsize=(20, 20))

The following image is the result of the preceding code:

[Image: RGB image, data set LC82260102014232LGN00]

The preceding image isn't very clear. The colors are all quite muted. We can gain some insight into this by looking at a histogram of the data files for each color channel by using the following code:

In [6]: eros.show_color_hist(
    rgb_image, xlim=[5000,20000], ylim=[0,10000],
    figsize=(20, 7))

The following histogram is the result of the preceding code:

[Image: histograms of the red, green, and blue channels]

As you can see, most of the color information is concentrated in a fairly narrow range of values, even though data well outside that range is still included. Let's create a new image using ranges based on a visual assessment of the preceding histogram.
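
Under the hood, a helper like eros.update_image amounts to a per-channel contrast stretch. The following is a minimal sketch of that idea rather than the chapter's exact code; the function name and the use of scikit-image's rescale_intensity are assumptions:

import numpy as np
from skimage import exposure

def stretch_channels_sketch(image, r_range, g_range, b_range):
    """Clip each channel to the given (low, high) range and rescale
    it to the full range of the image's data type."""
    stretched = np.empty_like(image)
    for i, in_range in enumerate((r_range, g_range, b_range)):
        stretched[..., i] = exposure.rescale_intensity(
            image[..., i], in_range=in_range)
    return stretched

With that picture in mind, we'll limit the red channel to the range of 5900-11000, green to 6200-11000, and blue to 7600-11000: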

In [7]: rgb_image_he = eros.update_image(
    rgb_image,
    (5900, 11000), (6200, 11000), (7600, 11000))
        eros.show_image(
    rgb_image_he, "RGB image, histogram equalized",
    figsize=(20, 20))

The following image is the result of the preceding code:

[Image: RGB image, histogram equalized]

With the preceding changes, the colors really pop out of the satellite data. Next, you need to create your false-color image.

You will use Landsat 8 band 1 (coastal aerosol) as blue, band 5 (NIR) as green, and band 7 (SWIR) as red to gain insight into the presence of water on land, ice coverage, levels of healthy vegetation, and exposed rock or open ground. You will use the same method as with the previous image: generating a histogram, analyzing it for the best points of the color spectrum, and then displaying the image. This can be done with the help of the following code:

In [8]: swir2nirg_image = eros.extract_swir2nirg(path, scene_id)
In [9]: eros.show_color_hist(
    swir2nirg_image, xlim=[4000,30000], ylim=[0,10000],
    figsize=(20, 7))

The following histogram is the result of the preceding code:

[Image: histograms of the false-color channels]

Let's create the image using ranges based on this histogram:

In [10]: swir2nirg_image_he = eros.update_image(
    swir2nirg_image,
    (5900, 15000), (6200, 15000), (7600, 15000))
        eros.show_image(swir2nirg_image_he, "",
    figsize=(20, 20))

The following is the resultant image:

[Image: false-color image of the scene]

On a 2009 iMac (Intel Core i7, 8 GB RAM), the processing of the Landsat 8 data and the generation of the associated images took about 7 minutes, with memory usage peaking at around 9 GB. For multiple runs, the IPython kernel needed to be restarted just to free up the RAM quickly enough. It's quite clear that performing these tasks on a moderately equipped workstation would be logistically and economically infeasible for thousands (or even hundreds) of scenes.

So, you will instead accomplish these tasks with the help of utility computing. The following are the steps that need to be carried out for each scene (a code sketch that pulls them together follows the list):

  1. Define a Landsat 8 scene ID.
  2. Ensure that the data is available.
  3. Extract the coastal/aerosol, RGB, NIR, and SWIR data.
  4. Identify the optimum ranges for display in each channel. We'll skip this step when we automate; however, it is an excellent exercise for the motivated reader.
  5. Generate the image files for the extracted data.
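
Gathered into code, the automatable portion of this workflow might look something like the following sketch. It reuses the chapter's eros helpers and the ranges found earlier; the output directory, the normalization step, and the fixed ranges standing in for step 4 are all assumptions:

import os
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display is needed on a server
import matplotlib.pyplot as plt
import eros  # the chapter's custom module

def process_scene(path, scene_id, out_dir="/EROSData/output"):
    """Run steps 1 through 5 for a single Landsat 8 scene, with
    fixed channel ranges in place of manual histogram analysis."""
    # Step 2: ensure that the unpacked scene data is available.
    if not os.path.isdir(os.path.join(path, scene_id)):
        raise RuntimeError("Scene data not found: " + scene_id)
    # Steps 3 and 4 (hard-coded): extract and stretch the channels.
    rgb = eros.update_image(
        eros.extract_rgb(path, scene_id),
        (5900, 11000), (6200, 11000), (7600, 11000))
    false_color = eros.update_image(
        eros.extract_swir2nirg(path, scene_id),
        (5900, 15000), (6200, 15000), (7600, 15000))
    # Step 5: write the image files for the extracted data.
    os.makedirs(out_dir, exist_ok=True)
    for name, image in [("rgb", rgb), ("false-color", false_color)]:
        # Normalize to the 0-1 float range that imsave expects.
        data = image.astype(float) / image.max()
        plt.imsave(
            os.path.join(out_dir, scene_id + "-" + name + ".png"), data)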

These steps, and the code that implements them, need to be migrated to the desired Cloud platform and augmented according to the needs of the associated tools. This brings us to an important question: which technology should we use?

Choosing technologies

There is a dizzying array of choices when it comes to selecting a combination of a vendor, an operating system, vendor service options, a configuration management solution, and deployment options for Cloud environments. You can select from several providers, such as Google Cloud Platform, Amazon AWS, Heroku, Docker's dotCloud, or one of the many OpenStack-based vendors. Linux or BSD is probably the best choice for the host and guest OS, but even that leaves open many possibilities. Some vendors offer RESTful web services, SOAP, or dedicated client libraries that either wrap one of these or provide direct access.

In your case, you've done some testing on the speed of transferring fairly large files (approximately 150 MB for each Landsat 8 band) from a storage service to a running instance. Combining the speed requirements with usability, you found that, at the time of this writing, Amazon's AWS came out the winner in a close race against competing Cloud service platforms. Since we will be using recent versions of Python (3.4.2) and matplotlib (1.4.2), and we need a distribution that provides these pre-built, we have opted for Ubuntu 15.04. You will spin up guest OS instances to run each image-processing job, but you now need to decide how to configure them and determine the level of automation that is needed.

Configuration management

Configuration management can encompass topics such as version control, packaging, software distribution, build management, deployment, and change control, just to name a few. For our purposes, we will focus on configuration management as it concerns the following:

  • High-level dependency management
  • The creation and management of baseline systems, as well as the task of building upon them
  • Deployment of a highly specified system to a particular environment

In the world of open source software configuration management, two giants stand out: Chef and Puppet. Both were originally written in Ruby, with the Chef server having since been rewritten in Erlang. In the world of Python, Salt and Ansible have risen to great prominence. Unfortunately, at the time of writing, neither of the Python solutions supports Python 3. Systems like Chef and Puppet are fairly complex and suited to the problems of managing large numbers of systems with a multitude of possible configurations under continually changing circumstances. Unless one already has expertise in these systems, their use is outside the scope of our current task.

This brings us to an interesting option that sits almost outside the realm of configuration management: Docker. Docker is software that wraps access to Linux container technology, allowing the fast creation of operating system images that can then be run on a host system. Containers share much of the underlying host system while providing an isolated environment that can be customized to specific needs.

It was in this capacity that Docker made its way into the Linux world of configuration management, via system administrators who were looking for more straightforward solutions to problems that did not require the feature sets and complexities of larger tools. Likewise, it is a good match for our needs. As a part of this chapter, we have provided various baseline images for your use. These are as follows:

  • masteringmatplotlib/python: This is a Python 3.4.2 image built on the official Ubuntu 15.04 Docker image
  • masteringmatplotlib/scipy: This is a NumPy 1.8.2, SciPy 0.14.1, and matplotlib 1.4.2 image that is based on the masteringmatplotlib/python image
  • masteringmatplotlib/eros: This is a custom image, based on the masteringmatplotlib/scipy image, that contains not only the software used in this chapter, but also the Python Imaging Library (PIL) and scikit-image

We will discuss the Docker images in more detail shortly.

Types of deployment

Docker had its genesis in a Cloud platform, and as one might guess, it is ideally suited for deployment on multiple Cloud platforms, including Amazon AWS, Google Cloud Platform, Azure, OpenStack, dotCloud, Joyent, and OpenShift. Each of these differs from the others, only slightly when it comes to some features and enormously with regard to others. Conceptually, though, they all offer utility-scale virtualization services, something that is particularly well suited to Docker. Which one is best for your needs depends on many of the same criteria that apply to any hosting scenario, regardless of the underlying technology.

Each of these will also let you spin up multiple Docker containers, allowing entire application stacks to run as an assortment of Docker images. With Docker's support for orchestration through a new set of tools, the range of deployment options and the associated flexibility have increased greatly.

As mentioned previously, in your tests for the Landsat 8 data scenario, you assessed AWS as the best fit. You looked at Elastic Beanstalk, but opted for a simpler solution that offers more control: you will deploy a large Elastic Compute Cloud (EC2) Linux instance and use it to fire up satellite-data-processing Docker containers as needed, as sketched below.
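
To make that concrete, here is one minimal way the EC2 host could fire up a container per scene. The masteringmatplotlib/eros image is the chapter's baseline, while the volume layout, the environment variable, and the process_scene.py entry point are assumptions for illustration:

import subprocess

SCENES = ["LC82260102014232LGN00"]  # extend with further scene IDs

def run_scene_container(scene_id):
    """Launch a one-off Docker container to process a single scene,
    sharing the host's EROS data directory with the container."""
    subprocess.check_call([
        "docker", "run", "--rm",
        "-v", "/EROSData:/EROSData",     # mount the scene data
        "-e", "SCENE_ID=" + scene_id,    # tell the job which scene to run
        "masteringmatplotlib/eros",
        "python3", "/opt/eros/process_scene.py",  # hypothetical entry point
    ])

for scene in SCENES:
    run_scene_container(scene)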
