With this chapter, we move into topics that focus on computationally intensive matplotlib tasks. Such workloads are not usually associated with matplotlib directly, but rather with libraries like NumPy, Pandas, or scikit-learn, which are often brought to bear on large number-crunching jobs. However, there are a number of situations in which organizations or individual researchers need to generate a large number of plots. In the remainder of the book, our exploration of matplotlib in advanced usage scenarios will rely on the free or low-cost modern techniques that are available to the public. In the early 1960s, the famous computer scientist John McCarthy predicted a day when computational resources would be available like the public utilities of electricity and water. This has indeed come to pass, and we will now turn our focus to these types of environments.
We will cover the following topics in this chapter:
To follow along with this chapter's code, clone the notebook's repository and start up IPython by using the following code:
$ git clone https://github.com/masteringmatplotlib/cloud-deploy.git
$ cd cloud-deploy
$ make
At first blush, it may seem odd to contemplate the distributed use of a library that has historically been focused on desktop environments. However, if we pause to consider it, we will see its value. You have probably noticed that with large data sets or complex plots, matplotlib runs more slowly than we might like. What should we do when we need to generate a handful of plots for very large data sets, or thousands of plots from diverse sources? If this sounds far-fetched, keep in mind that there are companies that run massive farms of PDF-generating servers for just such activities.
This chapter will deal with a similar use case. You are a researcher working for a small company, tracking climatic patterns and possible changes at both poles. Your team is focused on the Arctic, and your assignment is to process the satellite imagery for the east coast of Greenland, which includes not only the new images as they arrive (every 16 days), but also the previously captured satellite data. For the newer material (2013 onwards), you will be utilizing the Landsat 8 data, which was made available through the combined efforts of the United States Geological Survey (USGS), the NASA Landsat 8 project, and the USGS EROS data archival services.
For your project, you will be acquiring data from the EROS archives by using the USGS EarthExplorer site (downloads require a registered user account, which is free). You will use their map to locate scenes, specific geographic areas of satellite imagery that can be downloaded using the EarthExplorer Bulk Download Application (BDA). Your initial focus will be data from Scoresbysund, the largest and longest fjord system in the world. Your first set of data will come from the LC82260102014232LGN00 scene ID, a capture that was taken in August 2014, as shown in the following screenshot:
Once you have marked the area in EarthExplorer, click the Data Sets view, expand Landsat Archive, and then select L8 OLI/TIRS. After clicking on Results, you will be presented with a list of scenes, each with a preview. You can click on the thumbnail image to see the preview and check whether you have the right scene. Once you have located the scene, click on the small package icon (it will be light brown or tan in color). If you haven't logged in, you will be prompted to do so. Add the scene to your cart, and then go to your cart to complete the free order.
Next, you will need to open the BDA and download your order from there (when the BDA opens, it will show the pending orders and present you with the option of downloading them). BDA will download a tarball (tar archive) to the given directory. From that directory, you can create a scene directory and unpack the files.
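The exact download location varies by operating system, but the unpacking step looks roughly like the following. The paths here are illustrative, and a stand-in tarball is created so the commands can be run end to end; with real data, skip that part and point TARBALL at the archive the BDA downloaded:

```shell
SCENE=LC82260102014232LGN00
DEST=/tmp/EROSData/L8_OLI_TIRS/$SCENE
mkdir -p "$DEST"

# Create a stand-in tarball for demonstration purposes only;
# the BDA will have downloaded the real archive for you.
TARBALL=/tmp/$SCENE.tar.gz
workdir=$(mktemp -d)
touch "$workdir/${SCENE}_B4.TIF"
tar -czf "$TARBALL" -C "$workdir" .

# Unpack the archive into the scene directory
tar -xzf "$TARBALL" -C "$DEST"
ls "$DEST"
```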
Before creating a Cloud workflow, we need to step through the process manually to identify all the steps and note which of them may be automated. We will be using a data set from a specific point in time, but what we define here should be usable with any Landsat 8 data, and some of it will also work with older satellite remote sensing data.
We will start by organizing the data. The BDA saves its downloads to a specific location (different for each operating system). Let's move all the data that you downloaded with the EarthExplorer BDA to a location that we can easily reference in our IPython Notebook: /EROSData/L8_OLI_TIRS. Ensure that your scene data is in the LC82260102014232LGN00 directory.
Next, let's perform the necessary imports and the variable definitions, as follows:
In [1]: import matplotlib
        matplotlib.use('nbagg')
        %matplotlib inline

In [2]: from IPython.display import Image
        import sys
        sys.path.append("../lib")
        import eros
With the last line of code, we bring in the custom code created for this chapter and task (based on the demonstration code by Milos Miljkovic, which he delivered as a part of a talk at the PyData 2014 conference). Here are the variables that we will be using:
In [3]: path = "/EROSData/L8_OLI_TIRS"
        scene_id = "LC82260102014232LGN00"
With this in place, we're ready to read some Landsat data and extract the RGB channels, in the following way:
In [4]: rgb_image = eros.extract_rgb(path, scene_id)
If you examine the source for this function call, you will see that it identifies the files that are associated with the Landsat band data, extracts the data from them, and then creates a data structure to represent the red, green, and blue channels needed for digital color images. The Landsat 8 bands are as follows:

- Band 1: Coastal aerosol
- Band 2: Blue
- Band 3: Green
- Band 4: Red
- Band 5: Near-infrared (NIR)
- Band 6: Short-wave infrared (SWIR) 1
- Band 7: Short-wave infrared (SWIR) 2
- Band 8: Panchromatic
- Band 9: Cirrus
- Band 10: Thermal infrared (TIRS) 1
- Band 11: Thermal infrared (TIRS) 2
In your task, you will use bands 1 through 4 for water and RGB, 5 for vegetation, and 7 to pick out the rocks. Let's take a look at the true-color RGB image for the bands that we just extracted by using the following code:
In [5]: eros.show_image(
            rgb_image,
            "RGB image, data set " + scene_id,
            figsize=(20, 20))
The following image is the result of the preceding code:
The preceding image isn't very clear; the colors are all quite muted. We can gain some insight into why by looking at a histogram of the data for each color channel, using the following code:
In [6]: eros.show_color_hist(
            rgb_image,
            xlim=[5000, 20000],
            ylim=[0, 10000],
            figsize=(20, 7))
The following histogram is the result of the preceding code:
As you can see, most of the color information is concentrated in a fairly narrow band, while the remaining data is spread thinly across the rest of the range. Let's create a new image using ranges based on a visual assessment of the preceding histogram. We'll limit the red channel to the range of 5900-11000, green to 6200-11000, and blue to 7600-11000:
In [7]: rgb_image_he = eros.update_image(
            rgb_image,
            (5900, 11000),
            (6200, 11000),
            (7600, 11000))
        eros.show_image(
            rgb_image_he,
            "RGB image, histogram equalized",
            figsize=(20, 20))
The following image is the result of the preceding code:
With the preceding changes, the colors really pop out of the satellite data. Next, you need to create your false-color image.
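The update_image function is part of this chapter's custom library, but the underlying operation, clipping each channel to a chosen range and rescaling it linearly, can be sketched as follows. The function name stretch_channel and the sample values are illustrative, not part of the chapter's library:

```python
import numpy as np

def stretch_channel(channel, lo, hi):
    """Clip a band to [lo, hi] and rescale it linearly to [0.0, 1.0]."""
    scaled = (channel.astype(float) - lo) / (hi - lo)
    return np.clip(scaled, 0.0, 1.0)

# Toy stand-in for a red channel; real bands are large uint16 arrays.
raw = np.array([5000, 5900, 8450, 11000, 15000], dtype=np.uint16)
stretched = stretch_channel(raw, 5900, 11000)
# Values at or below 5900 map to 0.0, 8450 maps to 0.5,
# and values at or above 11000 map to 1.0.
print(stretched)
```

Applying such a stretch per channel is what makes the muted colors in the raw composite fill the full displayable range.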
You will use the Landsat 8 band 1 (coastal aerosol) as blue, band 5 (NIR) as green, and band 7 (SWIR) as red to generate a false-color image, which gives insight into the presence of water on land, ice coverage, levels of healthy vegetation, and exposed rock or open ground. You will use the same method as with the previous image: generate a histogram, analyze it for the best points of the color spectrum, and then display the image. This can be done with the help of the following code:
In [8]: swir2nirg_image = eros.extract_swir2nirg(path, scene_id)

In [9]: eros.show_color_hist(
            swir2nirg_image,
            xlim=[4000, 30000],
            ylim=[0, 10000],
            figsize=(20, 7))
The following histogram is the result of the preceding code:
Let's create the image for the histogram using the following code:
In [10]: swir2nirg_image_he = eros.update_image(
             swir2nirg_image,
             (5900, 15000),
             (6200, 15000),
             (7600, 15000))
         eros.show_image(swir2nirg_image_he, "", figsize=(20, 20))
The following is the resultant image:
On a 2009 iMac (Intel Core i7, 8 GB RAM), processing the Landsat 8 data and generating the associated images took about 7 minutes, with RAM usage peaking at around 9 GB. Between runs, the IPython kernel had to be restarted just to free up RAM quickly enough. It's quite clear that performing these tasks on a moderately equipped workstation would be logistically and economically unfeasible for thousands (or even hundreds) of scenes.
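Short of restarting the kernel, one partial mitigation (a generic Python technique, not something from the chapter's library) is to delete the references to the large arrays and trigger garbage collection between runs:

```python
import gc
import numpy as np

# A stand-in for the large per-band arrays used in the processing runs
big = np.zeros((2000, 2000, 3), dtype=np.float64)  # roughly 96 MB

# ... image processing and plotting would happen here ...

del big       # drop the last reference to the array
gc.collect()  # encourage prompt return of the memory to the system
```

This helps with arrays you control, but memory retained by the plotting machinery itself is harder to reclaim, which is why a kernel restart was still needed in practice.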
So, you will instead accomplish these tasks with the help of utility computing. The necessary steps are as follows:
These need to be migrated to the desired Cloud platform and augmented according to the needs of the associated tools. This brings us to an important question: which technology should we use?
There is a dizzying array of choices when it comes to selecting a combination of vendor, operating system, vendor service options, configuration management solution, and deployment options for Cloud environments. You can select from several Cloud providers, such as Google Cloud Platform, Amazon AWS, Heroku, and Docker's dotCloud, or from any of the various OpenStack-based providers. Linux or BSD is probably the best choice for the host and guest OS, but even that leaves many possibilities open. Some vendors offer RESTful web services, SOAP, or dedicated client libraries that either wrap one of these or provide direct access.
In your case, you've done some testing on the speed of transferring fairly large files (approximately 150 MB for each Landsat 8 band) from a storage service to a running instance. Combining the speed requirements with usability, you found that, at the time of writing, Amazon's AWS came out the winner in a close race against competing Cloud service platforms. Since we will be using recent versions of Python (3.4.2) and matplotlib (1.4.2), and we need a distribution that provides these pre-built, we have opted for Ubuntu 15.04. You will spin up guest OS instances to run each image-processing job, but now you need to decide how to configure them and determine the level of automation that is needed.
Systems administrators have often looked for straightforward solutions to problems that do not require the feature sets and complexities of larger tools. Configuration management can encompass topics such as version control, packaging, software distribution, build management, deployment, and change control, just to name a few. For our purposes, we will focus on configuration management as it concerns the following:
In the world of open source software configuration management, there are two giants that stand out: Chef and Puppet. Both of these were originally written in Ruby, with the Chef server having since been rewritten in Erlang. In the world of Python, Salt and Ansible have risen to great prominence. Unfortunately, neither of the Python solutions currently supports Python 3. Systems like Chef and Puppet are fairly complex, suited to managing large numbers of systems with a multitude of possible configurations under continually changing circumstances. Unless one already has expertise in these systems, their use is outside the scope of our current task.
This brings us to an interesting option that sits almost outside the realm of configuration management: Docker. Docker is software that wraps access to the Linux container technology, allowing the fast creation of operating system images that can then be run on a host system. Containers thus share much of the underlying system while providing an isolated environment that can be customized to specific needs.
It was in this capacity that Docker made its way into the Linux world of configuration management, via systems administrators who were looking for more straightforward solutions to problems that did not require the feature sets and complexities of larger tools. Likewise, it is a perfect match for our needs. As a part of this chapter, we have provided various baseline images for your use. These baselines are as follows:
- masteringmatplotlib/python: This is a Python 3.4.2 image built on the official Ubuntu 15.04 Docker image
- masteringmatplotlib/scipy: This is a NumPy 1.8.2, SciPy 0.14.1, and matplotlib 1.4.2 image that is based on the masteringmatplotlib/python image
- masteringmatplotlib/eros: This is a custom image, based on the masteringmatplotlib/scipy image, that contains not only the software used in this chapter, but also the Python Imaging Library (PIL) and scikit-image libraries

We will discuss the Docker images in more detail shortly.
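As a rough illustration of how one of these images builds on its parent, here is a hypothetical Dockerfile in the spirit of the masteringmatplotlib/eros image; the actual Dockerfiles ship with the chapter's repositories, and the package names and paths below are assumptions:

```dockerfile
# Hypothetical sketch; the real Dockerfile is maintained in the
# chapter's repositories.
FROM masteringmatplotlib/scipy
# Add the imaging libraries used for the satellite work
RUN pip3 install Pillow scikit-image
# Include the chapter's custom eros module
COPY lib/eros.py /usr/local/lib/python3.4/site-packages/eros.py
CMD ["python3"]
```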
Docker had its genesis at a Cloud platform company and, as one might guess, it is ideally suited for deployment on multiple Cloud platforms, including the likes of Amazon AWS, Google Cloud Platform, Azure, OpenStack, dotCloud, Joyent, and OpenShift. Each of these differs from the others, only slightly with regard to some features and enormously with regard to others. Conceptually, though, they all offer utility-scale virtualization services, something that is particularly well suited to Docker. Which one is best for your general needs depends on many of the same criteria that apply to any hosting scenario, regardless of the underlying technology.
Each of these will also let you spin up multiple Docker containers, allowing entire application stacks to run as an assortment of Docker images. With Docker's support for orchestration through its newer tooling, the number of deployment options and the associated flexibility have greatly increased.
As mentioned previously, in your tests for the Landsat 8 data scenario, you assessed AWS as the best fit. You looked at Elastic Beanstalk, but opted for a simpler solution that offers more control: you will deploy a large Elastic Compute Cloud (EC2) Linux instance and use it to fire up the satellite-data-processing Docker containers as needed.