Chapter 9. Clustering for matplotlib

In the final chapter of this book, we will address a topic that has been alluded to several times—clustering and parallel programming for matplotlib. Our motivation for discussing this is nearly identical to the one that drove our investigation into working with large datasets. Although matplotlib itself isn't a library that makes direct use of large datasets or provides an API for working with clusters, advanced users of the library will very likely encounter situations where they want to use matplotlib alongside clustered or parallel computation.

Not to put too fine a point on this, we live in a new world of computing. This was presented exceptionally well in the oft-quoted article, The Free Lunch Is Over, by Herb Sutter. With the drastic limitations faced by the semiconductor industry, yearly gains in computing power are no longer a result of faster chips. Instead, we get this benefit through the addition of cores in a single machine. Unfortunately, common practices in programming that have persisted over the past half century leave us ill-prepared to take advantage of this increasingly common form of additional computing power. Programmers need to learn new skills.

One of the most effective means of utilizing multiple cores is parallelization, which is achieved either by converting existing code to execute in parallel or by adopting infrastructure and code paradigms that let us build parallelization in from the start. The programmer who wants to make full use of multiple cores will benefit greatly from learning parallel programming. Fortunately, matplotlib coders can take advantage of this as well.

To provide an entry point to learn more about this topic in the context of matplotlib and scientific computing, we will cover the following topics in this chapter:

  • Clustering and parallel programming
  • Creating a custom worker cluster by using ZeroMQ
  • Using IPython to create clusters
  • Further clustering options

To follow along with this chapter's code, clone the notebook's repository and start up IPython, as follows:

$ git clone https://github.com/masteringmatplotlib/clustering.git
$ cd clustering

Tip

To compile all the dependencies in this chapter's notebook, you may need to set the CC environment variable, for example export CC=gcc. If you are using Mac OS X, you can use export CC=clang.

Now, you can finish the chapter start-up, as follows:

$ make

Clustering and parallel programming

The term clustering may have a number of operational definitions depending on the situation that one is facing or the organization that one is working with. In this chapter, we will use the term in a very general sense to indicate a system of computing nodes across which a task may be split, with the resulting parts executed in parallel on those nodes. We won't specify what these nodes are, as they may be anything from a collection of processes on a single machine to virtual machines or physical computers on a network.

The word "cluster" alludes to a logical collection, but in our definition there is a more important word—parallel. For our purposes, clusters exist to make running code in parallel more efficient or convenient. The topic of parallel computing is a vast one and has an interesting history. However, it rose to greater prominence in 2003 due to the physical limitations that were encountered by the chip-making industry—increased CPU heat, power consumption, and current leakage problems in circuits. As such, CPU performance gains started coming from the addition of more cores to a system. This was discussed in detail by Herb Sutter in his article, The Free Lunch Is Over, which was published in 2005. Ever since, a greater number of mainstream programmers have become interested in taking advantage of the increased number of system cores through the application of parallel programming techniques.

In essence, parallel programming describes scenarios in which computationally intensive work is broken down into smaller pieces that can be run concurrently, taking advantage of more processing resources to solve a problem in a shorter span of time. There are several ways in which one may write parallel code, but our focus will be on the following two, contrasted in the short sketch after the list:

  • Data parallelization: In this, the same calculation is performed on different datasets (and sometimes on the same datasets)
  • Task parallelization: In this, a task is broken down into subtasks, and the subtasks are executed in parallel
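
The following is a minimal sketch, not taken from this chapter's notebook, that contrasts the two approaches using Python's standard multiprocessing module; the function names are purely illustrative:

from multiprocessing import Pool

def square(x):
    # Data parallelism: the same calculation applied to different inputs.
    return x * x

def load_data():
    # Hypothetical subtask: a stand-in for loading a dataset.
    return list(range(100))

def summarize(data):
    # Hypothetical subtask: a stand-in for summarizing a dataset.
    return sum(data) / len(data)

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # Data parallelization: one function, many inputs, run concurrently.
        squares = pool.map(square, range(100))

        # Task parallelization: independent subtasks submitted concurrently.
        loaded = pool.apply_async(load_data)
        mean = pool.apply_async(summarize, (list(range(100)),))
        print(squares[:5], loaded.get()[:5], mean.get())

The same distinction carries over to clusters: with data parallelism, each node runs identical code on its own slice of the data, while with task parallelism, different nodes take on different subtasks of a larger job.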

The sort of problems that are amenable to parallelization include the following:

  • N-body problems (for example, physics simulations that probe the structure of physical reality, such as the work done on the Millennium Run)
  • Structured grid problems (for example, computational fluid dynamics)
  • The Monte Carlo simulation (for example, computational biology, artificial intelligence, and investment evaluations in finance)
  • Combinational logic (for example, brute-force cryptographic techniques)
  • Graph traversal (for example, calculating the least expense and least time for shipping companies)
  • Dynamic programming (for example, mathematical and computational optimizations, RNA structure prediction, and optimal consumption and saving in economic modeling)
  • Bayesian networks (for example, risk analysis, decision systems, document classification, and biological belief modeling)

However, the reader will be relieved to know that we will focus on a simple example in order to more clearly apply the basic principles of parallel programming. In particular, the examples in this chapter will utilize a parallelizable means to estimate the value of π.
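
As one concrete example of such a method (the chapter's notebook may use a different variant), here is a minimal serial sketch of the Monte Carlo approach to estimating π; the point-counting step is the part that lends itself to being split across workers:

import random

def count_inside(samples):
    # Count random points in the unit square that fall inside the
    # quarter circle of radius 1 (x**2 + y**2 <= 1).
    inside = 0
    for _ in range(samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside

def estimate_pi(samples=1000000):
    # The fraction of points inside the quarter circle approximates pi/4,
    # so multiplying by 4 yields an estimate of pi.
    return 4.0 * count_inside(samples) / samples

print(estimate_pi())

Because each call to count_inside is independent of the others, the total sample count can be divided among processes or cluster nodes and the partial counts summed at the end, which is exactly what makes this estimate such a convenient example for parallelization.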
