Chapter 8. matplotlib and Big Data

In the spirit of adapting established tools to new challenges, the last chapter saw us finding ways to work around the limitations of matplotlib on a single workstation. In this chapter, we will explore ways around some of the other limitations that matplotlib users may run up against when working on problems with very large datasets. This investigation will often bump up against the topic of clustering, but we will set those explorations aside for now. Lest you feel that a critical aspect of the problem domain is being ignored, take heart: clustering will be the primary focus of the next chapter.

The material in the final two chapters of this book attempts to provide readers with enough additional context to understand the origins of these technologies and their uses, and thus to apply them to their own computation, analysis, and, ultimately, plotting needs.

There are two major areas of the problem domain that we will cover in this chapter:

  • Preparing large data for use by matplotlib
  • Visualizing the prepared data

These are two distinct areas, each with its own engineering problems to solve and its own constraints within which matplotlib must function. We will take a look at several aspects of each area.

We will cover the following topics in this chapter:

  • Big data and its use in matplotlib
  • Working with large datasets
    • Data on local filesystems
    • Distributed data
  • Visualizing large datasets
    • Determining the limits of matplotlib
    • Working around the limits

To follow along with this chapter's code, clone the notebook's repository and start up IPython by executing the following commands in a shell:

$ git clone https://github.com/masteringmatplotlib/big-data.git
$ cd big-data
$ make

Big data

The term "big data" is semantically ambiguous due to the varying contexts to which it is applied and the motivations of the users applying it. The first question that may have occurred to you on seeing this chapter's title is "how is this applicable to matplotlib or even plotting in general?" Before we answer this question though, let's establish a working definition of big data.

The Wikipedia article on big data opens with the following informal definition:

"Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate."

This description is honest, as it admits that the definition is imprecise. It also implies that the definition may change from one context to another. The words "large" and "complex" are relative, and the term "traditional data processing" does not mean the same thing in different segments of the industry. In fact, different departments in a single organization may have widely varying data processing "traditions".

The canonical example of big data is related to its origins in web search. Google is generally credited with starting the big data movement with the publication of the paper "MapReduce: Simplified Data Processing on Large Clusters", by Jeffrey Dean and Sanjay Ghemawat. The paper describes the means by which Google was able to quickly search an enormous volume of textual data (crawled web pages and log files, for example) amounting, in 2004, to around 20 terabytes of data. In the decade that followed, more and more companies, institutions, and even individuals were faced with the need to quickly process datasets ranging in size from hundreds of gigabytes to multiple exabytes.
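To make the idea behind the paper's title concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases applied to a word count. It is only an illustration under stated assumptions: the function names (map_phase, shuffle_phase, reduce_phase) and the sample documents are hypothetical, and nothing here reflects Google's actual implementation, which distributes these phases across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a cluster would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input: each string stands in for one crawled page or log file.
documents = ["big data is big", "data is relative"]

# On a real cluster, each document could be mapped on a different machine.
mapped = chain.from_iterable(map_phase(doc) for doc in documents)
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

Because each call to map_phase depends only on its own document, and each key is reduced independently of the others, both phases can in principle be spread over many workers. That independence is what allows the approach to scale to very large inputs.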

Every scenario encompassed in this spectrum can be viewed as a big data-related problem. To a small business that used to manage hundreds of megabytes and now faces several orders of magnitude more data to analyze, 250 gigabytes is big data. For intelligence agencies storing information from untold data sources, even a few terabytes is a small amount of data. For them, hundreds of petabytes is big data.

For each organization, though, the general problem remains the same: what worked before on smaller datasets is no longer feasible. New methodologies are required, along with novel approaches to hardware usage, communication protocols, data distribution, search, analysis, and visualization, among many others.

Finally, no matter which methodologies are used to support a big data project, one of the last steps in most of them is the presentation of the analyzed data to human eyes. The audience can be anyone from a decision maker to an end user, but the need is the same: a visual representation of the data collected, searched, and analyzed. This is where tools such as matplotlib come into play.
