Chapter 5. High-level Plotting and Data Analysis

A significant part of gaining matplotlib mastery is familiarizing oneself with the other Python tools in the scientific programming ecosystem. Libraries such as NumPy, SciPy, Pandas, and SymPy are just the beginning. The tools available in the community cover an enormous amount of ground, spanning many fields and subspecialties. Projects such as scikit-learn, AstroPy, h5py, and so on, build upon the foundations provided by others, and are thus able to deliver more functionality sooner than if they had started from scratch.

Those who may want to look more deeply into these and other tools may benefit from a guided tour into one area that could serve as a template for future exploration into many other areas. This is our mission in this chapter, with our entry points for further examination being the following:

  • A background and overview of high-level plotting
  • Practical high-level plotting, using a data analysis example

We will use the term high-level plotting to describe things such as wrapping matplotlib functionality for use in new contexts, combining different libraries for particular plots that are not available in matplotlib, and visualization of complex datasets using wrapper functions, classes, and libraries.

The following high-level plotting topics will be covered in this chapter:

  • Historical background
  • matplotlib libraries, and high-level plotting
  • An introduction to the grammar of graphics
  • Libraries inspired by the grammar of graphics

When speaking of data analysis, our focus will be on the pragmatic aspects of parsing, grouping, and filtering various sources of data, and of applying computational workflows and statistical methods to them, all in the context of high-level data visualization. The topics we will cover in this area include:

  • Selected functions and methods from Pandas, SciPy, and NumPy
  • Examination and manipulation of a Pandas dataset
  • A tour of the various plots that are useful to have at one's fingertips when performing visualization tasks of a statistical nature

We have another IPython Notebook for you to work with while reading this chapter. You can clone the repository and run it with the following:

$ git clone https://github.com/masteringmatplotlib/high-level.git
$ cd high-level
$ make

Note

Some of the examples in this chapter use Pygraphviz, which needs the graphviz C header files to be present on the system. This is usually accomplished by installing graphviz and its development libraries, although you may need to edit Pygraphviz's setup.py to point to the location of the graphviz header files.

High-level plotting

In this book, when we mention high-level, we are referring not to some assessment of value or improvement over something else, but rather to layers of abstraction, or more precisely, layers of interaction. When engaged in high-level plotting, we expect that users and developers will create visualizations of complex data with fewer commands or steps than would be required using matplotlib's basic functionality directly. This is a result of complex tasks wrapping a greater number of smaller, simpler tasks.

Plotting is itself a high-level activity: raw data, and often calculations on that data, are combined and processed further in anticipation of user consumption, arranged or grouped in ways suitable for conveying the desired information, and then applied to some medium in a way that one hopes will yield greater insight. By our definition, each activity performed on the original raw data is, in some way, high-level.

Before we look at examples of modern high-level plotting, let us gain some perspective through examining the historical context by which we arrived at matplotlib and its ecosystem of related libraries.

Historical background

In 2005, the Princeton alumnus and professor of psychology, Michael Friendly, published the paper, Milestones in the History of Data Visualization: A Case Study in Statistical Historiography, which provided an excellent overview of data visualization and perhaps the first comprehensive summary of the entire development of visual thinking and the visual representation of data. Dr. Friendly's paper and his related and extraordinary work on the Milestones Timeline are the sources for the background presented in this section.

Flemish astronomer Michael Florent van Langren is credited with the first visual representation of statistical data: a 1644 graph of 12 contemporary estimates of the difference in longitude between Toledo and Rome.


Visual representation of statistical data of distances by Michael Florent van Langren, 1644

In 1669, Christiaan Huygens created the first graph of a continuous distribution function. A few years later, in 1686, Edmond Halley created a plot predicting barometric pressure versus altitude, which was derived from experimental observation.


Graph of pressure prediction by Edmond Halley, 1686

The first line graph and bar chart came in 1786, with the pie chart and circle graph following in 1801, all the work of the noted Scottish engineer and political economist William Playfair. This is commonly considered the birth of modern graphical methods in statistics:


First line graph by William Playfair, 1786

Though the intervening years did bring advances in visual representation, it wasn't until the mid-20th century that significant strides were made, both in the deeper understanding of visual methods themselves and in a growing set of tools arising from computing science. In 1968, the Macsyma (short for Project MAC's SYmbolic MAnipulator) project was started at MIT (short for Massachusetts Institute of Technology). Written in Lisp, this was the first comprehensive symbolic mathematics system created; many of its ideas were later adopted by software programs such as Mathematica and Maple. In 1976, the S programming language was invented at Bell Labs. It was from this language that the R programming language was derived, which has since gained fame as a highly respected platform for data analysis and visualization. In 1979, Chris Cole and Stephen Wolfram created the computer algebra system SMP, often considered version 0 of Mathematica. 1980 saw the first release of the Maple computer algebra system, with MATLAB arriving on the scene in 1984, and Mathematica in 1988.

It is this heritage to which matplotlib owes its existence: both the historical work done with ink and paper, and the advances made in software. At the time of matplotlib's genesis, the Python programming language was beginning to establish itself in the world of high-level languages, already seeing adoption in scientific computing applications. As we can see, the plotting that we do now has a richer and more diverse background than we might have initially imagined, one that provides the foundation for the lofty aspirations of high-level plotting.

matplotlib

As we have discussed, matplotlib provides mid-level access to the mechanics of plotting via the programmatic, object-oriented API of the backend and artist layers. One could then argue that the scripting layer, as represented by pyplot or the deprecated pylab, provides an API for high-level plotting. A better example would be any use of pyplot within matplotlib itself, employed as a means of providing a simple interface for a complex plotting task. It turns out that there is such an example in the codebase, and it occurs in the sankey.py file.

The Sankey diagram is named after Captain Matthew Henry Phineas Riall Sankey, who used it in 1898 to visually depict the thermal efficiency of steam engines. Sankey diagrams have been supported in matplotlib since 2010, when the matplotlib.sankey module was contributed by Yannick Copin and Kevin Davies. The following diagram, depicting a Rankine power cycle, is another example of a Sankey diagram:

[Figure: Sankey diagram of a Rankine power cycle]

Note

Though named after Captain Sankey, the first such diagram was created years earlier by French civil engineer Charles Joseph Minard.

In this module, pyplot is imported and used simply to generate a figure and an axes object. However, the impressive demo image is only fully appreciated when the contents of sankey.py are examined and one sees the extensive logic used to render these flow diagrams in matplotlib. The module not only uses pyplot, but also combines paths, patches, and transforms to give users the ability to generate plots containing these extraordinary diagrams: an excellent and concise example of high-level plotting.
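To get a feel for the module's high-level API, here is a minimal, self-contained sketch that uses matplotlib.sankey directly; the flow values and labels are made up for illustration, not taken from the module's demo:

```python
# A toy Sankey diagram via the high-level matplotlib.sankey API.
# The flows here are invented illustrative percentages (they must sum to 0).
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from matplotlib.sankey import Sankey

fig, ax = plt.subplots()
sankey = Sankey(ax=ax, unit="%")
sankey.add(flows=[100, -70, -30],
           labels=["input", "useful work", "losses"],
           orientations=[0, 0, -1])   # losses exit downward
sankey.finish()
ax.set_title("Toy thermal-efficiency Sankey diagram")
fig.savefig("toy_sankey.png")
```

One add call and a finish are all that is needed; the module takes care of constructing the paths and patches for the arrows behind the scenes.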

For the rest of the chapter, we will look at other libraries offering similar examples of wrapping matplotlib plotting functionality. Each of these accomplishes a great deal with considerably less effort than would be required if we were left to our own devices and had to use only matplotlib to produce the desired effect.

NetworkX

In Chapter 2, The matplotlib Architecture, we encountered the graph library NetworkX and used it in conjunction with matplotlib, something that NetworkX supports directly. Let us take a deeper look at this library from the perspective of our high-level plotting topic.

We'll start with a commented code sample and plot, and then go into more details. The following example is adapted from one in the NetworkX gallery by Aric Hagberg of Los Alamos National Laboratory:

In [5]: import sys
        sys.path.append("../lib")
        import lanl

        # Set up the plot's figure instance
        plt.figure(figsize=(14,14))

        # Generate the data graph structure representing
        # the route relationships
        G = lanl.get_routes_graph(debug=True)

        # Perform the high-level plotting operations in
        # NetworkX
        pos = nx.graphviz_layout(G, prog="twopi", root=0)
        nx.draw(G, pos,
                node_color=[G.rtt[v] for v in G],
                with_labels=False,
                alpha=0.5,
                node_size=50,
                cmap=cmap)

        # Update the ranges
        xmax = 1.02 * max(xx for xx, _ in pos.values())
        ymax = 1.02 * max(yy for _, yy in pos.values())

        # Final matplotlib tweaks and rendering
        plt.xlim(0, xmax)
        plt.ylim(0, ymax)
        plt.show()

The following plot is the result of the preceding code:

[Figure: the LANL route graph rendered with NetworkX]

In Chapter 2, The matplotlib Architecture, we needed to employ some custom logic to refine graph relationships, which accounted for both the structure of the modules and the conceptual architecture of matplotlib. Dr. Hagberg had to do something similar when rendering the Internet routes from Los Alamos National Laboratory. We've put this code in the lanl module of this notebook's repository; that is where all the logic for converting the route data to graph relationships is defined.

We can see clearly from the code comments where the high-level plotting occurs:

  • The call to nx.graphviz_layout
  • The call to nx.draw

We can learn how NetworkX acts as a high-level plotting library by taking a look at these, starting with the layout function.

NetworkX supports several possible graph library backends, and to make this easier for the end user, some of its imports can be quite obscure. Let us find the location of the graphviz_layout function the easy way:

In [6]: nx.graphviz_layout
Out [6]: <function networkx.drawing.nx_agraph.graphviz_layout>

If you open that file (either in your virtual environment's site-packages directory or on GitHub), you can see that graphviz_layout wraps the pygraphviz_layout function. From there, we see that NetworkX converts pygraphviz's node data structure into something general that can be used by all NetworkX backends. At this point, we're already several layers deep in NetworkX's high-level API internals. Let us continue:
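When a library layers wrappers like this, Python's inspect module can help with the spelunking. The following sketch uses two hypothetical stand-in functions (not actual NetworkX internals, which do not necessarily use functools.wraps) to show how a wrapper can be chased back down to its implementation:

```python
# Tracing a wrapped function back to its implementation, in the spirit of
# following graphviz_layout down to pygraphviz_layout. Both functions here
# are invented stand-ins for illustration.
import functools
import inspect

def pygraphviz_layout_stub(graph):
    """Stand-in for a backend-specific layout implementation."""
    return {node: (0.0, 0.0) for node in graph}

@functools.wraps(pygraphviz_layout_stub)
def graphviz_layout_stub(graph):
    """Stand-in for the thin public wrapper around the backend."""
    return pygraphviz_layout_stub(graph)

# functools.wraps records the inner function in __wrapped__, so
# inspect.unwrap can walk the chain back to the implementation
implementation = inspect.unwrap(graphviz_layout_stub)
print(implementation is pygraphviz_layout_stub)
```

Combined with inspect.getsourcefile, this kind of check makes it quick to find where the real work in a high-level API actually happens.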

In [7]: nx.draw
Out [7]: <function networkx.drawing.nx_pylab.draw>

nx_pylab gives us a nice hint that we're getting closer to matplotlib itself. In fact, the draw function makes direct use of matplotlib.pyplot in order to achieve the following:

  • Get the current figure from pyplot or, if an axes object was passed in, use that axes' figure
  • Hold and un-hold the matplotlib figure
  • Call matplotlib draw functions

It also makes subsequent calls to the NetworkX graph backend to draw the actual edges and nodes. These additional calls get node, edge, and label data and make further calls to matplotlib draw functions. We have to do none of this; we simply call nx.draw with the appropriate parameters. Such are the benefits of high-level plotting!
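As a rough illustration of the work nx.draw saves us, here is a pure-matplotlib sketch of the same kind of rendering: scatter the node positions, then draw the edges as a single line collection. The toy graph and layout here are hand-made, not NetworkX output, and this is only a sketch of the approach, not NetworkX's actual code:

```python
# Hand-rolling the node/edge drawing that nx.draw performs for us.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

pos = {0: (0, 0), 1: (1, 0), 2: (0.5, 1)}   # node -> (x, y), as a layout returns
edges = [(0, 1), (1, 2), (2, 0)]

fig, ax = plt.subplots()
ax.set_axis_off()                            # nx.draw hides the axes, too

# nodes: one scatter call, which supports per-node colors and sizes
xs, ys = zip(*(pos[n] for n in pos))
ax.scatter(xs, ys, s=50, alpha=0.5)

# edges: a single LineCollection is far cheaper than many ax.plot calls
segments = [(pos[u], pos[v]) for u, v in edges]
ax.add_collection(LineCollection(segments, alpha=0.5))

fig.savefig("toy_graph.png")
```

Even for this three-node triangle, the bookkeeping is noticeable; for a graph of Internet routes, wrapping it all behind one call is a clear win.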

Pandas

The following example is from a library whose purpose is to provide Python users and developers with extensive support for high-level data analysis. Pandas offers several high-performance data structures for this purpose, built in large part around the NumPy scientific computing library.

How does this relate to high-level plotting? In addition to providing its Series, DataFrame, and Panel data structures, Pandas incorporates plotting functionality into some of them as well.

Let us take a look at an example where we generate some random data and then utilize the plot function made available on the DataFrame object. We'll start by generating some random data samples:

In [8]: from scipy.stats import norm, rayleigh

        a = rayleigh.rvs(loc=5, scale=2, size=1000) + 1
        b = rayleigh.rvs(loc=5, scale=2, size=1000)
        c = rayleigh.rvs(loc=5, scale=2, size=1000) - 1

With these, we can populate our Pandas data structure:

In [9]: data = pd.DataFrame(
            {"a": a, "b": b, "c": c},
            columns=["a", "b", "c"])

And then view it via a call in IPython:

In [10]: axes = data.plot(
            kind="hist", stacked=True, bins=30,
            figsize=(16, 8))
         axes.set_title("Fabricated Wind Speed Data",
            fontsize=20)
         axes.set_xlabel("Mean Hourly Wind Speed (km/hr)",
            fontsize=16)
         _ = axes.set_ylabel("Velocity Counts", fontsize=16)

The following plot is the result of the preceding code:

[Figure: stacked histogram of the fabricated wind speed data]

Let us go spelunking in the Pandas source to get a better understanding of how Pandas does this. In the Pandas source code directory, open the file pandas/core/frame.py. This is where the DataFrame object is defined. If you search for DataFrame.plot, you will see that plot is actually an attribute of DataFrame, not a defined method. Furthermore, the code for the plot implementation is in pandas.tools.plotting.plot_frame.

After opening that module's file, search for def plot_frame. What we see here is a short chain of functions that handle all sorts of configuration and options for us, allowing us to easily use a plot method on the data structure. The Pandas developers have very kindly returned the matplotlib result of the plot call (the top-level axes object) so that we may work with it in the same way as other matplotlib results.
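We can verify that last point with a quick, hedged check. Run here against a modern Pandas (whose internal module layout differs from the book-era pandas.tools.plotting described above), it confirms that what plot hands back is an ordinary matplotlib axes we can keep tweaking:

```python
# DataFrame.plot returns a matplotlib Axes, so all the usual
# matplotlib methods remain available on the result.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(10), "b": np.arange(10) ** 2})
ax = df.plot(kind="line")                 # the high-level call...
ax.set_title("Still a matplotlib Axes")   # ...and ordinary matplotlib after
print(type(ax).__name__)                  # an Axes (sub)class
```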

We're going to shift gears a bit now, and take a new look at high-level plotting. It is what many consider to be the future of data visualization: the grammar of graphics.

The grammar of graphics

In the section covering the historical background of plotting, we briefly referred to the rebirth of data visualization in the mid-20th century. One of the prominent works of this period was by the famed French cartographer Jacques Bertin, author of the Sémiologie Graphique. Published in 1967, this was the first significant work dedicated to identifying the theoretical underpinnings of visualized information. In 1977, John Tukey, the American mathematician known for co-creating one of the most common FFT algorithms, published the book Exploratory Data Analysis, in which he introduced the world to the box plot. The method of data analysis described in this work inspired development in the S programming language, which later carried over to the R programming language. This work allowed statisticians to better identify trends and recognize patterns in large datasets. Dr. Tukey set the data visualization world on its current course by advocating the examination of the data itself as a path to insight. The next big leap in the visualization of data for statistical analysis came with the 1985 publication of the book, The Elements of Graphing Data, William S. Cleveland, Hobart Press, representing 20 years of active research and scientific journal publication.

Thirty-two years after Jacques Bertin's seminal work, during which time leaders of the field had been pursuing meta-graphical concepts, Leland Wilkinson published the book, The Grammar of Graphics, Springer Publishing, which, as had been the case with Dr. Cleveland, was the culmination of a combined background of academic research and teaching and professional experience developing statistical software platforms. The first software implementation inspired by this book was SPSS's nViZn (pronounced as envision). This was followed by R's ggplot2 and, in the Python world, Bokeh, among others.

But what is this grammar? What did three decades of intensive research and reflection reveal about the nature of data visualization and plotting of statistical results? In essence, the grammar of graphics did for the world of statistical data plotting and visualization what design patterns did for a subset of programming, and a pattern language did for architecture and urban design. The grammar of graphics explores the space of data, its graphical representation, the human minds that view these, and the ways in which these are connected, both obviously and subtly. The book provides a conceptual framework for the cognitive analysis of our statistical tools and how we can make them better, allowing us to ultimately create visualizations that are more clear, meaningful, and reveal more of the underlying problem space.

A grammar such as this is not only helpful in providing a consistent framework for concisely describing and discussing plots, but it is of an inestimable value for developers who wish to create a well thought-out and logically structured plotting library. The grammar of graphics provides a means by which we can clearly organize components such as geometric objects, scales, or coordinate systems while relating this to both the data they will represent (including related statistical use cases) and a well-defined visual aesthetic.

Bokeh

One of the first Python libraries to explore the space of the grammar of graphics was the Bokeh project. In many ways, Bokeh views itself as a natural successor to matplotlib, offering its own vision of improvements in overall architecture, scalability to large datasets, APIs, and usability. In contrast to matplotlib, however, Bokeh focuses its attention on the web browser.

Since this is a matplotlib book, and not a Bokeh book, we won't go into too much detail, but it is definitely worth mentioning that Bokeh provides a matplotlib compatibility layer. It doesn't cover the complete matplotlib API a given project may use, but enough that one should be able to incorporate Bokeh into existing matplotlib projects quite easily.

The ŷhat ggplot

A few years ago, the ŷhat company open-sourced one of its projects: a clone of R's ggplot2 for Python. The developers at ŷhat wanted a Python API that matched ggplot2 so that they could move easily between the two.

A quick view of the project's web site shows the similarity with ggplot2:

[Figure: screenshot of the ggplot project's web site]

The comparison of the following two code samples shows the extraordinary similarity between R's ggplot2 and Python's ggplot. The code for R's ggplot2 is as follows:

library(ggplot2)

ggplot(movie_data, aes(year, budget)) +
  geom_line(colour='red') +
  scale_x_date(breaks=date_breaks('7 years')) +
  scale_y_continuous(labels=comma)

And the code for Python's ggplot:

from ggplot import *

ggplot(movie_data, aes('year','budget')) + 
    geom_line(color='red') + 
    scale_x_date(breaks=date_breaks('7 years')) + 
    scale_y_continuous(labels='comma')

A demonstration of how the Python ggplot provides a high-level experience for the developer is given by examining the matplotlib code necessary to duplicate the preceding ggplot code:

import matplotlib.pyplot as plt
from matplotlib.dates import YearLocator

tick_every_n = YearLocator(7)
x = movie_data.date
y = movie_data.budget
fig, ax = plt.subplots()
ax.plot(x, y, color='red')
ax.xaxis.set_major_locator(tick_every_n)
plt.show()

Here's an example of ggplot usage from the IPython Notebook for this chapter:

In [12]: import ggplot
         from ggplot import components, geoms, scales, stats
         from ggplot import exampledata

In [13]: data = exampledata.movies
         aesthetics = components.aes(x='year', y='budget')

         (ggplot.ggplot(aesthetics, data=data) +
          stats.stat_smooth(span=.15, color='red', se=True) +
          geoms.ggtitle("Movie Budgets over Time") +
          geoms.xlab("Year") +
          geoms.ylab("Dollars"))

Out[13]:

The following plot is the result of the preceding code:

[Figure: smoothed plot of movie budgets over time]

New styles in matplotlib

Not to be left behind, matplotlib has embraced the sensibilities of the ggplot world and has supported the ggplot style since its 1.4 release. You can view the available styles in matplotlib with the following:

In [20]: plt.style.available
Out[20]: ['ggplot', 'fivethirtyeight', 'dark_background',
          'grayscale', 'bmh']

To enable the ggplot style, simply do this:

In [21]: plt.style.use('ggplot')

Here's a comparison of several plots before and after enabling the style; the plots on the right are the ones using the ggplot style:

[Figure: plots rendered with the default style (left) and the ggplot style (right)]
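As an aside, a style need not be enabled globally. Here is a brief sketch of the context-manager form, which styles a single figure and then restores the defaults:

```python
# Applying a style temporarily with plt.style.context, leaving the
# global rc settings untouched once the block exits.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

before = plt.rcParams["axes.facecolor"]
with plt.style.context("ggplot"):
    fig, ax = plt.subplots()       # this figure picks up the ggplot look
    ax.plot([1, 2, 3], [1, 4, 9])
    fig.savefig("ggplot_styled.png")
after = plt.rcParams["axes.facecolor"]
print(before == after)             # the global style is untouched
```

This is handy in notebooks where one wants a single ggplot-styled figure without restyling every plot that follows.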

Seaborn

The development of Seaborn has been greatly inspired by the grammar of graphics, and R's ggplot in particular. The original goals of Seaborn were twofold: to make computationally-based research more reproducible, and to improve the visual presentation of statistical results.

This is further emphasized in the introductory material on the project site, with Seaborn's stated aim being to make visualization a central part of exploring and understanding data. Its goals are similar to those of R's ggplot2, though Seaborn takes a different approach: a combined imperative and object-oriented style with a focus on easy, straightforward construction of sophisticated plots.

The fact that Seaborn has achieved undeniable success in these aims is evident from the impressive example plots it provides, generated with relatively few lines of code. The notebook for this chapter shows several examples; we'll highlight just one here, the facet grid plot.

When you want to split up a dataset by one or more variables and then group subplots of these separated variables, you will probably want a facet grid. Another use case for the facet grid plot is when you need to examine repeated runs of an experiment to reveal potentially conditional relationships between variables. Below is a concocted instance of the latter from the Seaborn examples: it displays a generated dataset simulating repeated observations of walking behavior, examining the position at each step of a multi-step walk.

The following demo assumes that you have previously performed the following imports in the notebook to add this chapter's library to Python's search path:

In [18]: import sys
         sys.path.append("../lib")
         import mplggplot

With that done, let us run the demo:

In [25]: import seademo

         sns.set(style="ticks")
         data = seademo.get_data_set()
         grid = sns.FacetGrid(data, col="walk", hue="walk",
                              col_wrap=5, size=2)
         grid.map(plt.axhline, y=0, ls=":", c=".5")
         grid.map(plt.plot, "step", "position", marker="o", ms=4)
         grid.set(xticks=np.arange(5), yticks=[-3, 3],
                  xlim=(-.5, 4.5), ylim=(-3.5, 3.5))
         grid.fig.tight_layout(w_pad=1)

The following plot is the result of the preceding code:

[Figure: facet grid of the repeated walk observations]

There will be more Seaborn examples in the hands-on part of this chapter, where we will save huge amounts of time by using several very high-level Seaborn functions for creating sophisticated plots.

With this, we conclude our overview of high-level plotting with regard to the topic of the grammar of graphics in the Python (particularly matplotlib) world. Next, we will look at high-level plotting examples in the context of a particular dataset and various methods for analyzing trends in that data.
