Visualizing large data

The majority of this notebook has been dedicated to processing large datasets and plotting histograms. This was intentional: with that approach, the number of artists on the matplotlib canvas is limited to something on the order of hundreds, rather than the millions we would face if we tried to plot every element. In this section, we will address the problem of displaying the actual elements of large datasets, and we will return to the last HDF5 table for the remainder of the chapter.

As a refresher on the volume of data we're looking at, the number of data points in our dataset can be calculated in the following way:

In [45]: data_len = len(tab)
         data_len
Out[45]: 288000000

Again, our dataset has nearly one third of a billion points. That is almost certainly more than matplotlib can handle. In fact, one often sees comments online that warn users not to attempt plotting more than ten thousand or one hundred thousand points.

However, is this good advice? It might be better to advise users to switch to PyTables or numpy.memmap and then, with the data handled efficiently, make a recommendation about the upper limit on the number of points to plot. Let's use our data to establish a baseline for matplotlib's comfort zone.
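To give a flavor of the numpy.memmap option mentioned above, here is a minimal sketch; the filename and dtype are assumptions made for illustration only, not values from our dataset:

import numpy as np

# Hypothetical file of raw float64 readings; only the slices that are accessed
# get read from disk, so the full file never has to fit in memory.
temps = np.memmap("temps.dat", dtype="float64", mode="r")
print(len(temps))
print(temps[:10])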

Finding the limits of matplotlib

We're going to attempt plotting an increasing number of points from our dataset. We'll use the HDF5 table and start modestly with 1000 points, as follows:

In [46]: limit = 1000
         (figure, axes) = plt.subplots()
         axes.plot(tab[:limit]["temp"], ".")
         plt.show()

The following plot is the result of the preceding code:

[Figure: point plot of the first 1,000 temperature values]

The output was rendered very quickly—most likely under a second. Let's bump up our dataset size by an order of magnitude, as follows:

In [47]: limit = 10000
         (figure, axes) = plt.subplots()
         axes.plot(tab[:limit]["temp"], ".")
         plt.show()

The following plot is the result of the preceding code:

[Figure: point plot of the first 10,000 temperature values]

Again, that was very fast. There was no noticeable difference between this render and the previous one. Let's keep going by again increasing the order of magnitude, as follows:

In [48]: limit = 100000
         (figure, axes) = plt.subplots()
         axes.plot(tab[:limit]["temp"], ".")
         plt.show()

The following plot is the result of the preceding code:

[Figure: point plot of the first 100,000 temperature values]

At 100,000 points, you will start seeing a tiny delay. The previous code took about a second to render. This looks better than what we had been led to believe. Let's try a million and then ten million points, as follows:

In [49]: limit = 1000000
         (figure, axes) = plt.subplots()
         axes.plot(tab[:limit]["temp"], ".")
         plt.show()

The following plot is the result of the preceding code:

[Figure: point plot of the first 1,000,000 temperature values]

In [50]: limit = 10000000
         (figure, axes) = plt.subplots()
         axes.plot(tab[:limit]["temp"], ".")
         plt.show()

The following plot is the result of the preceding code:

[Figure: point plot of the first 10,000,000 temperature values]

One million points were rendered in about 2 to 3 seconds, which is impressive considering that we had been led to expect a limit of around 10,000! However, if we had to plot hundreds or thousands of datasets like these for a project, such a delay would be prohibitive. Ten million points took about 15 seconds, so that wouldn't be an option for even a moderate number of plots that need to be rendered in a timely manner.
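If you would like to measure these render times on your own hardware rather than eyeballing them, a small helper like the following can be used; time_render is a hypothetical function written for this sketch, and it assumes the tab handle from our earlier PyTables session:

import time

import matplotlib.pyplot as plt

def time_render(values):
    """Render a point plot of values to a PNG and return the elapsed seconds."""
    start = time.time()
    (figure, axes) = plt.subplots()
    axes.plot(values, ".")
    figure.savefig("render_test.png")  # force a full Agg render without opening a window
    plt.close(figure)
    return time.time() - start

for limit in (10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6):
    print(limit, time_render(tab[:limit]["temp"]))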

Agg rendering with matplotlibrc

If we use lines instead of points in our plot, we will hit another limit—the inability of the Agg backend to handle a large number of artists. We can see this when we switch from the preceding point plots to the matplotlib default of line plots, as follows:

In [51]: (figure, axes) = plt.subplots()
         axes.plot(tab[:10000000]["temp"])
         plt.show()
...
FormatterWarning: Exception in image/png formatter:
Allocated too many blocks
...
<matplotlib.figure.Figure at 0x160587240>

If you run into an error like this, it may be worth tweaking an advanced configuration value in the matplotlibrc file: agg.path.chunksize. By default, this is set to 0, which disables chunking; a recommended starting value is 20,000. Let's give this a try and then attempt to render again, as follows:

In [52]: mpl.rcParams["agg.path.chunksize"] = 20000
In [53]: (figure, axes) = plt.subplots()
         axes.plot(tab[:10000000]["temp"])
         plt.show()

The following plot is the result of the preceding code:

[Figure: line plot of 10,000,000 temperature values rendered with chunked Agg]

This feature was marked as experimental in 2008, and it remained so as of 2015. A warning to the user: enabling the Agg backend to plot in chunks instead of all at once may introduce visual artifacts into the plots. If you are just quickly checking your data in IPython, this might not be a concern. However, such artifacts could make the plots worthless for sharing experimental results in publications.

More practically though, we lucked out: 10 million line segments happened to be something that our backend could handle with the chunked approach. Another order of magnitude or so, and we would likely be back in the same situation. As dataset sizes grow beyond the capabilities of matplotlib, we must turn to other approaches.
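If you find yourself setting this value in every session, it can also be made persistent by adding a line to your matplotlibrc file; as a sketch, the snippet below only locates the active file for you, and the commented-out setting is what you would add to it by hand:

import matplotlib as mpl

# Report where the active matplotlibrc lives; adding the following line to that
# file makes the setting persist across sessions:
#
#     agg.path.chunksize : 20000
#
print(mpl.matplotlib_fname())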

Decimation

One way of preparing large datasets for rendering carries the unfortunate name of a brutal practice once employed by the Roman army against large groups guilty of capital offenses: the removal of every tenth man, more commonly known by its Latin name, decimation. We will use this term more generally in the book to mean the removal of whatever fraction of the data is needed to reach our desired performance, which, as it turns out, will be much more than a tenth.

As you may have noticed in the preceding exploration, we couldn't spot any appreciable visual difference in the plots beyond 100,000 points. There are certainly some additional outliers that we can point to, but once that threshold is passed, any further structure is hidden by the sheer number of points.

If we want to limit our plot to 100,000 points but cover the entire spectrum of our dataset, we just need to divide the size of the dataset by the desired point number to calculate the decimation value, as follows:

In [54]: frac = math.floor(data_len / 100000)
         frac
Out[54]: 2880

Because PyTables uses NumPy arrays, we can take advantage of the array-slicing feature that lets us extract every nth value, data[::n]. Let's use this to plot a representation of the dataset across its entire spectrum, as follows:

In [55]: xvals = range(0, data_len, frac)
         (figure, axes) = plt.subplots()
         axes.plot(xvals, tab[::frac]["temp"], ".", alpha=0.2)
         plt.show()

The following plot is the result of the preceding code:

[Figure: decimated point plot of temperature values across the full dataset]

We also provided x values that match the skipping we did when selecting the y values. Had we not done this, the x axis would have ranged from 0 to 100,000. Instead, as you can see, it extends to about 300 million, reflecting our data's end at 288 million points.

When taking an approach like this, we need to keep in mind that we're essentially dumping data from our plot. Potentially important data points (such as significant outliers) might be removed in this process. Furthermore, depending on the distribution, statistical values may be altered. However, the most significant issue with this approach is that it has the potential to exaggerate the outliers that remain in the dataset. This form of distortion is known as aliasing, and there are filtering techniques that one can employ to minimize it.

If you are working with digital signals or periodic data, you may find the scipy.signal.decimate and scipy.signal.resample functions useful.
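As a brief, hedged sketch of the difference (using a synthetic one-dimensional NumPy array rather than our PyTables table), scipy.signal.decimate applies a low-pass anti-aliasing filter before downsampling, whereas plain slicing simply drops samples:

import numpy as np
from scipy import signal

# A synthetic signal standing in for real data.
samples = np.sin(np.linspace(0, 200 * np.pi, 1000000)) + 0.1 * np.random.randn(1000000)

naive = samples[::10]                    # every 10th sample, no filtering (can alias)
filtered = signal.decimate(samples, 10)  # low-pass filter first, then downsample
print(naive.shape, filtered.shape)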

Additional techniques

Even with an approach as simple as decimation, we need to consider applying filters. Depending on one's data, there are a number of additional techniques that one may utilize to make large datasets more digestible. Data can be quantized or binned. In particular, we took advantage of binning earlier in this chapter by using histogram plots, thus sidestepping the need to render plots with massive numbers of points. Similarly, matplotlib, Seaborn, and several other libraries offer heat maps and hexbin plots. When applied intelligently, these features can provide invaluable insights without the need to display every single point from a dataset.
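As a rough sketch of the hexbin idea (reusing the tab, data_len, and frac names from the earlier decimation example, so the column name and sizes are carried over from our table rather than guaranteed to match your data), point density is binned into hexagonal cells instead of drawing every marker:

import numpy as np
import matplotlib.pyplot as plt

xvals = np.arange(0, data_len, frac)
temps = tab[::frac]["temp"]

(figure, axes) = plt.subplots()
counts = axes.hexbin(xvals, temps, gridsize=50)  # bin point density into hexagons
figure.colorbar(counts, ax=axes, label="points per bin")
plt.show()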
