Interactive Visual Analysis of Big Data 63
[Figure 5.2: panels plotting Rating against Log (votes). Left column: Total opacity (0.00–1.00) versus Overplotting amount (log scale, 10–1000), for Alpha = 0.01 and Alpha = 0.005. Middle column: semitransparent scatterplots at each alpha. Right column: binned counts (Count color scale, 250–750).]
FIGURE 5.2
Overplotting becomes a problem in the visualization of large datasets, and using
semitransparent shapes is not an effective solution. The mathematics of transparency and
its perception work against us. Most of the perceptual range (the set of different possible
total opacities) is squeezed into a small range of the overplotting amount, and this range
depends on the specific per-point opacity that is chosen (left column). As a result, different
opacities highlight different parts of the data, but no single opacity is appropriate (middle
column). Color mapping via bin counts, however, is likely to work better (right column).
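The caption's claim about transparency can be made concrete. If each point is drawn with per-point opacity a, a pixel overplotted by n points reaches a total opacity of 1 − (1 − a)^n. The following sketch (an illustration of this formula, not code from the chapter; the function names are our own) shows how quickly the visible range saturates:

```python
import math

def total_opacity(alpha: float, n: int) -> float:
    """Cumulative opacity of a pixel covered by n points, each with opacity alpha."""
    return 1.0 - (1.0 - alpha) ** n

def points_to_reach(alpha: float, target: float) -> int:
    """Smallest overplotting count at which the pixel reaches `target` total opacity."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - alpha))
```

For a = 0.01, the pixel climbs from 5% to 95% total opacity between roughly 6 and 299 overlapping points; for a = 0.005, that window shifts to roughly 11 to 598. Almost the entire perceptual range is thus spent on a narrow, alpha-dependent band of overplotting amounts, which is exactly the problem the left column of the figure illustrates.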
Finally, any single visualization will very likely not be enough, again because there
are simply too many observations. Here, interactive visualization is our current best bet. As
Shneiderman puts it: overview first, zoom and filter, then details on demand [20]. For big data
visualizations, fast interaction, filtering, and querying of the data are highly recommended,
under the same latency constraints that apply to visualization updates.
Besides binning, the other basic strategy for big data visualization is sampling. Here, the
insight is similar to that of resampling techniques such as the jackknife and the bootstrap [7]:
we can draw samples from a dataset of observations as if the dataset were an
actual exhaustive census of the population, and still obtain meaningful results. Sampling-
based strategies for visualization have the advantage of providing a natural progressive
representation: as new samples are added to the visualization, it draws closer to a
visualization of the full population.
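As a minimal illustration of that progressive behavior (a sketch with synthetic data, not code from the chapter), the following accumulates successive random batches; any summary of the growing sample, here the mean as a stand-in for a rendered view, converges to the summary of the full dataset:

```python
import random
import statistics

def progressive_samples(data, batch_size, seed=0):
    """Yield ever-growing random samples of `data`, one batch at a time,
    mimicking a progressive visualization that refines as samples arrive."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    for end in range(batch_size, len(shuffled) + 1, batch_size):
        yield shuffled[:end]

# Illustrative synthetic data: as the sample grows, its mean approaches
# the mean of the full dataset, and the error shrinks toward zero.
rng = random.Random(42)
population = [rng.gauss(5.0, 1.0) for _ in range(10_000)]
target = statistics.mean(population)
errors = [abs(statistics.mean(sample) - target)
          for sample in progressive_samples(population, 1_000)]
```

The final batch contains the whole dataset, so the last error is zero; the intermediate errors show how the approximation tightens as samples accumulate.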
This is roughly our roadmap for this chapter, then. First, we will see how to bin datasets
appropriately for exploratory analysis and how to turn those bins into visual primitives that
are pleasing and informative. Then we will briefly discuss techniques based on sampling, and
finally we will discuss how exploratory modeling fits into this picture. Table 5.1 provides a
summary of the techniques we will cover in this chapter.
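To preview the binning idea, here is a minimal sketch (the function name and grid parameters are illustrative, not from the chapter) of the count-based aggregation that underlies these techniques: raw points are reduced to per-cell counts, and the counts, not the points, are what get mapped to color.

```python
from collections import Counter

def bin2d(xs, ys, x0, x1, y0, y1, nx, ny):
    """Count observations falling in each cell of an nx-by-ny grid over
    [x0, x1) x [y0, y1). Assumes all points lie within the given range."""
    counts = Counter()
    for x, y in zip(xs, ys):
        # Clamp the top edge so x == x1 (or y == y1) lands in the last cell.
        i = min(int((x - x0) / (x1 - x0) * nx), nx - 1)
        j = min(int((y - y0) / (y1 - y0) * ny), ny - 1)
        counts[(i, j)] += 1
    return counts
```

One full scan produces a summary whose size depends only on the grid resolution, not on the number of observations, which is what makes bin-based color mapping viable at scale.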
With the material presented here, you should be able to apply current state-of-the-art
techniques to visualize large-scale datasets, and also to understand in what contexts the
available technology falls short. The good news is that, for one specific type of approach,
many open-source techniques are available. The techniques we present in this chapter have
one main idea in common: if repeated full scans of the dataset are too slow to be practical,
then we need to examine the particulars of our setting and extract additional structure.
Specifically, we now have to bring the visualization requirements into our computation and
analysis infrastructure, instead of making visualization a separate concern handled via
regular SQL queries or CSV flat files. One crucial bit of additional structure is that of the
visualization technique itself. Although not explored to its full generality in any of the
techniques presented