66 Handbook of Big Data
step for imMens computes all possible sums where the All placeholder appears in at
least n − 4 of the columns. In other words, imMens’s partial data cubes are at most four
dimensional. There is a very good reason for this particular choice: imMens was built to
enable extremely fast brushing as a form of interactive exploration [4]. In this mode, a user
moves the mouse over one particular two-dimensional projection of a dataset, and
the position of the mouse implies a restriction on the values the user is interested in.
So that is one restricted dimension for the X position and another dimension for the Y
position. Now, for every other plot in a brushed multidimensional histogram, we want to
compute the sum of events in any particular bin. Those are, potentially, two extra dimensions
for a total of four. Because most of the interactive exploration in scatterplots happens with
one active brush, four-dimensional tables are exactly what imMens needs to answer these queries. When
higher levels of aggregation are necessary (for storage reasons, sometimes imMens only
computes three-dimensional tables), imMens uses the GPU parallel power to perform this
aggregation quickly. As a result, after the data tiles are loaded on the client, imMens can
sustain interaction rates of 50 frames per second for datasets with hundreds of millions to
billions of events.
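
The brushing computation described above can be sketched with a small dense array. This is a minimal illustration in Python with NumPy, not imMens's actual implementation (which runs on the GPU in the browser). The synthetic events, the bin count, and the function name `brush` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 10,000 events with four binned attributes
# (x, y for the brushed plot; u, v for a second, linked plot), 8 bins each.
n_bins = 8
events = rng.integers(0, n_bins, size=(10_000, 4))

# Precompute one four-dimensional table of counts (a partial data cube).
cube = np.zeros((n_bins,) * 4, dtype=np.int64)
np.add.at(cube, tuple(events.T), 1)

def brush(cube, x_range, y_range):
    """Return the (u, v) histogram restricted to the brushed x/y bin ranges.

    Ranges are half-open intervals of bin indices: (start, stop).
    """
    x0, x1 = x_range
    y0, y1 = y_range
    # Restrict the two brushed dimensions, then sum them away.
    return cube[x0:x1, y0:y1].sum(axis=(0, 1))

# Brush x bins 2..4 and y bins 0..2; the result redraws the linked plot.
linked = brush(cube, (2, 5), (0, 3))
```

Each brush movement touches only the precomputed table, never the raw events, so the cost of an update depends on the number of bins rather than the number of events.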
Another defining characteristic of imMens’s storage is that its binning scheme is dense:
bins in which no events are stored take as much space as bins in which many events are
stored. This restricts the effective resolution of imMens’s binning scheme and is a significant
limitation. Many datasets of interest have features at multiple scales: in time series analysis,
for example, features can happen at week-long scales or at hour-long scales.
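
To make the cost of dense binning concrete, the sketch below compares the number of bins a dense scheme would allocate with the number of bins that actually contain events. The dataset is synthetic and entirely hypothetical: events are clustered in a few bursts over ten years and binned by the hour.

```python
from collections import Counter

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical events: 5,000 hourly timestamps over ten years,
# clustered around 20 bursts of activity.
hours_in_decade = 10 * 365 * 24
centers = rng.integers(0, hours_in_decade, size=20)
offsets = rng.normal(0, 12, size=5_000)
raw = centers[rng.integers(0, 20, size=5_000)] + offsets
timestamps = np.clip(raw.astype(int), 0, hours_in_decade - 1)

# A dense scheme allocates every hourly bin, occupied or not.
dense_bins = hours_in_decade

# A sparse scheme stores only the bins that hold at least one event.
sparse = Counter(timestamps.tolist())
occupied_bins = len(sparse)
```

In this setup `occupied_bins` can never exceed the number of events, while `dense_bins` grows with the resolution alone. Refining the resolution therefore inflates a dense table even when the data stay the same.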
Nanocubes, by contrast, uses a sparse, multilevel scheme for its address space, essentially
computing a nested sequence of $2^d$-ary trees (binary trees for single-dimensional values,
quad-trees for spatial values, etc.) [15]. In order to avoid the obvious exponential blowup,
nanocubes reuse large portions of the allocated data structures: the algorithm detects when
further refinements of a query do not change the output sets and shares those portions
across the data structure. As a simplified example, consider that every query for “rich men
in Seattle” will include “former CEOs of Microsoft” Bill Gates. It makes no sense, then, to
store separate results for “rich men in Seattle AND former CEOs of Microsoft” and “rich
men in Seattle”: the two result sets are identical. The construction algorithm for nanocubes
is essentially a vast generalization of this kind of rule and results in large storage gains.
Specifically, datasets that would take petabytes of storage to precompute in dense schemes
can be stored in a nanocube in a few tens of gigabytes.
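
The sharing rule can be illustrated with a toy index. This is not the nanocube construction algorithm: it swaps in a much simpler technique, deduplicating summaries by the set of events they cover, only to show why identical result sets need to be stored once. The events, attribute names, and the function `build_index` are hypothetical.

```python
# Hypothetical events: (city, wealth, role).
events = [
    ("Seattle", "rich", "former Microsoft CEO"),
    ("Seattle", "modest", "teacher"),
    ("Portland", "rich", "investor"),
    ("Portland", "rich", "founder"),
]

def build_index(events):
    """Map every query prefix to a summary, storing each distinct result set once."""
    store = {}  # frozenset of event ids -> shared summary object
    index = {}  # query prefix (tuple of attribute values) -> shared summary
    n_attributes = len(events[0])
    for depth in range(1, n_attributes + 1):
        # Group event ids by the first `depth` attribute values.
        groups = {}
        for i, event in enumerate(events):
            groups.setdefault(event[:depth], set()).add(i)
        for prefix, ids in groups.items():
            key = frozenset(ids)
            if key not in store:
                store[key] = {"count": len(ids)}
            # A refinement that selects the same events reuses the summary.
            index[prefix] = store[key]
    return index, store

index, store = build_index(events)

# The refined query points at the very same stored object.
shared = index[("Seattle", "rich")] is index[("Seattle", "rich", "former Microsoft CEO")]
```

Here nine distinct query prefixes are answered from six stored summaries. An actual nanocube does not materialize sets of event identifiers like this. It detects the sharing opportunity structurally while the trees are built.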
Both nanocubes and imMens expose their application program interface (API) via web-
based visualization. Compared to imMens, the main disadvantage of nanocubes is that it
requires a special server process to answer its queries. imMens, by contrast, preprocesses the
dataset into data tile files, which are served via a regular web server; all of the subsequent
computation happens on the client side. For deployment purposes, this is a very favorable
setup.
In addition, both nanocubes and imMens (in their currently available implementations)
share a limitation in the format of the data stored in their multivariate bins: these techniques
only store event counts. Wickham’s bigvis, by contrast, can store (slightly) more
sophisticated statistics of the events in its bins [24]. As a result, bigvis allows users to
easily create multivariate histogram plots where each bin stores a sample mean (or sample
variance, or other such simple statistics). This is a significant improvement in flexibility, particularly
during exploratory analysis, where better statistics might help us find important
patterns in the data.
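
Per-bin statistics of this kind can be accumulated from a few running sums. The sketch below uses Python with NumPy and does not reproduce the bigvis API (bigvis is an R package). The variables `x` and `z`, the bin count, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: x is the variable being binned, z is the variable summarized.
x = rng.uniform(0, 10, size=1_000)
z = 3.0 * x + rng.normal(0, 1, size=1_000)

# Assign each event to one of five equal-width bins over [0, 10).
n_bins = 5
bin_ids = np.minimum((x / 10 * n_bins).astype(int), n_bins - 1)

# Three running sums per bin are enough for the mean and the variance.
count = np.bincount(bin_ids, minlength=n_bins)
total = np.bincount(bin_ids, weights=z, minlength=n_bins)
total_sq = np.bincount(bin_ids, weights=z * z, minlength=n_bins)

mean = total / count
# Sample variance (n - 1 denominator); requires at least two events per bin.
variance = (total_sq - count * mean**2) / (count - 1)
```

Storing the count, the sum, and the sum of squares keeps the bins additive, so two bins can still be merged by adding their entries. The sum-of-squares formula can lose precision when the mean is large relative to the spread, which a production implementation would need to guard against.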
At the same time, bigvis does not employ any precomputation strategy. It relies on
the speed of current multicore processors in desktops and workstations, and every time
the user changes the axes used for plotting, bigvis recreates the addressing scheme and
rescans the data. Unlike the one for imMens and nanocubes, the addressing scheme in