step for imMens computes all possible sums where the All placeholder appears in at least n − 4 of the columns. In other words, imMens's partial data cubes are at most four
dimensional. There is a very good reason for this particular choice: imMens was built to
enable extremely fast brushing as a form of interactive exploration [4]. In this mode, a user
moves his or her mouse over one particular two-dimensional projection of a dataset, and the position of the mouse implies a restriction on the values of interest: one restricted dimension for the X position and another for the Y position. Now, for every other plot in a brushed multidimensional histogram, we want to
compute the sum of events in any particular bin. Those are, potentially, two extra dimensions
for a total of four. Because most of the interactive exploration in scatterplots happens with
one active brush, the representation in imMens matches this access pattern precisely. When higher levels of aggregation are necessary (for storage reasons, imMens sometimes only computes three-dimensional tables), imMens uses the GPU's parallel power to perform this aggregation quickly. As a result, after the data tiles are loaded on the client, imMens can
sustain interaction rates of 50 frames per second for datasets with hundreds of millions to
billions of events.
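To make the dimension counting above concrete, here is a minimal sketch, in Python with NumPy, of how a dense four-dimensional count cube answers brushed queries: the brushed (X, Y) bin indexes two dimensions, and the remaining two dimensions form the linked histogram. The dimension names, bin counts, and synthetic events are illustrative assumptions, not imMens's actual data tile format.

```python
# A minimal sketch of a dense, binned four-dimensional count cube in the
# spirit of imMens's data tiles.  Dimension names, bin counts, and the
# synthetic events are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_events = 1_000_000
shape = (64, 64, 24, 7)          # e.g. (lon_bin, lat_bin, hour_bin, day_bin)
events = np.column_stack([rng.integers(0, s, n_events) for s in shape])

# One pass over the events builds the dense partial cube; queries never
# touch the raw data again.
cube = np.zeros(shape, dtype=np.int64)
np.add.at(cube, tuple(events.T), 1)

def linked_histogram(cube, brush_xy=None):
    """2-D histogram over dimensions (2, 3), optionally restricted to one
    brushed (x, y) bin of dimensions (0, 1)."""
    if brush_xy is None:
        return cube.sum(axis=(0, 1))   # unbrushed view: marginal over (0, 1)
    x0, y0 = brush_xy
    return cube[x0, y0]                # brushed view: a constant-size slice

print(linked_histogram(cube, brush_xy=(10, 20)).shape)   # (24, 7)
```

Because the brushed query reduces to reading a constant-size slice of the precomputed cube, its cost does not depend on the number of underlying events, which is what makes such high interaction rates possible.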
Another defining characteristic of imMens’s storage is that its binning scheme is dense:
bins in which no events are stored take as much space as bins in which many events are
stored. This caps the effective resolution of imMens's binning scheme and is a significant
limitation. Many datasets of interest have features at multiple scales: in time series analysis,
for example, features can happen at week-long scales or at hour-long scales.
Nanocubes, by contrast, uses a sparse, multilevel scheme for its address space, essentially
computing a nested sequence of 2^d-ary trees (binary trees for single-dimensional values,
quad-trees for spatial values, etc.) [15]. In order to avoid the obvious exponential blowup,
nanocubes reuses large portions of the allocated data structures: the algorithm detects when further refinements of a query do not change the output set and shares those portions
across the data structure. As a simplified example, consider that every query for “rich men
in Seattle” will include “former CEO of Microsoft” Bill Gates. It makes no sense, then, to
store separate results for “rich men in Seattle AND former CEOs of Microsoft” and “rich
men in Seattle”: the two result sets are identical. The construction algorithm for nanocubes
is essentially a vast generalization of this kind of rule and results in large storage gains.
Specifically, datasets that would take petabytes of storage to precompute in dense schemes
can be stored in a nanocube in a few tens of gigabytes.
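The following toy sketch illustrates the sharing rule described above using the Bill Gates example. The dictionary-based store, the predicates, and the tiny event table are made up for illustration and do not resemble the actual pointer-sharing implementation in nanocubes [15]; they only convey the idea that identical result sets are stored once.

```python
# A toy illustration of the storage-sharing rule described above: if two
# query refinements select exactly the same set of events, store their
# aggregate only once and point both at it.  (Nanocubes implements this
# as pointer sharing inside its nested trees [15]; this dictionary-based
# cache is only meant to convey the idea.)
events = [
    {"city": "Seattle", "rich": True,  "ex_microsoft_ceo": True},   # Bill Gates
    {"city": "Seattle", "rich": False, "ex_microsoft_ceo": False},
    {"city": "Medina",  "rich": True,  "ex_microsoft_ceo": False},
]

store = {}          # frozenset of event ids -> stored aggregate (here, a count)
links = {}          # query description -> key into `store`

def register(query_name, predicate):
    ids = frozenset(i for i, e in enumerate(events) if predicate(e))
    if ids not in store:                 # only pay for genuinely new result sets
        store[ids] = len(ids)
    links[query_name] = ids

register("rich in Seattle",
         lambda e: e["rich"] and e["city"] == "Seattle")
register("rich in Seattle AND ex-Microsoft CEO",
         lambda e: e["rich"] and e["city"] == "Seattle" and e["ex_microsoft_ceo"])

# Both queries resolve to the same stored aggregate: one entry, two links.
assert links["rich in Seattle"] == links["rich in Seattle AND ex-Microsoft CEO"]
print(len(store), "stored result sets for", len(links), "queries")
```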
Both nanocubes and imMens expose their application programming interface (API) via web-
based visualization. Compared to imMens, the main disadvantage of nanocubes is that it
requires a special server process to answer its queries. imMens, however, preprocesses the
dataset into data tile files, which are served via a regular web server; all of the subsequent
computation happens on the client side. For deployment purposes, this is a very favorable
setup.
In addition, both nanocubes and imMens (in their currently available implementations)
share a limitation in the format of the data stored in their multivariate bins: these tech-
niques only store event counts. Wickham’s bigvis, by contrast, can store (slightly) more sophisticated statistics of the events in its bins [24]. As a result, bigvis allows users to
easily create multivariate histogram plots where each bin stores a sample mean (or sample
variance, or other such simple statistics). This is a significant improvement in flexibility, par-
ticularly during exploratory analysis, where better statistics might help us find important
patterns in the data.
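As an illustration of what slightly richer bins can look like, the sketch below keeps a count, a sum, and a sum of squares per bin, from which a per-bin sample mean and variance can be read off. The fixed-width binning and the function names are assumptions made for this example, not bigvis's API.

```python
# A sketch of bins that store simple sufficient statistics (count, sum,
# sum of squares) rather than counts alone, in the spirit of bigvis [24].
# The fixed-width binning and the names are illustrative assumptions.
from collections import defaultdict

def bin_index(x, origin, width):
    return int((x - origin) // width)

def condense(xs, zs, origin=0.0, width=1.0):
    """Group values z by bins of x; each bin keeps (n, sum, sum_sq)."""
    bins = defaultdict(lambda: [0, 0.0, 0.0])
    for x, z in zip(xs, zs):
        b = bins[bin_index(x, origin, width)]
        b[0] += 1
        b[1] += z
        b[2] += z * z
    return bins

def bin_mean(b):
    n, s, _ = b
    return s / n

def bin_variance(b):
    n, s, ss = b
    return ss / n - (s / n) ** 2          # population variance within the bin

bins = condense([0.2, 0.7, 1.5, 1.9], [10.0, 14.0, 3.0, 5.0])
print({k: (bin_mean(v), bin_variance(v)) for k, v in bins.items()})
```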
At the same time, bigvis does not employ any precomputation strategy. It relies on
the speed of current multicore processors in desktops and workstations, and every time
the user changes the axes used for plotting, bigvis recreates the addressing scheme and
rescans the data. Unlike the one for imMens and nanocubes, the addressing scheme in
bigvis is exceedingly simple, and thus scanning through the dataset is relatively fast. In
addition, the richer statistics stored in bigvis’s bins enable it to provide statistical modeling
features completely absent in nanocubes and imMens. It is possible, for example, to do
linear regressions and LOESS curve fits over bigvis’s data structures.
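To show how modeling can run directly on such bins, here is a hedged sketch of a weighted least-squares line fit that treats each bin's center, mean response, and count as one weighted observation. It is only an illustration of the idea, not how bigvis implements its regressions or LOESS fits.

```python
# A sketch of fitting a weighted least-squares line to binned data: each
# bin contributes its center, its mean response, and its count as a
# weight.  This illustrates modeling on top of bins like the ones above;
# it is not bigvis's implementation.
import numpy as np

def fit_line_to_bins(centers, means, counts):
    X = np.column_stack([np.ones_like(centers), centers])   # intercept + slope
    W = np.asarray(counts, dtype=float)
    # Solve the weighted normal equations (X^T W X) beta = X^T W y.
    XtWX = X.T @ (W[:, None] * X)
    XtWy = X.T @ (W * np.asarray(means, dtype=float))
    return np.linalg.solve(XtWX, XtWy)

centers = np.array([0.5, 1.5, 2.5, 3.5])
means   = np.array([1.1, 2.9, 5.2, 7.1])
counts  = np.array([100, 80, 60, 10])
intercept, slope = fit_line_to_bins(centers, means, counts)
print(f"y ~= {intercept:.2f} + {slope:.2f} x")
```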
It should be noted that data cubes do allow for these more general statistics. Specifi-
cally, their data can live in any commutative monoid, and there has been some recent work
exploring connections between algebraic structures and allowable statistical and machine
learning algorithms [13]. Besides enabling nanocubes and imMens to perform more sophis-
ticated analyses, these connections provide an exciting avenue for future work. We envision
a world where exploratory statistics, in general, are assumed to have access to a data cube like those of imMens or nanocubes. It is plausible that techniques that require repeated scanning
through the dataset (EM, clustering, stochastic gradient descent) can benefit from judicious
preaggregation.
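A minimal sketch of the commutative-monoid view follows: any per-cell payload with an identity element and an associative, commutative combine operation can be preaggregated and merged in any order. The MeanMonoid class below is a made-up example for illustration, not code from [13] or from any of the systems discussed above.

```python
# A minimal sketch of the "cells live in a commutative monoid" idea: any
# payload with an identity element and an associative, commutative
# `combine` can be pre-aggregated in chunks and merged in any order.
from dataclasses import dataclass

@dataclass(frozen=True)
class MeanMonoid:
    count: int = 0          # the default instance is the identity element
    total: float = 0.0

    def combine(self, other: "MeanMonoid") -> "MeanMonoid":
        return MeanMonoid(self.count + other.count, self.total + other.total)

    @staticmethod
    def of(x: float) -> "MeanMonoid":
        return MeanMonoid(1, x)

    @property
    def mean(self) -> float:
        return self.total / self.count

# Partial aggregates for two chunks of the data can be combined later,
# in any order, and still give the exact global mean.
chunk_a = MeanMonoid()
for x in [1.0, 2.0, 3.0]:
    chunk_a = chunk_a.combine(MeanMonoid.of(x))
chunk_b = MeanMonoid()
for x in [10.0, 20.0]:
    chunk_b = chunk_b.combine(MeanMonoid.of(x))
print(chunk_a.combine(chunk_b).mean)   # 7.2, the exact mean of all five values
```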
5.3 Sampling
Besides preaggregation, the other general class of strategies for low-latency visualization is sampling. The idea again is simple: if we are interested in understanding the
behavior of a large population (in this case, our dataset), then it should be enough for us to
study a small subset of this population, intuitively for the same reason that it is enough, in
many statistical analyses, to study a sample instead of the entire population.
In the context of general interactive query engines, BlinkDB is the best known and cur-
rent state of the art [1]. BlinkDB provides a typical SQL querying interface enriched with time-bound or error-bound clauses. Its engine then builds a multidimensional stratified sample that is updated in a streaming fashion until either of the bounds is satisfied. The main
complication lies in building indices capable of efficiently updating pieces of the stratified
sample, using a generalization of Hellerstein et al.’s online aggregation algorithms [12] that takes into account the additional knowledge available from the requested bounds. Fisher
et al.’s experiment with analysts in sampleAction [8] suggests that incremental results and
confidence bounds are typically interpreted correctly. This provides evidence that the faster
cycle “query, examine results, think something new, repeat” is not happening at the expense
of a significantly worse data analysis capacity.
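The sketch below illustrates the general mechanism behind bounded-time and bounded-error answers: stream a random sample, maintain a running estimate with a normal-approximation confidence half-width, and stop when either bound is met. It is a simplification for illustration, not BlinkDB's engine, its SQL syntax, or Hellerstein et al.'s algorithm.

```python
# A sketch of the stopping rule behind bounded-error / bounded-time
# approximate answers: stream a sample, keep a running mean and a
# normal-approximation confidence half-width, and stop when either the
# requested error or the time budget is hit.
import math, random, time

def approximate_mean(draw, error_bound=None, time_bound_s=None, z=1.96,
                     min_samples=100):
    assert error_bound is not None or time_bound_s is not None
    n, s, ss = 0, 0.0, 0.0
    start = time.monotonic()
    while True:
        x = draw()
        n += 1; s += x; ss += x * x
        if n >= min_samples:
            mean = s / n
            var = max(ss / n - mean * mean, 0.0)
            half_width = z * math.sqrt(var / n)
            if error_bound is not None and half_width <= error_bound:
                return mean, half_width, n
            if time_bound_s is not None and time.monotonic() - start >= time_bound_s:
                return mean, half_width, n

# e.g. estimate the mean of a synthetic column to within +/- 0.05
est, hw, n = approximate_mean(lambda: random.gauss(10.0, 2.0), error_bound=0.05)
print(f"{est:.3f} +/- {hw:.3f} after {n} samples")
```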
Another significant advantage of sampling engines, as opposed to aggregation engines,
is that it is much simpler to design a sampling engine that (at some level) has full row-level
access to the database. This usually means that the results returned by sampling engines can be richer than those of aggregation engines. This retained flexibility means
potentially richer visualizations, but also simpler integration with existing visualization sys-
tems. If your application requirements include having (eventual, when the sampling engine
converges) access to full-resolution elements of a dataset, then a sampling engine is probably
a safer choice.
Nevertheless, sampling engines present one serious shortcoming for interactive visualiza-
tion purposes that must be kept in mind. Simply put, in the current state of the art, it is
not straightforward to generate high-confidence, low-error samples for many visualizations
of interest. For example, in a two-dimensional heatmap to be presented on a typical laptop
screen, the database engine would have to run about one million independent stratified
sampling runs (one for each pixel). It is not clear that the approaches in the literature scale
well to a large query set of this kind. If the system instead provides simple incremental sampling without replacement (as sampleAction does [8]), then the risk is that
this sampling will never actually produce acceptable error rates. This happens because the
heatmap being presented might have a colormap that portrays a wide dynamic range of
possible values. Consider, for example, a logarithmically increasing density plot: the range
of cardinalities in each pixel can easily go from tens to hundreds of thousands. The first
example from Liu et al.’s paper showcases this problem very elegantly: a naive sample
from a geographical dataset destroys much of the high-resolution structure that makes a
visualization valuable in the first place.
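A back-of-the-envelope calculation makes the dynamic-range problem concrete: under uniform sampling, the relative standard error of a pixel's estimated count grows as its true count shrinks, so the dim pixels that a logarithmic colormap is designed to reveal demand sample sizes approaching, or exceeding, the full dataset. The counts below are illustrative, not taken from any of the cited experiments.

```python
# A back-of-the-envelope check of the dynamic-range problem: under
# uniform sampling, a pixel whose true count is c out of N events has a
# relative standard error of roughly sqrt((1 - c/N) / (n * c/N)) when n
# rows are sampled, so low-count pixels need enormous samples.
def samples_needed(c, N, rel_error):
    p = c / N
    return (1 - p) / (p * rel_error ** 2)    # n such that SE(c_hat)/c <= rel_error

N = 100_000_000                              # 1e8 events in the dataset
for c in (100_000, 1_000, 10):               # bright, medium, and dim pixels
    n = samples_needed(c, N, rel_error=0.1)  # 10% relative error
    print(f"true count {c:>7}: need ~{n:,.0f} sampled rows")
# The dim pixel needs ~1e9 samples: more rows than the dataset contains.
```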
5.4 Graphical Considerations
The final class of techniques we will discuss exploits the peculiarities of the human visual
system to produce faster results. In other words, these techniques settle for approximate
results, but approximate in a perceptual sense. This is done in an attempt to save work that
would not have produced any perceptible change in the final image. This is an attractive
proposition in principle, but in practice it requires tight integration between the visualiza-
tion and the querying system, because even a change in the visualization that is typically
considered innocuous, such as a change in the color scale or the transformation of one of
the axes of the plot (from linear to logarithmic, for example), can cause the computational
trade-offs behind the perceptibility approximations to change wildly.
The techniques discussed in this (admittedly shorter) section are the closest to current
research and, as such, do not appear to be readily available in any open-source software
package. Nevertheless, they point to important problems in current systems and in the
direction of where we can expect future advances to happen.
The first system we highlight is Battle et al.’s Scalr, which introduces knowledge about
the perceptually relevant variables directly into the query optimizer of a database engine [25]. This allows Scalr to transform an exact query into an approximate one by a variety of different means, including both aggregation and sampling as discussed above. Scalr is implemented on top of SciDB [22], whose focus on array data provides attractive
operators for the manipulation of multiresolution constructs that tend to be relevant in
visualization settings such as images and heatmaps.
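As a rough illustration of the kind of rewrite such a system can apply (using plain NumPy rather than SciDB's or Scalr's actual operators), the sketch below aggregates a data-resolution array down to roughly the number of pixels the heatmap will occupy on screen before it is shipped to the client.

```python
# A sketch of the resolution-reduction idea described above: before
# answering, sum-aggregate a 2-D array into blocks so the result fits
# the on-screen heatmap.  This mimics the spirit of the optimization;
# it does not use SciDB's or Scalr's operators.
import numpy as np

def reduce_to_screen(array, screen_shape):
    """Sum-aggregate a 2-D array into blocks so the result fits the screen."""
    fy = max(1, array.shape[0] // screen_shape[0])
    fx = max(1, array.shape[1] // screen_shape[1])
    ny, nx = array.shape[0] // fy, array.shape[1] // fx
    trimmed = array[:ny * fy, :nx * fx]
    return trimmed.reshape(ny, fy, nx, fx).sum(axis=(1, 3))

full_res = np.random.poisson(2.0, size=(4096, 4096))     # data-resolution heatmap
on_screen = reduce_to_screen(full_res, (512, 512))        # what the display can show
print(full_res.shape, "->", on_screen.shape)              # (4096, 4096) -> (512, 512)
```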
The second technique we highlight is Kim et al.’s sampling with ordering guarantees [14].
This is an application of the general strategy of sampling, but done under a specific per-
ceptual constraint. The authors assume that a visualization will be consumed mostly for
the ordering relationships it depicts. In other words, they assume that the bulk of the vi-
sual information to be processed in a bar chart is the relative ordering of the bar heights.
With that assumption in mind, they design an online streaming update algorithm that
very quickly converges (about 25× faster than other sampling strategies, with a total time below 1 s on 10^10 entries in the dataset) to the perceptually correct result. Whether or
not ordering is a realistic perceptual component to be preserved can be set aside for now:
the fundamental innovation here is that knowledge about perceptual properties is driving
the algorithm. In principle, this technique should be generalizable to other settings, and it
shows significant promise for the future of big data visualization.
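To convey how a perceptual goal can drive the sampling itself, here is a deliberately naive sketch: sample every bar's group until the normal-approximation confidence intervals of the estimated heights no longer overlap, at which point the displayed ordering is stable. This stopping rule is a stand-in for illustration only and is not Kim et al.'s algorithm [14].

```python
# A deliberately naive sketch of sampling until only the *ordering* of
# the bars is trustworthy.  Growing every group's sample until the
# confidence intervals are disjoint is a stand-in for illustration; it
# is not Kim et al.'s algorithm [14].
import math
import random

def order_bars(draws, z=3.0, batch=200, max_rounds=1000):
    stats = {g: [0, 0.0, 0.0] for g in draws}      # n, sum, sum of squares
    ordered = list(draws)
    for _ in range(max_rounds):
        intervals = {}
        for g, draw in draws.items():
            st = stats[g]
            for _ in range(batch):                  # take another batch of samples
                x = draw()
                st[0] += 1; st[1] += x; st[2] += x * x
            n, s, ss = st
            mean = s / n
            hw = z * math.sqrt(max(ss / n - mean * mean, 0.0) / n)
            intervals[g] = (mean - hw, mean + hw)
        ordered = sorted(intervals, key=lambda g: intervals[g][0])
        disjoint = all(intervals[a][1] < intervals[b][0]
                       for a, b in zip(ordered, ordered[1:]))
        if disjoint:                                # ordering is now stable
            break
    return ordered

bars = {"A": lambda: random.gauss(3.0, 1.0),
        "B": lambda: random.gauss(3.5, 1.0),
        "C": lambda: random.gauss(5.0, 1.0)}
print(order_bars(bars))   # expected: ['A', 'B', 'C']
```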
5.5 Outlook and Open Issues
Let us take a step back and look at the area more broadly, now that we know the main techniques at our disposal. Binning strategies attack the problem by observing
that screen resolution is finite, and that careful precomputation can be sufficiently efficient
and rich. Sampling strategies, by contrast, try to spend data-accessing effort only where
the result uncertainty is largest, quickly converging to an approximation of the result in a
statistical sense. Notice, then, that both classes of techniques are actually trying to build
effective approximations of the exact result without requiring a full scan of the data.
We believe that the overall strategy of approximate query processing is the best hope for
interactive data analysis. Specifically, we believe that perceptual approximations are going
to become a central part of the design of algorithms for interactive visualization. We seek an
infrastructure where richer mathematical descriptions of perceptual characteristics—take,
for example, Harrison et al.’s recent generalization of Weber’s law for perception of corre-
lation [11] or Demiralp et al.’s crowdsourcing of perceptual kernels [6]—can be translated
into potential efficiency gains by expressing the perceptual landscape of the visualizations in
the computing language itself.
In essence, we are advocating here for a concerted effort at pushing the visualization
concerns into the back end (be it an analysis back end or a database back end), rather than relegating them to front ends. This will certainly bring its own set of problems, not
the least of which being that separating the back end from the front end is currently done
for very good engineering reasons: modularity, separation of concerns, etc. Still, the current
situation of state-of-the-art visualization APIs provides a good illustration of the problems
we face. We enjoy a large degree of modularity and independence: the de facto front end for
visualizations has become the web browser, by means of modern APIs such as d3 [5] and
HTML5 features such as scalable vector graphics [18]. Web browsers, by design, provide
automatic distributed operation, and data back ends vary widely in sophistication. At the
same time, these libraries fail to scale properly when datasets grow, eventually requiring the
very solutions we discussed earlier in this chapter, special purpose as they are. So although
we look forward to one day having general, modular, and scalable visualization back ends,
our best bet for the near-term future seems to be on a tightly integrated front-end–back-end
design based on perceptual considerations of the final visual artifact being displayed.
We also look forward to a deeper consideration of perceptual limitations of the outputs
of data analysis algorithms themselves. As a naive example, if the output of a clustering
algorithm will be displayed on a screen, then we do not need to know the values of cluster
centers to a resolution finer than that of the screen. The same argument holds for many
other potentially costly operations such as model fitting and supervised learning algorithms.
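For concreteness, here is a sketch of that naive example: Lloyd's iterations for k-means that stop as soon as no center moves farther than the data range covered by a single screen pixel. The pixel size and the synthetic data are assumptions made purely for illustration.

```python
# A sketch of the naive example above: stop k-means (Lloyd's iterations)
# once no center moves farther than the data range spanned by one
# on-screen pixel, since finer convergence would be invisible in the plot.
import numpy as np

def kmeans_to_pixel_precision(points, k, pixel_size, rng, max_iter=100):
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each point to its nearest center.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([points[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        shift = np.linalg.norm(new_centers - centers, axis=1).max()
        centers = new_centers
        if shift < pixel_size:        # further refinement would not move a pixel
            break
    return centers

rng = np.random.default_rng(1)
pts = np.concatenate([rng.normal(loc, 0.3, size=(500, 2)) for loc in ([0, 0], [5, 5])])
# Say the scatterplot maps a data range of ~10 units onto ~1000 pixels.
print(kmeans_to_pixel_precision(pts, k=2, pixel_size=10 / 1000, rng=rng))
```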
References
1. Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, and
Ion Stoica. BlinkDB: Queries with bounded errors and bounded response times on very
large data. In Proceedings of the 8th ACM European Conference on Computer Systems,
pp. 29–42, Prague, Czech Republic, 2013.
2. James Ahrens, Sébastien Jourdain, Patrick O’Leary, John Patchett, David H. Rogers,
and Mark Petersen. An image-based approach to extreme scale in situ visualization and
analysis. In Proceedings of the International Conference for High Performance Comput-
ing, Networking, Storage and Analysis, pp. 424–434, Piscataway, NJ, 2014.
3. Francis J. Anscombe. Graphs in statistical analysis. The American Statistician,
27(1):17–21, 1973.
4. Richard A. Becker and William S. Cleveland. Brushing scatterplots. Technometrics,
29(2):127–142, 1987.
5. Michael Bostock, Vadim Ogievetsky, and Jeffrey Heer. D³: Data-driven documents.
IEEE Transactions on Visualization and Computer Graphics, 17(12):2301–2309, 2011.
6. Cagatay Demiralp, Michael Bernstein, and Jeffrey Heer. Learning perceptual kernels
for visualization design. IEEE Transactions on Visualization and Computer Graphics,
20(12):1933–1942, 2014.
7. Bradley Efron and Robert J. Tibshirani. An Introduction to the Bootstrap. Chapman
& Hall/CRC Press, Boca Raton, FL, 1994.
8. Danyel Fisher, Igor Popov, Steven Drucker, and M.C. Schraefel. Trust me, I’m partially
right: Incremental visualization lets analysts explore large datasets faster. In Proceedings
of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1673–1682,
ACM, New York, 2012.
9. Parke Godfrey, Jarek Gryz, and Piotr Lasek. Interactive visualization of large data sets.
Technical Report EECS-2015-03, York University, Toronto, Canada, 2015.
10. Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, Murali
Venkatrao, Frank Pellow, and Hamid Pirahesh. Data cube: A relational aggregation
operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge
Discovery, 1(1):29–53, 1997.
11. Lane Harrison, Fumeng Yang, Steven Franconeri, and Remco Chang. Ranking visu-
alizations of correlation using Weber’s law. IEEE Transactions on Visualization and
Computer Graphics, 20(12):1943–1952, 2014.
12. Joseph M. Hellerstein, Peter J. Haas, and Helen J. Wang. Online aggregation. ACM
SIGMOD Record, 26(2):171–182, 1997.
13. Michael Izbicki. Algebraic classifiers: A generic approach to fast cross-validation, online
training, and parallel training. In Proceedings of the 30th International Conference on
Machine Learning, pp. 648–656, Atlanta, GA, 2013.
14. Albert Kim, Eric Blais, Aditya Parameswaran, Piotr Indyk, Samuel Madden, and Ronitt
Rubinfeld. Rapid sampling for visualizations with ordering guarantees. ArXiv e-prints,
December 2014.
15. Lauro Lins, James T. Klosowski, and Carlos Scheidegger. Nanocubes for real-time explo-
ration of spatiotemporal datasets. IEEE Transactions on Visualization and Computer
Graphics, 19(12):2456–2465, 2013.
16. Zhicheng Liu and Jeffrey Heer. The effects of interactive latency on exploratory
visual analysis. IEEE Transactions on Visualization and Computer Graphics, 20(12):
2122–2131, December 2014.
17. Zhicheng Liu, Biye Jiang, and Jeffrey Heer. imMens: Real-time visual querying of big
data. In Proceedings of the 15th Eurographics Conference on Visualization, Computer
Graphics Forum, vol. 32, pp. 421–430, John Wiley & Sons, Chichester, 2013.
18. Mozilla Developer Network. SVG: Scalable Vector Graphics. https://developer.mozilla.
org/en-US/docs/Web/SVG.
19. Thomas Porter and Tom Duff. Compositing digital images. SIGGRAPH Computer
Graphics, 18(3):253–259, January 1984.