Divide and Recombine 45
operations are a bonus: the same data used for D&R can also serve other parallel computing purposes. Tessera connects to Spark through the SparkR package [19], which exposes the Spark API in the R console.
At the time of this writing, support in Tessera for Spark is experimental: it has been implemented and works, and adding it is a testament to Tessera's flexibility in being back-end agnostic, but it has been tested only with rather small datasets.
3.6 Discussion
In this chapter, we have presented one point of view on methodology and computational tools for deep statistical analysis and visualization of large complex data. D&R is attractive because of its simplicity and because it makes a wide array of methods available without requiring scalable reimplementations of them. D&R also builds on approaches that are already very popular with small data, particularly implementations of the split-apply-combine paradigm such as the plyr and dplyr R packages. D&R as implemented in datadr is future-proof by design, able to adopt improved back-end technology as it comes along. All of these factors give D&R a high chance of success. However, there is a great need for more research and software development to extend D&R to more statistical domains and to make it easier to program.
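The split-apply-combine pattern that D&R builds on can be illustrated in a few lines of base R. This is a minimal sketch, not datadr's actual API: `split`, `lapply`, and `rbind` stand in for datadr's distributed division and recombination, and the `mtcars` data and `lm` model are illustrative choices, not examples from the chapter.

```r
# Divide: split the data frame into subsets by a conditioning variable
subsets <- split(mtcars, mtcars$cyl)

# Apply: fit the same model to each subset independently
fits <- lapply(subsets, function(d) coef(lm(mpg ~ wt, data = d)))

# Recombine: bind the per-subset coefficients into one result
result <- do.call(rbind, fits)
```

Because each subset's model is fit independently, the apply step parallelizes trivially on a distributed back end; the recombination here is a simple row bind of per-subset coefficients.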
References
1. Richard A. Becker, William S. Cleveland, and Ming-Jen Shyu. The visual design
and control of trellis display. Journal of Computational and Graphical Statistics,
5(2):123–155, 1996.
2. Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
3. William S. Cleveland and Ryan Hafen. Divide and recombine (D&R): Data science
for large complex data. Statistical Analysis and Data Mining: The ASA Data Science
Journal, 7(6):425–433, 2014.
4. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
5. Saptarshi Guha. Computing environment for the statistical analysis of large and complex data. PhD thesis (advisor: William S. Cleveland), Department of Statistics, Purdue University, West Lafayette, IN, 2010.
6. Saptarshi Guha, Ryan Hafen, Jeremiah Rounds, Jin Xia, Jianfu Li, Bowei Xi, and
William S. Cleveland. Large complex data: Divide and recombine (D&R) with RHIPE.
Stat, 1(1):53–67, 2012.
7. Saptarshi Guha, Paul Kidwell, Ryan Hafen, and William S. Cleveland. Visualization
databases for the analysis of large complex datasets. In International Conference on
Artificial Intelligence and Statistics, pp. 193–200, 2009.
46 Handbook of Big Data
8. Apache Hadoop. Hadoop, 2009.
9. Ryan Hafen, Luke Gosink, Jason McDermott, Karin Rodland, Kerstin Kleese-Van Dam,
and William S. Cleveland. Trelliscope: A system for detailed visualization in the deep
analysis of large complex data. In IEEE Symposium on Large-Scale Data Analysis and
Visualization (LDAV), pp. 105–112. IEEE, Atlanta, GA, 2013.
10. Michael J. Kane. Scatter matrix concordance: A diagnostic for regressions on subsets
of data. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2015.
11. Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar, and Michael I. Jordan. A scalable
bootstrap for massive data. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 2014.
12. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria, 2012.
13. Steven L. Scott, Alexander W. Blocker, Fernando V. Bonassi, Hugh A. Chipman,
Edward I. George, and Robert E. McCulloch. Bayes and big data: The consensus Monte
Carlo algorithm. In EFaB@Bayes 250 Conference, volume 16, 2013.
14. Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop
distributed file system. In IEEE 26th Symposium on Mass Storage Systems and Tech-
nologies, pp. 1–10. IEEE, Incline Village, NV, 2010.
15. Luke Tierney, Anthony Rossini, Na Li, and Han Sevcikova. SNOW: Simple Network of
Workstations. R package version 0.3-13, 2013.
16. Edward R. Tufte. Visual Explanations: Images and Quantities, Evidence and Narrative,
volume 36. Graphics Press, Cheshire, CT, 1997.
17. John W. Tukey. Exploratory Data Analysis. Addison-Wesley, Reading, MA, 1977.
18. John W. Tukey and Paul A. Tukey. Computer graphics and exploratory data analysis:
An introduction. The Collected Works of John W. Tukey: Graphics: 1965–1985, 5:419,
1988.
19. Shivaram Venkataraman. SparkR: R frontend for Spark. R package version 0.1, 2013.
20. Hadley Wickham. The split-apply-combine strategy for data analysis. Journal of Sta-
tistical Software, 40(1):1–29, 2011.
21. Leland Wilkinson, Anushka Anand, and Robert L. Grossman. Graph-theoretic scagnos-
tics. In INFOVIS, volume 5, p. 21, 2005.
22. Leland Wilkinson and Graham Wills. Scagnostics distributions. Journal of Computa-
tional and Graphical Statistics, 17(2):473–491, 2008.
23. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion
Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX
Conference on Hot Topics in Cloud Computing, pp. 10–10, 2010.