Divide and Recombine 45
operations are a bonus: the same data used for D&R can also serve other parallel computing purposes. Tessera connects to Spark through the SparkR package [19], which exposes the Spark API in the R console.
At the time of this writing, support in Tessera for Spark is experimental: it has been implemented and works, and adding it is a testament to Tessera's flexibility in being back-end agnostic, but it has been tested only with rather small datasets.
3.6 Discussion
In this chapter, we have presented one point of view on methodology and computational tools for deep statistical analysis and visualization of large complex data. D&R is attractive because of its simplicity and because it makes a wide array of methods available without requiring scalable reimplementations of them. D&R also builds on approaches that are already very popular with small data, particularly implementations of the split-apply-combine paradigm such as the plyr and dplyr R packages. D&R as implemented in datadr is future-proof by design, able to adopt improved back-end technology as it comes along. All of these factors give D&R a high chance of success. However, there is a great need for more research and software development to extend D&R to more statistical domains and to make it easier to program.
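The split-apply-combine pattern that D&R builds on can be illustrated in a few lines of base R. This is a minimal sketch, not datadr's actual API: `split`, `lapply`, and `rbind` stand in for datadr's distributed division and recombination, and the `mtcars` data and `lm` model are illustrative choices, not examples from the chapter.

```r
# Divide: split the data frame into subsets by a conditioning variable
subsets <- split(mtcars, mtcars$cyl)

# Apply: fit the same model to each subset independently
fits <- lapply(subsets, function(d) coef(lm(mpg ~ wt, data = d)))

# Recombine: bind the per-subset coefficients into one result
result <- do.call(rbind, fits)
```

Because each subset's model is fit independently, the apply step parallelizes trivially on a distributed back end; the recombination here is a simple row bind of per-subset coefficients.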
References
1. Richard A. Becker, William S. Cleveland, and Ming-Jen Shyu. The visual design
and control of trellis display. Journal of Computational and Graphical Statistics,
5(2):123–155, 1996.
2. Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
3. William S. Cleveland and Ryan Hafen. Divide and recombine (D&R): Data science
for large complex data. Statistical Analysis and Data Mining: The ASA Data Science
Journal, 7(6):425–433, 2014.
4. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
5. Saptarshi Guha. Computing environment for the statistical analysis of large and complex data. PhD thesis (advisor: William S. Cleveland), Department of Statistics, Purdue University, West Lafayette, IN, 2010.
6. Saptarshi Guha, Ryan Hafen, Jeremiah Rounds, Jin Xia, Jianfu Li, Bowei Xi, and
William S. Cleveland. Large complex data: Divide and recombine (D&R) with RHIPE.
Stat, 1(1):53–67, 2012.
7. Saptarshi Guha, Paul Kidwell, Ryan Hafen, and William S. Cleveland. Visualization
databases for the analysis of large complex datasets. In International Conference on
Artificial Intelligence and Statistics, pp. 193–200, 2009.
46 Handbook of Big Data
8. Apache Hadoop. Hadoop, 2009.
9. Ryan Hafen, Luke Gosink, Jason McDermott, Karin Rodland, Kerstin Kleese-Van Dam,
and William S. Cleveland. Trelliscope: A system for detailed visualization in the deep
analysis of large complex data. In IEEE Symposium on Large-Scale Data Analysis and
Visualization (LDAV), pp. 105–112. IEEE, Atlanta, GA, 2013.
10. Michael J. Kane. Scatter matrix concordance: A diagnostic for regressions on subsets
of data. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2015.
11. Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar, and Michael I. Jordan. A scalable
bootstrap for massive data. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 2014.
12. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria, 2012.
13. Steven L. Scott, Alexander W. Blocker, Fernando V. Bonassi, Hugh A. Chipman,
Edward I. George, and Robert E. McCulloch. Bayes and big data: The consensus Monte
Carlo algorithm. In EFaB@Bayes 250 Conference, volume 16, 2013.
14. Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop
distributed file system. In IEEE 26th Symposium on Mass Storage Systems and Tech-
nologies, pp. 1–10. IEEE, Incline Village, NV, 2010.
15. Luke Tierney, Anthony Rossini, Na Li, and Han Sevcikova. SNOW: Simple Network of
Workstations. R package version 0.3-13, 2013.
16. Edward R. Tufte. Visual Explanations: Images and Quantities, Evidence and Narrative,
volume 36. Graphics Press, Cheshire, CT, 1997.
17. John W. Tukey. Exploratory Data Analysis. Addison-Wesley, Reading, MA, 1977.
18. John W. Tukey and Paul A. Tukey. Computer graphics and exploratory data analysis:
An introduction. The Collected Works of John W. Tukey: Graphics: 1965–1985, 5:419,
1988.
19. Shivaram Venkataraman. SparkR: R frontend for Spark. R package version 0.1, 2013.
20. Hadley Wickham. The split-apply-combine strategy for data analysis. Journal of Sta-
tistical Software, 40(1):1–29, 2011.
21. Leland Wilkinson, Anushka Anand, and Robert L. Grossman. Graph-theoretic scagnos-
tics. In INFOVIS, volume 5, p. 21, 2005.
22. Leland Wilkinson and Graham Wills. Scagnostics distributions. Journal of Computa-
tional and Graphical Statistics, 17(2):473–491, 2008.
23. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion
Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX
Conference on Hot Topics in Cloud Computing, pp. 10–10, 2010.