The Apache Spark 1.4 release added SparkR, an R package on top of Spark that allows data analysts and data scientists to analyze large datasets and run jobs interactively in R on Spark platforms.
R is one of the most popular open source statistical programming languages, with a huge number (over 7,000) of community-supported packages for statistical analysis, machine learning, and visualization of data. However, interactive analytics in R is limited by its single-threaded execution and by memory: R can only process datasets that fit in a single computer's memory. SparkR is an R package, originally developed at the AMPLab at the University of California, Berkeley, that exposes the features of R on Spark's distributed computation engine, enabling us to run large-scale data analytics interactively using R. This chapter is divided into the following topics:
Let's understand the features and limitations of R and how SparkR helps in overcoming those limitations.
R is an open source software package for statistical analysis, machine learning, and visualization of data. The R project (https://www.r-project.org/) provides a programming language similar to S and S-PLUS. R runs on multiple platforms, such as Windows, Linux, Mac OS, and other Unix flavors. R was originally developed at the University of Auckland by Ross Ihaka and Robert Gentleman, and it is now maintained by the R Development Core Team. It is an implementation of the S language, which was developed by John Chambers. R is an interpreted programming language and one of the most popular open source statistical analysis packages.
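As a quick illustration of R's interactive style, the following snippet uses only base R functions on a built-in dataset; the dataset and the model are chosen purely for demonstration:

```r
# Summary statistics on a built-in dataset (mtcars ships with base R)
data(mtcars)
summary(mtcars$mpg)

# Fit a simple linear model: fuel efficiency as a function of weight
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)

# Visualize the relationship and the fitted regression line
plot(mtcars$wt, mtcars$mpg, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit, col = "red")
```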
The features of R are as follows:
The limitations of R are as follows:
R is the most preferred tool for data scientists. Figure 10.1 shows how R is typically used by data scientists in two example scenarios: big data is processed down to a subset that is then analyzed in R, or data in distributed storage such as HDFS is analyzed with MapReduce-based tools for R, such as rmr2 and RHive:
SparkR enables users to write R programs on Spark platforms. SparkR was introduced in Spark version 1.4 (June 2015), and in Spark 2.0 it supports the following features:
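As a minimal sketch of the Spark 2.0-style API (the sparkR.session entry point replaced sparkR.init in 2.0), the following creates a SparkDataFrame from a local R data frame and runs a simple aggregation; the local-mode master and the application name are chosen here just for illustration:

```r
library(SparkR)

# Start a SparkR session (Spark 2.0+ entry point); local[*] is used here for illustration
sparkR.session(master = "local[*]", appName = "SparkRIntro")

# Convert a local R data frame (the built-in faithful dataset) into a distributed SparkDataFrame
df <- as.DataFrame(faithful)

# Inspect the schema and a few rows
printSchema(df)
head(df)

# A simple aggregation executed on Spark, collected back as a local R data frame
avgWaiting <- collect(summarize(groupBy(df, "eruptions"), avg_waiting = avg(df$waiting)))
head(avgWaiting)

sparkR.session.stop()
```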
SparkR removes additional layers for creating subsets of data or using MapReduce-based frameworks, as shown in Figure 10.2:
SparkR provides an easy-to-use API and also offers the following benefits:
R integration with Hadoop is supported by many packages and tools. RHadoop, RHive, RHIPE, and Hadoop Streaming are a few of the ways to process data on Hadoop using the R language. All of these tools are based on MapReduce, so they perform poorly on iterative algorithms. In most cases, SparkR will be significantly faster than MapReduce-based implementations of R.
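One reason for this gap is that Spark can cache a dataset in memory and reuse it across queries, whereas every MapReduce job re-reads its input from disk. The following is a hedged sketch, assuming a running SparkR session; the HDFS path and the column names are hypothetical:

```r
# Read a CSV file from HDFS into a SparkDataFrame; the path and columns are hypothetical
flights <- read.df("hdfs:///data/flights.csv", source = "csv",
                   header = "true", inferSchema = "true")

# Cache the data in memory so repeated queries do not re-read it from HDFS
cache(flights)

# Several interactive queries reuse the same cached data
nrow(flights)
head(summarize(groupBy(flights, "carrier"), avg_delay = avg(flights$arr_delay)))
head(filter(flights, flights$arr_delay > 60))
```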
The SparkR architecture is similar to the PySpark architecture. Typically, users launch SparkR from a shell, an IDE, or RStudio and then load the SparkR package. This creates a SparkContext using the R-JVM bridge, which in turn spawns executor JVMs on the workers. Tasks are then shipped from the driver to the executors, which fork R processes to execute them. The executors interact directly with the Data Sources API to create and process DataFrames. This is depicted in Figure 10.4:
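The following is a hedged sketch of the launch sequence described above, as it would look from RStudio rather than the bundled sparkR shell; the SPARK_HOME path is an assumption and depends on where Spark is installed:

```r
# Point R at the Spark installation; this path is an assumption, adjust it to your environment
Sys.setenv(SPARK_HOME = "/opt/spark")

# Load the SparkR package shipped inside SPARK_HOME
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

# Starting the session creates the R-JVM bridge on the driver;
# executor JVMs (and the R processes they fork) are launched on the workers
sparkR.session(master = "local[*]", appName = "SparkRArchitecture")

# Any SparkDataFrame operation from here on is translated through the bridge
df <- createDataFrame(mtcars)
head(df)

sparkR.session.stop()
```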