Chapter 10. Interactive Analytics with SparkR

The Apache Spark 1.4 release added SparkR, an R package on top of Spark that allows data analysts and data scientists to analyze large datasets and run jobs interactively in the R language on Spark.

R is one of the most popular open source statistical programming languages, with a huge number (over 7,000) of community-supported packages for statistical analysis, machine learning, and visualization of data. Interactive analytics in R, however, is limited by single-threaded execution and by memory: R can only process datasets that fit in a single computer's memory. SparkR is an R package, originally developed at the AMPLab at the University of California, Berkeley, that provides the features of R on top of Spark's distributed computation engine, enabling us to run large-scale data analytics interactively using R. This chapter is divided into the following topics:

  • Introducing R and SparkR
  • Getting started with SparkR
  • Using DataFrames with SparkR
  • Using SparkR with RStudio
  • Machine learning with SparkR
  • Using SparkR with Zeppelin

Introducing R and SparkR

Let's understand the features and limitations of R, and how SparkR helps to overcome those limitations.

What is R?

R is an open source software package for statistical analysis, machine learning, and visualization of data. The R project (https://www.r-project.org/) provides a simple programming language in the tradition of S and S-PLUS. R can be used on multiple platforms, such as Windows, Linux, macOS, and other Unix flavors. R was originally developed at the University of Auckland by Ross Ihaka and Robert Gentleman, and it is now maintained by the R Core Team. It is an implementation of the S language, which was developed by John Chambers. R is an interpreted programming language, and is one of the most popular open source statistical analysis packages.

The R features are as follows:

  • Open source, with over 7,000 packages
  • Stable statistics, graphics, and general-purpose packages
  • Allows R objects to be manipulated directly from C, C++, and Java
  • Command-line and IDE support
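
To illustrate what interactive analysis looks like in plain R, here is a small single-machine session using R's built-in faithful dataset; everything below runs single-threaded, entirely in local memory:

    # Interactive analysis in plain R: single-threaded and in-memory
    data(faithful)                                     # Old Faithful geyser data
    summary(faithful)                                  # basic summary statistics
    model <- lm(waiting ~ eruptions, data = faithful)  # simple linear regression
    summary(model)$r.squared                           # goodness of fit
    hist(faithful$waiting)                             # quick visualization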

The limitations of R are as follows:

  • Single-threaded
  • Data has to fit in memory

R is among the tools most preferred by data scientists. Figure 10.1 shows two typical patterns in which data scientists use R with big data: either the big data is reduced to a subset small enough to be processed in R on a single machine, or the data stays in distributed storage such as HDFS and is analyzed with MapReduce-based tools for R, such as rmr2 and RHive:


Figure 10.1: Processing patterns using R

Introducing SparkR

SparkR enables users to write R programs on Spark. SparkR was introduced in Spark version 1.4 (June 2015) and, as of Spark 2.0, supports the following features:

  • Distributed DataFrames based on Spark's DataFrame API
  • Distributed machine learning using MLlib (see the sketch below)
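
As a minimal sketch of these two features (assuming a local Spark 2.x installation with the SparkR package on the library path), the following session distributes a local R data frame and fits a generalized linear model through MLlib:

    # A minimal sketch, assuming SparkR 2.x is installed and loadable
    library(SparkR)
    sparkR.session(master = "local[*]", appName = "SparkRFeatures")

    df <- as.DataFrame(faithful)                 # distributed DataFrame
    model <- spark.glm(df, waiting ~ eruptions,  # distributed GLM via MLlib
                       family = "gaussian")
    summary(model)

    sparkR.session.stop()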

SparkR removes the additional layers of creating data subsets or of using MapReduce-based frameworks, as shown in Figure 10.2:


Figure 10.2: SparkR processing

SparkR provides an easy-to-use API, and also offers the following benefits:

  • DataSources API: Spark SQL's DataSources API enables SparkR to read data from a variety of built-in sources, such as JSON, Parquet, and JDBC, as well as external sources such as Hive tables, CSV files, XML files, and so on (see the sketch after this list).
  • DataFrame optimizations: Spark SQL's Catalyst optimizer provides optimizations for DataFrames, such as code generation and memory management. Figure 10.3 shows that these Spark computation engine optimizations make SparkR performance similar to that of Scala and Python. Refer to https://databricks.com/blog/2015/06/09/announcing-sparkr-r-on-spark.html:

    Figure 10.3: SparkR performance with Scala and Python

  • Higher scalability: Distributed DataFrames created with SparkR are partitioned across all the nodes of the Spark cluster, which enables SparkR DataFrames to process terabytes of data.
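
To make the DataSources API concrete, here is a hedged sketch of reading two built-in formats from SparkR (Spark 2.x); the file paths are hypothetical placeholders:

    # A sketch of the DataSources API from SparkR; paths are placeholders
    library(SparkR)
    sparkR.session()

    people <- read.df("examples/people.json",    source = "json")
    events <- read.df("examples/events.parquet", source = "parquet")

    printSchema(people)
    head(filter(people, people$age > 21))        # Catalyst-optimized filter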

R integration with Hadoop is supported by many packages and tools. RHadoop, RHive, RHIPE, and Hadoop Streaming are a few of the ways to process data on Hadoop using the R language. All of these tools are based on MapReduce, so they perform poorly on iterative algorithms. In most cases, SparkR is significantly faster than these MapReduce-based implementations of R.

Architecture of SparkR

The SparkR architecture is similar to the PySpark architecture. Typically, a user launches SparkR from a shell, an IDE, or RStudio, and then loads the SparkR package. This creates a SparkContext through the R-JVM bridge, which in turn spawns executor JVMs on the workers. Tasks are then shipped from the driver to the executors, which fork R processes to execute them. The executors deal directly with the DataSources API to create and process DataFrames. This is depicted in Figure 10.4:


Figure 10.4: SparkR architecture
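
The launch sequence described above can be sketched from a plain R session as follows; the SPARK_HOME path and the YARN master here are placeholder assumptions for your environment:

    # Hypothetical install path and cluster manager; adjust for your setup
    Sys.setenv(SPARK_HOME = "/opt/spark")
    library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

    # Creates the SparkContext via the R-JVM bridge; executor JVMs are
    # launched on the workers, and tasks fork R processes for execution
    sparkR.session(master = "yarn", appName = "SparkRArchitecture")

    df <- read.df("hdfs:///data/people.json", source = "json")  # DataSources API
    count(df)                                    # runs as distributed tasks

    sparkR.session.stop()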
