Let's learn how to use RStudio with SparkR in this section. RStudio is an integrated development environment (IDE) for R that improves developer productivity. It is open source and available on multiple platforms, including Linux, macOS, and Windows, with optional commercial support. It provides a familiar R shell, a syntax-highlighting editor that supports direct code execution, plotting tools, history, debugging, and workspace management. It comes in a desktop version and a server version for running centrally on a server. Let's use RStudio Server, which lets you do analytics from a web browser, using the following steps:
Download and install RStudio Server:

wget https://download2.rstudio.org/rstudio-server-rhel-0.99.903-x86_64.rpm
sudo yum -y install --nogpgcheck rstudio-server-rhel-0.99.903-x86_64.rpm
Open http://localhost:8787/auth-sign-in in a browser. Use the cloudera/cloudera credentials to log in to the RStudio server. Paste the following lines into the RStudio console and hit Enter. Let's use the same flights dataset we used earlier for this exercise:

.libPaths(c(.libPaths(), '/home/cloudera/spark-2.0.0-bin-hadoop2.7/R/lib'))
Sys.setenv(SPARK_HOME = '/home/cloudera/spark-2.0.0-bin-hadoop2.7')
Sys.setenv('SPARKR_SUBMIT_ARGS' = '"--master" "yarn" "--packages" "com.databricks:spark-avro_2.11:3.0.0" "sparkr-shell"')
Sys.setenv(PATH = paste(Sys.getenv(c('PATH')), '/home/cloudera/spark-2.0.0-bin-hadoop2.7/bin', sep = ':'))

library(SparkR)
sparkR.session(appName = "RStudio Application")
library(magrittr)

flights <- read.df("flights.csv", source = "csv", header = "true")

# Run a query to print the most frequent destinations from JFK airport
jfk_dest <- filter(flights, flights$origin == "JFK") %>%
  group_by(flights$dest) %>%
  summarize(count = n(flights$dest))
top_dests <- head(arrange(jfk_dest, desc(jfk_dest$count)))

# Finally, create a bar plot of the top destinations
barplot(top_dests$count, names.arg = top_dests$dest, col = rainbow(7),
        main = "Top Flight Destinations from JFK",
        xlab = "Destination", ylab = "Count", beside = TRUE)
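The same top-destinations query can also be expressed in SQL by registering the DataFrame as a temporary view. The following is a sketch that assumes the SparkR session and the flights DataFrame created above are still active:

```r
# Register the flights DataFrame as a temporary view (Spark 2.0+ SparkR API)
createOrReplaceTempView(flights, "flights")

# Express the same aggregation in SQL; head() returns the first six rows,
# matching the top_dests result above
jfk_sql <- sql("SELECT dest, count(*) AS count
                FROM flights
                WHERE origin = 'JFK'
                GROUP BY dest
                ORDER BY count DESC")
head(jfk_sql)
```

This is often convenient when a transformation is easier to read as SQL than as a chain of DataFrame operations.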
A bar plot shows up in the plots area, as follows:
Now, save the DataFrame in avro format:

write.df(flights, path = "flights.avro", source = "com.databricks.spark.avro", mode = "overwrite")
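To confirm the write from within the same SparkR session, the avro output can be read back into a new DataFrame using the same spark-avro package. A quick sketch:

```r
# Read the avro output back via the com.databricks:spark-avro package
flights_avro <- read.df("flights.avro", source = "com.databricks.spark.avro")

# The schema and row count should match the original flights DataFrame
printSchema(flights_avro)
count(flights_avro)
```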
Check the output on Hadoop as shown here:
[cloudera@quickstart ~]$ hadoop fs -ls flights.avro | awk '{print $8}'
flights.avro/_SUCCESS
flights.avro/part-r-00000-1a6133bd-0039-4dfd-972f-f1ba6b1d0385.avro
flights.avro/part-r-00001-1a6133bd-0039-4dfd-972f-f1ba6b1d0385.avro
Check the job status in the YARN ResourceManager and the Spark UI while executing these commands in RStudio.