Let's learn how to use RStudio with SparkR in this section. RStudio is an integrated development environment (IDE) for R that improves developer productivity. It is open source and available on multiple platforms, including Linux, macOS, and Windows, with optional commercial support. It provides a familiar R shell, a syntax-highlighting editor that supports direct code execution, plotting tools, history, debugging, and workspace management. It comes in a desktop version and a server version for running centrally on a server. Let's use RStudio Server, which lets you do analytics from a web browser, using the following steps:
Download and install RStudio Server:

wget https://download2.rstudio.org/rstudio-server-rhel-0.99.903-x86_64.rpm
sudo yum -y install --nogpgcheck rstudio-server-rhel-0.99.903-x86_64.rpm
Open http://localhost:8787/auth-sign-in in a browser. Use the cloudera/cloudera credentials to log in to the RStudio server. Paste the following lines into the RStudio console and hit Enter. Let's use the same flights dataset we used earlier for this exercise:

.libPaths(c(.libPaths(), '/home/cloudera/spark-2.0.0-bin-hadoop2.7/R/lib'))
Sys.setenv(SPARK_HOME = '/home/cloudera/spark-2.0.0-bin-hadoop2.7')
Sys.setenv('SPARKR_SUBMIT_ARGS' = '"--master" "yarn" "--packages" "com.databricks:spark-avro_2.11:3.0.0" "sparkr-shell"')
Sys.setenv(PATH = paste(Sys.getenv(c('PATH')), '/home/cloudera/spark-2.0.0-bin-hadoop2.7/bin', sep = ':'))

library(SparkR)
sparkR.session(appName = "RStudio Application")
library(magrittr)

flights <- read.df("flights.csv", source = "csv", header = "true")

# Run a query to print the most frequent destinations from JFK airport
jfk_dest <- filter(flights, flights$origin == "JFK") %>%
  group_by(flights$dest) %>%
  summarize(count = n(flights$dest))
top_dests <- head(arrange(jfk_dest, desc(jfk_dest$count)))

# Finally, create a bar plot of the top destinations
barplot(top_dests$count, names.arg = top_dests$dest, col = rainbow(7),
        main = "Top Flight Destinations from JFK",
        xlab = "Destination", ylab = "Count", beside = TRUE)
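The same top-destinations query can also be expressed in SQL by registering the DataFrame as a temporary view. The following is a sketch that assumes the SparkR session and the flights DataFrame created above are still active:

```r
# Register the flights DataFrame as a temporary view (Spark 2.0+ SparkR API)
createOrReplaceTempView(flights, "flights")

# Express the same aggregation in SQL; head() returns the first six rows,
# matching the top_dests result above
jfk_sql <- sql("SELECT dest, count(*) AS count
                FROM flights
                WHERE origin = 'JFK'
                GROUP BY dest
                ORDER BY count DESC")
head(jfk_sql)
```

This is often convenient when a transformation is easier to read as SQL than as a chain of DataFrame operations.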
A bar plot shows up in the plots area, as follows:
Now, save the DataFrame in avro format:

write.df(flights, path = "flights.avro", source = "com.databricks.spark.avro", mode = "overwrite")
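To confirm the write from within the same SparkR session, the avro output can be read back into a new DataFrame using the same spark-avro package. A quick sketch:

```r
# Read the avro output back via the com.databricks:spark-avro package
flights_avro <- read.df("flights.avro", source = "com.databricks.spark.avro")

# The schema and row count should match the original flights DataFrame
printSchema(flights_avro)
count(flights_avro)
```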
Check the output on Hadoop as shown here:
[cloudera@quickstart ~]$ hadoop fs -ls flights.avro | awk '{print $8}'
flights.avro/_SUCCESS
flights.avro/part-r-00000-1a6133bd-0039-4dfd-972f-f1ba6b1d0385.avro
flights.avro/part-r-00001-1a6133bd-0039-4dfd-972f-f1ba6b1d0385.avro
Check the job status in the YARN ResourceManager and the Spark UI while executing these commands in RStudio.