Using SparkR with RStudio

This section shows how to use SparkR from RStudio. RStudio is an integrated development environment (IDE) for R that improves developer productivity. It is open source, runs on Linux, macOS, and Windows, and has optional commercial support. It provides the familiar R console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging, and workspace management. It comes in a desktop edition and a server edition, so you can run it locally or centrally from a server. Let's use RStudio Server, which provides a web browser interface for doing analytics, using the following steps:

  1. Let's install RStudio Server using the following commands:
    wget https://download2.rstudio.org/rstudio-server-rhel-0.99.903-x86_64.rpm
    sudo yum -y install --nogpgcheck rstudio-server-rhel-0.99.903-x86_64.rpm
    
  2. Open the Firefox browser in the VM and go to http://localhost:8787/auth-sign-in. Log in to RStudio Server with the cloudera/cloudera credentials. Paste the following lines into the RStudio console and press Enter. This exercise uses the same flights dataset that we used earlier:
    .libPaths(c(.libPaths(), '/home/cloudera/spark-2.0.0-bin-hadoop2.7/R/lib'))
    
    Sys.setenv(SPARK_HOME = '/home/cloudera/spark-2.0.0-bin-hadoop2.7')
    
    Sys.setenv('SPARKR_SUBMIT_ARGS'='"--master" "yarn" "--packages" "com.databricks:spark-avro_2.11:3.0.0" "sparkr-shell"')
    
    Sys.setenv(PATH = paste(Sys.getenv(c('PATH')), '/home/cloudera/spark-2.0.0-bin-hadoop2.7/bin', sep=':'))
    
    library(SparkR)
    sparkR.session(appName = "RStudio Application")
    
    library(magrittr)
    
    flights <- read.df("flights.csv", source = "csv", header = "true")
    
    # Run a query to find the most frequent destinations from JFK airport
    
    jfk_dest <- filter(flights, flights$origin == "JFK") %>% 
      group_by(flights$dest) %>% 
      summarize(count = n(flights$dest))
    
    top_dests <- head(arrange(jfk_dest, desc(jfk_dest$count)))
    
    # Finally, create a bar plot of the top destinations.
    barplot(top_dests$count, names.arg = top_dests$dest, col = rainbow(7), main = "Top Flight Destinations from JFK", xlab = "Destination", ylab = "Count", beside = TRUE)
    

    A bar plot shows up in the plots area, as follows:

    Figure 10.9: RStudio screenshot

  3. Write the DataFrame to HDFS in the Avro format:
    write.df(flights, path = "flights.avro", source = "com.databricks.spark.avro", mode = "overwrite")
    

    Check the output on Hadoop as shown here:

    [cloudera@quickstart ~]$ hadoop fs -ls flights.avro | awk '{print $8}'
    flights.avro/_SUCCESS
    flights.avro/part-r-00000-1a6133bd-0039-4dfd-972f-f1ba6b1d0385.avro
    flights.avro/part-r-00001-1a6133bd-0039-4dfd-972f-f1ba6b1d0385.avro
    

    Check the job status in the YARN ResourceManager UI and the Spark UI while executing these commands in RStudio.
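
As an optional check on step 3, the Avro output can be read back into SparkR and compared with the original DataFrame. The following is only a minimal sketch. It assumes that the Spark session from step 2 is still active, that the `flights` DataFrame is still defined, and that the spark-avro package was loaded through `SPARKR_SUBMIT_ARGS` as shown earlier. The variable names `flights_avro`, `original_rows`, and `reloaded_rows` are introduced here for illustration only:

```r
library(SparkR)

# Assumption: the session created with sparkR.session() in step 2 is still
# running and the flights SparkDataFrame from that step is still available.

# Read the Avro files written in step 3 back into a SparkDataFrame.
flights_avro <- read.df("flights.avro", source = "com.databricks.spark.avro")

# Print the schema that was stored with the Avro files.
printSchema(flights_avro)

# Count the rows on both sides; each count() call triggers a Spark job.
original_rows <- count(flights)
reloaded_rows <- count(flights_avro)

# Stop with an error if the round trip lost or duplicated rows.
stopifnot(original_rows == reloaded_rows)
```

Each `count()` call runs as a separate job, so both should also show up in the Spark UI.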
