Processing large datasets with Hadoop and the associated R packages requires a cluster of computers. Today, such a cluster is easy to obtain through the cloud computing services offered by Amazon, Microsoft, and others. You pay only for the CPU time and storage you use, with no upfront investment in infrastructure. The top four cloud computing services are AWS by Amazon, Azure by Microsoft, Compute Engine by Google, and Bluemix by IBM. In this section, we discuss running R programs on AWS. In particular, you will learn how to create an AWS instance; install R, RStudio, and other packages on that instance; and develop and run machine learning models.
Amazon Web Services, popularly known as AWS, started in 2002 as an internal Amazon project to meet the dynamic computing requirements of the company's e-commerce business. It grew into an infrastructure-as-a-service offering, and in 2006 Amazon launched two services to the world: Simple Storage Service (S3) and Elastic Compute Cloud (EC2). From there, AWS grew at an incredible pace. Today, it offers more than 40 different types of services running on millions of servers.
The best place to learn how to set up an AWS account and start using EC2 is the Amazon Elastic Compute Cloud (EC2) User Guide, an e-book freely available from the Amazon Kindle store (reference 6 in the References section of this chapter).
Here, we only summarize the essential steps involved in the process:
Log in to your instance using SSH (from Linux/Ubuntu), PuTTY (from Windows), or a browser, with the private key provided when you configured security and the IP address given when you launched the instance. Here, we assume that the instance you launched is a Linux instance.
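As a minimal sketch of the SSH login from a Linux shell: the key path, the IP address, and the ubuntu user name below are placeholders chosen for illustration, not values from this chapter. The ubuntu user is the usual default on Ubuntu images; other images use other names. The helper ssh_command is hypothetical and only prints the command line, so nothing connects until you run the printed command yourself.

```shell
#!/bin/sh
# Hypothetical placeholder values; substitute your own.
KEY_FILE="${KEY_FILE:-$HOME/.ssh/my-aws-key.pem}"   # private key downloaded from AWS
INSTANCE_IP="${INSTANCE_IP:-203.0.113.10}"           # public IP of the instance (documentation-range address)
LOGIN_USER="${LOGIN_USER:-ubuntu}"                   # assumed default user of an Ubuntu image

# ssh_command KEY USER HOST: print the ssh command line for these values.
ssh_command() {
    printf 'ssh -i %s %s@%s\n' "$1" "$2" "$3"
}

# ssh refuses private keys that other users can read, so before connecting
# restrict the permissions once:  chmod 400 "$KEY_FILE"
ssh_command "$KEY_FILE" "$LOGIN_USER" "$INSTANCE_IP"
```

Running the printed command opens a shell on the instance, from which the installation steps can be executed.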
To install R and RStudio, you need to be an authenticated user. So, create a new user and give that user administrative (sudo) privileges. After that, execute the following steps from the Ubuntu shell:
Edit the /etc/apt/sources.list file and add the following line at the end:
deb http://cran.rstudio.com/bin/linux/ubuntu trusty/
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 51716619E084DAB9
sudo apt-get update
sudo apt-get install r-base-core
sudo apt-get install gdebi-core
wget http://download2.rstudio.org/rstudio-server-0.99.446-amd64.deb
sudo gdebi rstudio-server-0.99.446-amd64.deb
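If the setup is rerun, the repository line should not be appended to the sources file a second time. The following is a minimal sketch of an idempotent append. The helper name add_repo_line is hypothetical, and the demonstration writes to a temporary file rather than to the real /etc/apt/sources.list; on an actual instance, the function would have to be run with root privileges against that file.

```shell
#!/bin/sh
# add_repo_line FILE LINE: append LINE to FILE unless an identical
# line is already present. (Hypothetical helper, not part of apt.)
add_repo_line() {
    file=$1
    line=$2
    # -q: quiet, -x: match the whole line, -F: fixed string, not a regex
    if ! grep -qxF -e "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

CRAN_LINE="deb http://cran.rstudio.com/bin/linux/ubuntu trusty/"

# Demonstrate on a scratch file instead of the system file.
scratch=$(mktemp)
add_repo_line "$scratch" "$CRAN_LINE"
add_repo_line "$scratch" "$CRAN_LINE"   # second call finds the line and does nothing
cat "$scratch"                          # the line appears once
rm -f "$scratch"
```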
Once the installation has completed successfully, RStudio running on your AWS instance can be accessed from a browser. To do so, open a browser and enter the URL <your.aws.ip.no>:8787.
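If the page does not load, two things are worth checking: that the security group of the instance allows inbound TCP traffic on port 8787, and that the server process is running on the instance. The sketch below builds the URL and shows how it could be probed with curl. The IP address is a placeholder, and the helper names rstudio_url and rstudio_reachable are hypothetical.

```shell
#!/bin/sh
# rstudio_url HOST [PORT]: print the RStudio Server URL.
# 8787 is the default port of RStudio Server.
rstudio_url() {
    printf 'http://%s:%s\n' "$1" "${2:-8787}"
}

# rstudio_reachable URL: succeed if anything answers HTTP at URL
# within 5 seconds.
rstudio_reachable() {
    curl --silent --head --max-time 5 --output /dev/null "$1"
}

# Placeholder address from the documentation range; use your instance's IP.
URL=$(rstudio_url "${INSTANCE_IP:-203.0.113.10}")
echo "RStudio Server URL: $URL"

# Uncomment to probe from your workstation:
# rstudio_reachable "$URL" && echo "reachable" || echo "not reachable"
```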
Once you can use RStudio on the AWS instance, you can install other packages, such as rhdfs and rmr2, from RStudio, build machine learning models in R, and run them on the AWS cloud.
Apart from R and RStudio, AWS also supports Spark (and hence SparkR). In the following section, you will learn how to run Spark on an EC2 cluster.
You can launch and manage Spark clusters on Amazon EC2 using the spark-ec2 script located in the ec2 directory of Spark on your local machine. To launch a Spark cluster on EC2, use the following steps:
Go to the ec2 directory in the Spark folder on your local machine.
Run the following command:
./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>
Here, <keypair> is the name of the key pair you used for launching the EC2 service mentioned in the Creating and running computing instances on AWS section of this chapter. The <key-file> is the path on your local machine where the private key has been downloaded and kept. The number of worker nodes is specified by <num-slaves>.
To log in to the cluster, run the following command:
./spark-ec2 -k <keypair> -i <key-file> login <cluster-name>
After logging in to the cluster, you can use Spark just as you do on the local machine.
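The launch and login commands above can be collected into a small wrapper, to be run from the ec2 directory. This is a sketch under assumptions: the key pair name, key path, worker count, and cluster name are placeholders, and the run helper is hypothetical. It prints each command by default and executes it only when DRY_RUN=0. The spark-ec2 script reads AWS credentials from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, and its destroy action terminates the cluster so that it stops accruing charges.

```shell
#!/bin/sh
# Placeholder values (assumptions for illustration only).
KEYPAIR="${KEYPAIR:-my-keypair}"
KEY_FILE="${KEY_FILE:-$HOME/.ssh/my-keypair.pem}"
NUM_SLAVES="${NUM_SLAVES:-2}"
CLUSTER="${CLUSTER:-my-spark-cluster}"

# run CMD...: print the command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

launch_cluster()  { run ./spark-ec2 -k "$KEYPAIR" -i "$KEY_FILE" -s "$NUM_SLAVES" launch "$CLUSTER"; }
login_cluster()   { run ./spark-ec2 -k "$KEYPAIR" -i "$KEY_FILE" login "$CLUSTER"; }
destroy_cluster() { run ./spark-ec2 destroy "$CLUSTER"; }

# Before a real run (DRY_RUN=0), export your credentials:
#   export AWS_ACCESS_KEY_ID=<your-access-key-id>
#   export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>

launch_cluster
login_cluster
```

Calling destroy_cluster once the work is finished removes the cluster. Keeping the dry run as the default makes it harder to start billable instances by accident.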
More details on how to use Spark on EC2 can be found in the Spark documentation and AWS documentation (references 5, 6, and 7 in the References section of the chapter).
Microsoft Azure has full support for R and Spark. Microsoft acquired Revolution Analytics, a company that built and supported an enterprise version of R. Apart from this, Azure has a machine learning service with APIs for some Bayesian machine learning models as well. A good video tutorial on how to launch instances on Azure and how to use its machine learning as a service can be found on the Microsoft Virtual Academy website (reference 8 in the References section of the chapter).