Computing clusters on the cloud

In order to process large datasets using Hadoop and associated R packages, one needs a cluster of computers. In today's world, it is easy to set one up using cloud computing services offered by Amazon, Microsoft, and others. One pays only for the amount of CPU and storage used; there is no need for an upfront investment in infrastructure. The four leading cloud computing services are AWS from Amazon, Azure from Microsoft, Compute Engine from Google, and Bluemix from IBM. In this section, we will discuss running R programs on AWS. In particular, you will learn how to create an AWS instance; install R, RStudio, and other packages on that instance; and develop and run machine learning models.

Amazon Web Services

Popularly known as AWS, Amazon Web Services started as an internal project at Amazon in 2002 to meet the dynamic computing requirements of its e-commerce business. This grew into an infrastructure-as-a-service offering, and in 2006 Amazon launched two services to the world: Simple Storage Service (S3) and Elastic Compute Cloud (EC2). From there, AWS grew at an incredible pace. Today, it offers more than 40 different types of services running on millions of servers.

Creating and running computing instances on AWS

The best place to learn how to set up an AWS account and start using EC2 is the freely available e-book from the Amazon Kindle store, Amazon Elastic Compute Cloud (EC2) User Guide (reference 6 in the References section of this chapter).

Here, we only summarize the essential steps involved in the process:

  1. Create an AWS account.
  2. Sign in to the AWS management console (https://aws.amazon.com/console/).
  3. Click on the EC2 service.
  4. Choose an Amazon Machine Image (AMI).
  5. Choose an instance type.
  6. Create a public-private key-pair.
  7. Configure instance.
  8. Add storage.
  9. Tag instance.
  10. Configure a security group (policy specifying who can access the instance).
  11. Review and launch the instance.

Log in to your instance using SSH (from Linux/Ubuntu), PuTTY (from Windows), or a browser, using the private key downloaded at the time of configuring security and the IP address given at the time of launching. Here, we assume that the instance you have launched is a Linux instance.
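The SSH login just described can be sketched as follows; the key file name and IP address below are placeholders, to be replaced with the values from your own launch:

```shell
# Hypothetical key file and IP address; substitute the values from your
# own launch (the default user is "ubuntu" on Ubuntu AMIs).
KEY_FILE="$HOME/.ssh/my-aws-keypair.pem"
INSTANCE_IP="203.0.113.25"   # example address, not a real instance

# Note: the key file must have restricted permissions (chmod 400),
# otherwise ssh refuses to use it.
SSH_CMD="ssh -i $KEY_FILE ubuntu@$INSTANCE_IP"
echo "$SSH_CMD"
```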

Installing R and RStudio

To install R and RStudio, you need to be an authenticated user. So, create a new user and grant that user administrative (sudo) privileges. After that, execute the following steps from the Ubuntu shell:

  1. Edit the /etc/apt/sources.list file.
  2. Add the following line at the end:
    deb http://cran.rstudio.com/bin/linux/ubuntu trusty/
  3. Get the key for the repository:
    sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 51716619E084DAB9
    
  4. Update the package list:
    sudo apt-get update
    
  5. Install the latest version of R:
    sudo apt-get install r-base-core
    
  6. Install gdebi to install Debian packages from the local disk:
    sudo apt-get install gdebi-core
    
  7. Download the RStudio package:
    wget http://download2.rstudio.org/rstudio-server-0.99.446-amd64.deb
    
  8. Install RStudio:
    sudo gdebi rstudio-server-0.99.446-amd64.deb
    
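Steps 1 to 8 above can be collected into a single shell script, sketched below. The repository line, key ID, and RStudio Server version are the ones from the steps above; since actually running the script requires sudo and network access on the instance, here we only write it out and verify its syntax:

```shell
# Write the install steps to a script file; running it requires sudo and
# network access on the EC2 instance, so here we only check the syntax.
cat > install_rstudio.sh <<'EOF'
#!/bin/bash
set -e

# Add the CRAN repository for Ubuntu 14.04 (trusty) and its signing key
echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" | \
    sudo tee -a /etc/apt/sources.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 51716619E084DAB9

# Install R and gdebi, then RStudio Server from the downloaded package
sudo apt-get update
sudo apt-get install -y r-base-core gdebi-core
wget http://download2.rstudio.org/rstudio-server-0.99.446-amd64.deb
sudo gdebi --non-interactive rstudio-server-0.99.446-amd64.deb
EOF

bash -n install_rstudio.sh && echo "syntax ok"
```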

Once the installation has completed successfully, RStudio running on your AWS instance can be accessed from a browser. For this, open a browser and enter the URL <your.aws.ip.no>:8787.
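RStudio Server listens on port 8787, and a new security group does not allow inbound traffic on that port by default; if it is blocked, the browser will not connect. If you have the AWS CLI installed and configured, a rule can be added as sketched below; the security group ID is a placeholder, and the command is printed rather than executed here since it needs valid AWS credentials:

```shell
# Hypothetical security group ID; find yours in the EC2 console.
SG_ID="sg-0123456789abcdef0"

# Allow inbound TCP on 8787 (RStudio Server's default port). Printed
# rather than executed here, since it needs valid AWS credentials.
CMD="aws ec2 authorize-security-group-ingress \
  --group-id $SG_ID --protocol tcp --port 8787 --cidr 0.0.0.0/0"
echo "$CMD"
```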

If you are able to access RStudio running on the AWS instance, you can then install other packages such as rhdfs, rmr2, and more from RStudio, build any machine learning models in R, and run them on the AWS cloud.
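If you prefer the shell to the RStudio console, packages can also be installed with Rscript. Note that rhdfs and rmr2 are distributed through the RHadoop project on GitHub rather than CRAN, so they are installed from downloaded tarballs; the tarball filename below is only an example. The commands are printed rather than run here, since they need R and network access on the instance:

```shell
# A CRAN package (caret, as an example) installs directly from the
# repository; an RHadoop package such as rmr2 is installed from a
# locally downloaded source tarball (example filename below).
CRAN_CMD="sudo Rscript -e 'install.packages(\"caret\", repos=\"http://cran.rstudio.com\")'"
LOCAL_CMD="sudo Rscript -e 'install.packages(\"rmr2_3.3.1.tar.gz\", repos=NULL, type=\"source\")'"
echo "$CRAN_CMD"
echo "$LOCAL_CMD"
```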

Apart from R and RStudio, AWS also supports Spark (and hence SparkR). In the following section, you will learn how to run Spark on an EC2 cluster.

Running Spark on EC2

You can launch and manage Spark clusters on Amazon EC2 using the spark-ec2 script located in the ec2 directory of your local Spark installation. To launch a Spark cluster on EC2, use the following steps:

  1. Go to the ec2 directory in the Spark folder in your local machine.
  2. Run the following command:
    ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>
    

    Here, <keypair> is the name of the key pair you used when launching the EC2 service, as mentioned in the Creating and running computing instances on AWS section of this chapter. <key-file> is the path on your local machine where the private key has been downloaded and kept. The number of worker nodes is specified by <num-slaves>.

  3. To run your programs in the cluster, first SSH into the cluster using the following command:
    ./spark-ec2 -k <keypair> -i <key-file> login <cluster-name>
    

    After logging in to the cluster, you can use Spark just as you would on your local machine.

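As a concrete sketch of the spark-ec2 workflow, with a hypothetical keypair name, key file, and cluster name, the full launch/login/destroy cycle looks as follows; the commands are only printed here, since actually running them needs AWS credentials and incurs charges:

```shell
# Placeholder keypair, key file, and cluster name; substitute your own.
KEYPAIR="my-keypair"
KEY_FILE="$HOME/.ssh/my-keypair.pem"
CLUSTER="my-spark-cluster"

# Launch a cluster with 2 worker nodes, log in to it, and (when done)
# destroy it so that you stop incurring charges.
LAUNCH_CMD="./spark-ec2 -k $KEYPAIR -i $KEY_FILE -s 2 launch $CLUSTER"
LOGIN_CMD="./spark-ec2 -k $KEYPAIR -i $KEY_FILE login $CLUSTER"
DESTROY_CMD="./spark-ec2 destroy $CLUSTER"
printf '%s\n' "$LAUNCH_CMD" "$LOGIN_CMD" "$DESTROY_CMD"
```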
More details on how to use Spark on EC2 can be found in the Spark documentation and AWS documentation (references 5, 6, and 7 in the References section of the chapter).

Microsoft Azure

Microsoft Azure has full support for R and Spark. Microsoft acquired Revolution Analytics, a company that built and supported an enterprise version of R. Apart from this, Azure has a machine learning service that provides APIs for some Bayesian machine learning models as well. A nice video tutorial on how to launch instances on Azure and how to use its machine learning as a service can be found on the Microsoft Virtual Academy website (reference 8 in the References section of this chapter).

IBM Bluemix

Bluemix has full support for R through the complete set of R libraries available on its instances. IBM also has the integration of Spark into its cloud services on its roadmap. More details can be found on their documentation page (reference 9 in the References section of this chapter).
