Moving to the cloud

As we are dealing with distributed systems, a development environment confined to a virtual machine on a single laptop offers limited scope for exploration and learning. We can move to the cloud in order to experience the power and scalability of the Spark distributed framework.

Deploying apps in Amazon Web Services

Once we are ready to scale our apps, we can migrate our development environment to Amazon Web Services (AWS).

How to run Spark on EC2 is described on the following page: https://spark.apache.org/docs/latest/ec2-scripts.html.

We emphasize five key steps in setting up the AWS Spark environment; a concrete worked example follows the list:

  1. Create an AWS EC2 key pair via the AWS console http://aws.amazon.com/console/.
  2. Export your AWS credentials (access key ID and secret access key) as environment variables:
    export AWS_ACCESS_KEY_ID=accesskeyid
    export AWS_SECRET_ACCESS_KEY=secretaccesskey
    
  3. Launch your cluster:
    ~$ cd $SPARK_HOME/ec2
    ec2$ ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>
    
  4. SSH into the cluster to run Spark jobs:
    ec2$ ./spark-ec2 -k <keypair> -i <key-file> login <cluster-name>
    
  5. Destroy your cluster after usage:
    ec2$ ./spark-ec2 destroy <cluster-name>
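
As a concrete illustration of steps 3 to 5, the following sketch launches, logs in to, and finally destroys a small cluster with three slave nodes. The key pair name (my-keypair), the key file path (~/.ssh/my-keypair.pem), and the cluster name (my-spark-cluster) are placeholders to be replaced with your own values:

~$ cd $SPARK_HOME/ec2
# launch a cluster with three slave nodes
ec2$ ./spark-ec2 -k my-keypair -i ~/.ssh/my-keypair.pem -s 3 launch my-spark-cluster
# SSH into the master node of the running cluster
ec2$ ./spark-ec2 -k my-keypair -i ~/.ssh/my-keypair.pem login my-spark-cluster
# destroy the cluster once the jobs are done to avoid unnecessary charges
ec2$ ./spark-ec2 destroy my-spark-cluster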
    

Virtualizing the environment with Docker

In order to create a portable Python and Spark environment that can be easily shared and cloned, the development environment can be built in Docker containers.

We wish to capitalize on Docker's two main functions:

  • Creating isolated containers that can be easily deployed on different operating systems or in the cloud.
  • Allowing easy sharing of the development environment image, with all its dependencies, through Docker Hub. Docker Hub is similar to GitHub: it allows easy cloning and version control, and the snapshot image of the configured environment can serve as the baseline for further enhancements (see the sketch after this list).
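
The following sketch illustrates the second point. It assumes a configured container named pyspark-dev and a Docker Hub account named mydockerid; both names, as well as the image tag pyspark-env:v1, are placeholders:

# snapshot the configured container as a reusable image
$ docker commit pyspark-dev mydockerid/pyspark-env:v1
# authenticate against Docker Hub and publish the image
$ docker login
$ docker push mydockerid/pyspark-env:v1
# anyone can now clone the exact same environment
$ docker pull mydockerid/pyspark-env:v1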

The following diagram illustrates a Docker-enabled environment with Spark, Anaconda, and a database server, each with its respective data volume.

[Figure: A Docker-enabled environment with Spark, Anaconda, and database server containers and their respective data volumes]
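
As a minimal sketch of how such a layout could be wired together, separate containers can be started with their own data volumes. The image name anaconda-pyspark, the container names, and the host paths are illustrative assumptions rather than the book's exact setup:

# hypothetical database container with a dedicated data volume on the host
$ docker run -d --name dbserver -e POSTGRES_PASSWORD=example -v /data/db:/var/lib/postgresql/data postgres
# hypothetical Anaconda/PySpark container with a volume for notebooks and datasets
$ docker run -i -t --name pyspark-dev -v /data/notebooks:/home/work anaconda-pyspark /bin/bash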

Docker offers the ability to clone and deploy an environment from the Dockerfile.

You can find an example Dockerfile with a PySpark and Anaconda setup at the following address: https://hub.docker.com/r/thisgokeboysef/pyspark-docker/~/dockerfile/.
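
As a minimal sketch, assuming such a Dockerfile has been saved in the current directory, an image can be built and tagged locally; the tag my-pyspark-env is an arbitrary placeholder:

# build an image from the Dockerfile in the current directory
$ docker build -t my-pyspark-env .
# list local images to confirm the build succeeded
$ docker images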

Install Docker as per the instructions provided in the official Docker documentation at https://docs.docker.com/.

Pull the Docker image built from the Dockerfile referenced earlier with the following command:

$ docker pull thisgokeboysef/pyspark-docker
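
Once the image is available locally, a container can be started from it. The exact entry point and exposed ports depend on how the image was built, so the interactive shell below is only an assumption for exploring the environment:

# start an interactive shell inside the pulled image
$ docker run -i -t thisgokeboysef/pyspark-docker /bin/bash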

Lab41 is another great source of information on how to Dockerize your environment. Its GitHub repository contains the necessary code:

https://github.com/Lab41/ipython-spark-docker

The supporting blog post gives a rich account of the thought process involved in building the Docker environment: http://lab41.github.io/blog/2015/04/13/ipython-on-spark-on-docker/.
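
To experiment with Lab41's setup locally, the repository can be cloned in the usual way:

# clone the Lab41 ipython-spark-docker repository
$ git clone https://github.com/Lab41/ipython-spark-docker.git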
