As we are dealing with distributed systems, an environment running on a virtual machine on a single laptop is limited to exploration and learning. To experience the real power and scalability of the Spark distributed framework, we can move to the cloud. Once we are ready to scale our apps, we can migrate our development environment to Amazon Web Services (AWS).
How to run Spark on EC2 is clearly described on the following page: https://spark.apache.org/docs/latest/ec2-scripts.html.
We emphasize five key steps in setting up the AWS Spark environment:

1. Create an AWS EC2 key pair via the AWS console.
2. Export your AWS credentials to your environment:

export AWS_ACCESS_KEY_ID=accesskeyid
export AWS_SECRET_ACCESS_KEY=secretaccesskey

3. Launch the cluster:

~$ cd $SPARK_HOME/ec2
ec2$ ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>

4. SSH into the cluster to run Spark jobs:

ec2$ ./spark-ec2 -k <keypair> -i <key-file> login <cluster-name>

5. Destroy the cluster after use:

ec2$ ./spark-ec2 destroy <cluster-name>
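As a concrete illustration, the following sketch walks through the same cycle with hypothetical values: the key pair name, key file path, region, instance type, number of slaves, and cluster name are all placeholders to be replaced with your own, and the exact spark-ec2 options may vary slightly between Spark releases.

~$ cd $SPARK_HOME/ec2
# Launch a small cluster of two slaves in a chosen region and instance type
ec2$ ./spark-ec2 -k my-keypair -i ~/.ssh/my-keypair.pem -s 2 --region=us-east-1 --instance-type=m3.medium launch my-spark-cluster
# SSH into the master node to run Spark jobs interactively
ec2$ ./spark-ec2 -k my-keypair -i ~/.ssh/my-keypair.pem login my-spark-cluster
# Tear the cluster down when finished to avoid unnecessary charges
ec2$ ./spark-ec2 destroy my-spark-cluster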
In order to create a portable Python and Spark environment that can be easily shared and cloned, we can build the development environment in Docker containers.
We wish to capitalize on Docker's two main strengths: creating isolated containers that can be deployed unchanged across operating systems or in the cloud, and sharing the complete development environment image, with all its dependencies, via Docker Hub.
The following diagram illustrates a Docker-enabled environment with Spark, Anaconda, and a database server, together with their respective data volumes.
Docker offers the ability to clone and deploy an environment from a Dockerfile.
You can find an example Dockerfile with a PySpark and Anaconda setup at the following address: https://hub.docker.com/r/thisgokeboysef/pyspark-docker/~/dockerfile/.
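Once Docker is installed (see below), such a Dockerfile can also be used to rebuild the image locally rather than pulling a prebuilt one. The following is a minimal sketch, assuming the Dockerfile has been saved to the current directory and using a hypothetical image tag:

$ docker build -t my-pyspark-env .
# Confirm that the my-pyspark-env image is now available locally
$ docker images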
Install Docker as per the instructions provided in the official Docker documentation for your operating system (https://docs.docker.com/).
Pull the prebuilt Docker image corresponding to the Dockerfile referenced earlier with the following command:
$ docker pull thisgokeboysef/pyspark-docker
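Once the image has been pulled, a container can be started from it. The following sketch shows one possible way of doing so, mounting a local data directory as a volume so that work persists outside the container; the mount path and the command run inside the container are assumptions, as the actual entrypoint and layout depend on how the image was built.

# Start an interactive container, mounting ./data on the host at /data inside the container
$ docker run -it -v $(pwd)/data:/data thisgokeboysef/pyspark-docker /bin/bash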
Lab41 is another great source of information on how to dockerize your environment. Their GitHub repository contains the necessary code:
https://github.com/Lab41/ipython-spark-docker
The supporting blog post is rich in information on the thought processes involved in building the Docker environment: http://lab41.github.io/blog/2015/04/13/ipython-on-spark-on-docker/.