To create a portable Python and Spark environment that can be easily shared and cloned, the development environment can be built with a Vagrantfile.
We will point to the Massive Open Online Courses (MOOCs) delivered by UC Berkeley and Databricks.
The course labs were executed on IPython Notebooks powered by PySpark. They can be found in the following GitHub repository: https://github.com/spark-mooc/mooc-setup/.
Once you have set up Vagrant on your machine, follow these instructions to get started: https://docs.vagrantup.com/v2/getting-started/index.html.
Clone the spark-mooc/mooc-setup/ GitHub repository into your work directory and run the command $ vagrant up from within the cloned directory.
Be aware that the version of Spark may be outdated, as the Vagrantfile may not be up to date.
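The clone-and-launch step can also be scripted. The following is a minimal sketch, not part of the course material: the helper names `setup_commands` and `run_setup` and the `mooc-setup` target folder are hypothetical, and it assumes that `git` and `vagrant` are both available on your PATH. By default it only prints the commands it would run.

```python
import subprocess
from pathlib import Path

# Repository referenced in the text; the clone folder name below is an assumption.
REPO_URL = "https://github.com/spark-mooc/mooc-setup"


def setup_commands(work_dir, repo_url=REPO_URL):
    """Return (command, cwd) pairs: clone the repository, then run `vagrant up` inside it."""
    work_dir = Path(work_dir)
    clone_dir = work_dir / "mooc-setup"  # hypothetical target folder for the clone
    return [
        (["git", "clone", repo_url, str(clone_dir)], work_dir),
        (["vagrant", "up"], clone_dir),
    ]


def run_setup(work_dir, dry_run=True):
    """Print the commands (dry run) or execute them one after the other."""
    for command, cwd in setup_commands(work_dir):
        if dry_run:
            print(f"[dry run] (in {cwd}) {' '.join(command)}")
        else:
            # check=True stops at the first failing command
            subprocess.run(command, cwd=cwd, check=True)


if __name__ == "__main__":
    # Nothing is cloned or booted unless dry_run is set to False.
    run_setup(Path.cwd(), dry_run=True)
```

Calling `run_setup(Path.cwd(), dry_run=False)` would perform the actual clone and boot the virtual machine, which can take several minutes on the first run.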
You will see an output similar to this:
C:\Programs\spark\edx1001\mooc-setup-master>vagrant up
Bringing machine 'sparkvm' up with 'virtualbox' provider...
==> sparkvm: Checking if box 'sparkmooc/base' is up to date...
==> sparkvm: Clearing any previously set forwarded ports...
==> sparkvm: Clearing any previously set network interfaces...
==> sparkvm: Preparing network interfaces based on configuration...
    sparkvm: Adapter 1: nat
==> sparkvm: Forwarding ports...
    sparkvm: 8001 => 8001 (adapter 1)
    sparkvm: 4040 => 4040 (adapter 1)
    sparkvm: 22 => 2222 (adapter 1)
==> sparkvm: Booting VM...
==> sparkvm: Waiting for machine to boot. This may take a few minutes...
    sparkvm: SSH address: 127.0.0.1:2222
    sparkvm: SSH username: vagrant
    sparkvm: SSH auth method: private key
    sparkvm: Warning: Connection timeout. Retrying...
    sparkvm: Warning: Remote connection disconnect. Retrying...
==> sparkvm: Machine booted and ready!
==> sparkvm: Checking for guest additions in VM...
==> sparkvm: Setting hostname...
==> sparkvm: Mounting shared folders...
    sparkvm: /vagrant => C:/Programs/spark/edx1001/mooc-setup-master
==> sparkvm: Machine already provisioned. Run `vagrant provision` or use the `--provision`
==> sparkvm: to force provisioning. Provisioners marked to run always will still run.

C:\Programs\spark\edx1001\mooc-setup-master>
This will launch the IPython Notebooks powered by PySpark on localhost:8001.
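To confirm from the host that the notebook server is reachable before opening a browser, a small port check can help. This is a rough sketch under stated assumptions: the helper `port_is_open` is hypothetical, the ports 8001 and 4040 are taken from the forwarding lines in the output above, and labeling 4040 as the Spark UI assumes Spark's default web UI port, which typically only answers while a Spark application is running.

```python
import socket

# Ports forwarded by the VM according to the `vagrant up` output.
# The labels are assumptions, not something the Vagrant output states.
FORWARDED_PORTS = {
    "IPython Notebook": 8001,
    "Spark UI (only while a Spark application runs)": 4040,
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


if __name__ == "__main__":
    for label, port in FORWARDED_PORTS.items():
        status = "reachable" if port_is_open("localhost", port) else "not reachable"
        print(f"{label} on localhost:{port}: {status}")
```

If port 8001 is reported as not reachable shortly after `vagrant up`, the notebook server may still be starting; retrying after a few seconds is usually enough.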