Setting up your local Spark instance

Installing Apache Spark from scratch is not an easy task. It is usually done on a cluster of computers, often hosted in the cloud, and it is delegated to specialists in the technology (namely, data engineers). This could be a limitation, because you may then not have access to an environment in which to test what you will learn in this chapter.

However, in order to test the contents of this chapter, you do not need an overly complex installation. By using Docker (https://www.docker.com/), you can access an installation of Spark, together with a Jupyter notebook and PySpark, running on a Linux container on your own computer (it does not matter whether it is a Linux, macOS, or Windows machine).

Actually, that is mainly possible because of Docker. Docker provides operating-system-level virtualization, also known as containerization. Containerization means that a computer can run multiple, isolated filesystem instances, each separated from the others (though sharing the same hardware resources) as if it were a single computer in itself. Basically, any piece of software running in Docker is wrapped in a complete, stable, previously defined filesystem that is totally independent of the filesystem you are running Docker from. Using a Docker container means that your code will run exactly as expected (and as presented in this chapter). This consistency of execution is the main reason why Docker is the best way to put your solutions into production: you just need to move the container you used onto a server and create an API to access your solution (a topic we previously discussed in Chapter 5, Visualization, Insights, and Results, where we presented the Bottle package).

Here are the steps you need to take:

  1. First, install the Docker software suitable for your system. You can find everything you need at the following addresses, depending on your operating system:
Windows: https://docs.docker.com/docker-for-windows/
Linux: https://docs.docker.com/engine/getstarted/
macOS: https://docs.docker.com/docker-for-mac/

The installation is straightforward; in any case, you can find any further information you need on the same pages you download the software from.

  2. After completing the installation, use the Docker image that can be found at https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook. It contains a complete installation of Spark, accessible through a Jupyter notebook, plus a Miniconda installation with recent versions of Python 2 and 3. You can find out more about the image's contents here: http://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html#jupyter-pyspark-notebook.
  3. At this point, open the Docker interface; a shell will appear with the ASCII art of a whale and an IP address. Take note of the IP address (in our case, it was 192.168.99.100). Now, run the following command in the shell:
$> docker run -d -p 8888:8888 --name spark jupyter/pyspark-notebook start-notebook.sh --NotebookApp.token=''
  4. If you prefer security over ease of use, type this instead:
$> docker run -d -p 8888:8888 --name spark jupyter/pyspark-notebook start-notebook.sh --NotebookApp.token='mypassword'

Replace the mypassword placeholder with the password of your choice. Note that the Jupyter notebook will then ask for that password when you open it.

  5. The preceding command has Docker download the pyspark-notebook image (it could take a while), assign it the name spark, map port 8888 of the Docker image to port 8888 of your machine, execute the start-notebook.sh script, and set the notebook token to empty (which allows you to access Jupyter immediately using the previously noted IP address and port 8888).

At this point, the only other thing you need to do is type this into your browser:

http://192.168.99.100:8888/

That is, enter the IP address Docker gave you when it started, a colon, and then the port number, 8888. Jupyter should appear immediately.

  6. As a simple test, you could immediately open a new notebook and run the following (a further check appears right after this snippet):
In: import pyspark
    sc = pyspark.SparkContext('local[*]')

    # do something to prove it works
    rdd = sc.parallelize(range(1000))
    rdd.takeSample(False, 5)
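
If the sample comes back without errors, Spark is up and running. As a further check, here is a minimal sketch (our own addition; the numbers and variable names are purely illustrative) that runs a small map/reduce job on the same SparkContext and reports how many local cores the local[*] master picked up:

In: # sum of the squares of 0..999; the result should be 332833500
    squares = sc.parallelize(range(1000)).map(lambda x: x * x)
    print(squares.reduce(lambda a, b: a + b))
    # number of cores Spark is using with the local[*] master
    print(sc.defaultParallelism)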
  7. It is also important to know that there are commands to stop the Docker container and commands that will even destroy it. This shell command will stop it:
$> docker stop spark

To destroy the container after it has been stopped, use the following command (be aware that you will lose all the work stored inside the container):

$> docker rm spark

If the container has not been destroyed, you can run it again after it has been stopped with this shell command:

$> docker start spark

Additionally, you should know that inside the container you operate in the /home/jovyan directory, and you can list its contents directly from the Docker shell:

$> docker exec -t -i spark ls /home/jovyan

You can also execute any other Linux bash command.

Notably, you can also copy data to and from the container (otherwise, your work remains confined inside the container's filesystem). Suppose you have to copy a file (file.txt) from a directory on your Windows desktop to the container:

$> docker cp c:/Users/Luca/Desktop/spark_stuff/file.txt spark:/home/jovyan/file.txt

The opposite is also possible:

$> docker cp spark:/home/jovyan/test.ipynb c:/Users/Luca/Desktop/spark_stuff/test.ipynb
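
Once file.txt has been copied into the container as shown previously, you can read it from the notebook using the SparkContext sc created in the earlier test. The following is a minimal sketch (our own addition; it assumes file.txt is a plain-text file):

In: # read the copied file from /home/jovyan, the notebook's working directory
    lines = sc.textFile('file.txt')
    # count its lines as a quick sanity check
    print(lines.count())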

That's really all there is to it; in just a few steps, you have a locally operating Spark environment to run all your experiments on (clearly, it will use only one node, and it will be limited to the power of a single CPU).
