Deploying Spark, again

We choose a host where we want to run the Spark standalone master, say aws-105, and tag it as such:

$ docker node update --label-add type=sparkmaster aws-105

Other nodes will host our Spark workers.
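To double-check that the label was applied, we can inspect the node; for instance (one way of extracting the labels with a Go template):

$ docker node inspect --format '{{ .Spec.Labels }}' aws-105
map[type:sparkmaster]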

We start the Spark master on aws-105:

$ docker service create \
--container-label spark-master \
--network spark \
--constraint 'node.labels.type == sparkmaster' \
--publish 8080:8080 \
--publish 7077:7077 \
--publish 6066:6066 \
--name spark-master \
--replicas 1 \
--env SPARK_MASTER_IP=0.0.0.0 \
--mount type=volume,target=/data,source=spark,volume-driver=flocker \
fsoppelsa/spark-master

First, the image. I discovered that the Google images include some annoying quirks (such as unsetting some environment variables, which makes configuring them from outside with --env switches impossible). Thus, I created my own pair of Spark 1.6.2 master and worker images.

Then, --network. Here we tell the container to attach to the user-defined overlay network called spark.
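If that network were not already in place from the earlier setup, it could be created along these lines (a reminder, assuming the default overlay driver):

$ docker network create --driver overlay spark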

Finally, storage: --mount, which works with Docker volumes. We specify it to:

  • Work with a volume: type=volume
  • Mount the volume inside the container on /data: target=/data
  • Use the spark volume that we created previously: source=spark
  • Use Flocker as the volume driver: volume-driver=flocker

When you create a service and mount a certain volume, if the volume does not exist, it gets created.
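If we preferred to pre-create the spark volume ourselves, a sketch of the command would be (using the --name syntax of Docker 1.12-era clients):

$ docker volume create --driver flocker --name spark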

Note

The current releases of Flocker only support a replica factor of 1, the reason being that iSCSI/block-level mounts cannot be attached across multiple nodes. So, only one service can use a volume at any given point in time. This makes Flocker most useful for storing and moving database data (which is what it's especially used for). But here we'll use it to show a tiny example with persistent data in /data in the Spark master container.

So, with this configuration in place, let's add the workhorses: three Spark workers:

$ docker service create \
--constraint 'node.labels.type != sparkmaster' \
--network spark \
--name spark-worker \
--replicas 3 \
--env SPARK_MASTER_IP=10.0.0.3 \
--env SPARK_WORKER_CORES=1 \
--env SPARK_WORKER_MEMORY=1g \
fsoppelsa/spark-worker

Here, we pass some environment variables into the containers to limit resource usage to 1 core and 1 GB of memory per container.
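To verify that the three workers started and were scheduled away from the master node, we can list the service tasks:

$ docker service ps spark-worker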

After a few minutes, the system is up. We connect to aws-105 on port 8080 and see this page:

(Screenshot: the Spark master web UI)

Testing Spark

So, we access the Spark shell and run a Spark task to check if things are up and running.

We prepare a container with some Spark utilities, for example, fsoppelsa/spark-worker, and run it to compute the value of Pi using the run-example Spark binary:

$ docker run -ti fsoppelsa/spark-worker /spark/bin/run-example SparkPi

After a ton of output messages, Spark finishes the computation giving us:

...
Pi is roughly 3.14916
...
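Note that run-example, when given no master, computes Pi in local mode inside the container. To submit the example explicitly to our Swarm-deployed master, one could point it at the published master port, along these lines (a sketch; in Spark 1.x the run-example script picks up the cluster URL from the MASTER environment variable):

$ docker run -ti --env MASTER=spark://<aws-105-IP>:7077 \
fsoppelsa/spark-worker /spark/bin/run-example SparkPi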

If we go back to the Spark UI, we can see that our amazing Pi application was successfully completed.

(Screenshot: the completed SparkPi application in the Spark UI)

More interesting is running an interactive Scala shell connecting to the master to execute Spark jobs:

$ docker run -ti fsoppelsa/spark-worker \
/spark/bin/spark-shell --master spark://<aws-105-IP>:7077

(Screenshot: the Spark shell connected to the master)
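From the shell, any small job confirms that the cluster executes it; for example (an illustrative snippet typed at the scala> prompt):

scala> sc.parallelize(1 to 1000).sum()
res0: Double = 500500.0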

Using Flocker storage

Purely for the purpose of this tutorial, we now run an example that uses the spark volume we created previously to read and write some persistent data from Spark.

In order to do that, and because of Flocker's replica-factor limitation, we kill the current set of three workers and create a set of only one, mounting spark:

$ docker service rm spark-worker
$ docker service create \
--constraint 'node.labels.type == sparkmaster' \
--network spark \
--name spark-worker \
--replicas 1 \
--env SPARK_MASTER_IP=10.0.0.3 \
--mount type=volume,target=/data,source=spark,volume-driver=flocker \
fsoppelsa/spark-worker
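To confirm that the volume is wired into the new service definition, we can inspect it (one way to do so; the --format expression navigates the service spec JSON):

$ docker service inspect \
--format '{{ json .Spec.TaskTemplate.ContainerSpec.Mounts }}' \
spark-worker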

We now load the Docker credentials of host aws-105 with:

$ eval $(docker-machine env aws-105) 
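The container ID we'll use below (13ad1e671c8d in our case) is that of the Spark master container running on aws-105; it can be looked up with:

$ docker ps --filter name=spark-master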

We can try to write some data in /data by connecting to the Spark master container. In this example, we just save some text data (the content of lorem ipsum, available for example at http://www.loremipsum.net) to /data/file.txt:

$ docker exec -ti 13ad1e671c8d bash
# echo "the content of lorem ipsum" > /data/file.txt

Then, we connect to the Spark shell to execute a simple Spark job:

  1. Load file.txt.
  2. Map the words it contains to the number of their occurrences.
  3. Save the result in /data/output:
    $ docker exec -ti 13ad1e671c8d /spark/bin/spark-shell
    ...
    scala> val inFile = sc.textFile("file:/data/file.txt")
    scala> val counts = inFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    scala> counts.saveAsTextFile("file:/data/output")
    scala> ^D
    

Now, let's start a busybox container on any Spark node and check the content of the spark volume, verifying that the output was written. We run the following code:

$ docker run -v spark:/data -ti busybox sh
# ls /data
# ls /data/output/
# cat /data/output/part-00000
(Screenshot: the content of the spark volume, showing the word-count output)
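In text form, since the input file holds just the five words of our sample sentence, the saved output would look roughly like this (illustrative; ordering and partitioning may vary between runs):

(the,1)
(content,1)
(of,1)
(lorem,1)
(ipsum,1)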

The preceding screenshot shows the output, as expected. The interesting thing about Flocker volumes is that they can even be moved from one host to another, and a number of operations can be done on them in a reliable way. Flocker is a good choice if you're looking for a solid storage solution for Docker. For example, it's used in production by the Swisscom Developer cloud (http://developer.swisscom.com/), which lets you provision databases such as MongoDB backed by Flocker technology. Upcoming releases of Flocker will aim at slimming down the codebase and making it leaner and more durable. Items such as built-in HA, snapshotting, certificate distribution, and easily deployable agents in containers are some of the things that are up next. So, a bright future!
