How it works...

In step 2, we added dependencies for DataVec. We need data transformation functions in Spark just as we do in regular training: transformation is a data requirement for neural networks and is not specific to Spark.

For example, we discussed LocalTransformExecutor in Chapter 2, Data Extraction, Transformation, and Loading. LocalTransformExecutor is used for DataVec transformations in non-distributed environments, while SparkTransformExecutor is used for the DataVec transformation process in Spark.
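
To make the distinction concrete, here is a minimal sketch of applying a DataVec TransformProcess to a Spark RDD with SparkTransformExecutor. The schema, the transform step, and the CSV path are hypothetical placeholders, not part of this recipe:

import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.schema.Schema;
import org.datavec.api.writable.Writable;
import org.datavec.spark.transform.SparkTransformExecutor;
import org.datavec.spark.transform.misc.StringToWritablesFunction;

public class SparkTransformExample {
    public static JavaRDD<List<Writable>> transform(JavaSparkContext sc, String csvPath) {
        // Hypothetical schema for a two-column CSV file
        Schema schema = new Schema.Builder()
                .addColumnString("name")
                .addColumnDouble("value")
                .build();

        // Hypothetical transform: drop the string column before training
        TransformProcess transformProcess = new TransformProcess.Builder(schema)
                .removeColumns("name")
                .build();

        // Parse the raw CSV lines into DataVec writables
        JavaRDD<String> lines = sc.textFile(csvPath);
        JavaRDD<List<Writable>> parsed = lines.map(new StringToWritablesFunction(new CSVRecordReader()));

        // SparkTransformExecutor runs the same TransformProcess that
        // LocalTransformExecutor would run locally, but across the cluster
        return SparkTransformExecutor.execute(parsed, transformProcess);
    }
}

The same TransformProcess can therefore be shared between local experiments and the Spark job; only the executor changes.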

In step 4, we added dependencies for gradient sharing. Gradient sharing trains faster than parameter averaging and is designed to be scalable and fault-tolerant, so it is the preferred approach. In gradient sharing, instead of relaying all the parameter updates/gradients across the network, only the updates that are above a specified threshold are communicated. Suppose we have an update vector that we want to communicate across the network: we encode it as a sparse binary vector representing only the large values (those above the threshold), and this sparse vector is what gets sent. The main idea is to reduce the communication effort. The remaining updates are not discarded; they are accumulated in a residual vector and applied in later updates (delayed communication), so nothing is lost. Gradient sharing in DL4J is an asynchronous SGD implementation. You can read more about this in detail at http://nikkostrom.com/publications/interspeech2015/strom_interspeech2015.pdf.
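
As a rough illustration of how gradient sharing is wired up on the DL4J side, here is a minimal sketch using SharedTrainingMaster. The port, network mask, batch size, and threshold value are placeholder assumptions, and the exact builder method for setting the threshold differs between DL4J versions (newer releases take a ThresholdAlgorithm instead of a plain value):

import org.apache.spark.api.java.JavaSparkContext;
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer;
import org.deeplearning4j.spark.parameterserver.training.SharedTrainingMaster;
import org.nd4j.parameterserver.distributed.conf.VoidConfiguration;

public class GradientSharingSetup {
    public static SparkDl4jMultiLayer buildSparkNetwork(JavaSparkContext sc,
                                                        MultiLayerConfiguration networkConfig) {
        // Parameter server configuration used by gradient sharing
        // (the port and network mask below are placeholder values)
        VoidConfiguration voidConfiguration = VoidConfiguration.builder()
                .unicastPort(40123)
                .networkMask("10.0.0.0/16")
                .build();

        int batchSizePerWorker = 32; // placeholder minibatch size

        // SharedTrainingMaster implements the threshold-encoded, asynchronous
        // gradient-sharing approach: only updates above the threshold are sent,
        // and the remainder accumulates in the residual vector
        SharedTrainingMaster trainingMaster =
                new SharedTrainingMaster.Builder(voidConfiguration, batchSizePerWorker)
                        .batchSizePerWorker(batchSizePerWorker)
                        .workersPerNode(1)
                        .updatesThreshold(1e-3) // threshold for the sparse update encoding
                        .build();

        return new SparkDl4jMultiLayer(sc, networkConfig, trainingMaster);
    }
}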

In step 5, we added CUDA dependencies for the Spark distributed training application.

Here are the uber-JAR requirements for this:

  • If the OS on which the uber-JAR is built is the same as the cluster OS (for example, you build on Linux and execute on a Spark Linux cluster), include the nd4j-cuda-x.x dependency in the pom.xml file.
  • If the OS on which the uber-JAR is built is different from the cluster OS (for example, you build on Windows and execute on a Spark Linux cluster), include the nd4j-cuda-x.x-platform dependency in the pom.xml file.

Just replace x.x with the CUDA version you have installed (for example, nd4j-cuda-9.2 for CUDA 9.2).
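
Once the uber-JAR is assembled, one simple sanity check (a diagnostic sketch, not part of the recipe steps) is to print which ND4J backend was actually picked up from the classpath, so you can confirm that the CUDA backend made it into the JAR:

import org.nd4j.linalg.factory.Nd4j;

public class BackendCheck {
    public static void main(String[] args) {
        // Prints the ND4J backend class that was loaded (CPU or CUDA)
        System.out.println("ND4J backend: " + Nd4j.getBackend().getClass().getName());
    }
}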

In cases where the cluster machines don't have CUDA/cuDNN set up, we can include the redist javacpp-presets for the cluster OS. You can refer to the respective dependencies here: https://deeplearning4j.org/docs/latest/deeplearning4j-config-cuDNN. That way, we don't have to install CUDA or cuDNN on each and every cluster machine.

In step 6, we added a Maven dependency for JCommander. JCommander is used to parse the command-line arguments supplied to spark-submit. We need this because we will be passing the directory locations (HDFS/local) of the train/test data as command-line arguments to spark-submit.
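
As a brief sketch of how such arguments can be parsed, here is a hypothetical JCommander setup; the option names (--trainDataPath and --testDataPath) and the class name are illustrative assumptions rather than the exact ones used in this recipe:

import com.beust.jcommander.JCommander;
import com.beust.jcommander.Parameter;

public class TrainingArguments {
    // Hypothetical options for the train/test data locations (HDFS or local)
    @Parameter(names = "--trainDataPath", description = "Directory containing the training data", required = true)
    private String trainDataPath;

    @Parameter(names = "--testDataPath", description = "Directory containing the test data", required = true)
    private String testDataPath;

    public static void main(String[] args) {
        TrainingArguments arguments = new TrainingArguments();
        // JCommander maps the program arguments passed by spark-submit onto the annotated fields
        JCommander.newBuilder()
                .addObject(arguments)
                .build()
                .parse(args);

        System.out.println("Train data path: " + arguments.trainDataPath);
        System.out.println("Test data path: " + arguments.testDataPath);
    }
}

These values correspond to the program arguments that follow the application JAR in the spark-submit command.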

From steps 7 to 16, we downloaded and configured Hadoop. Remember to replace {PathDownloaded} with the actual location of the extracted Hadoop package, and x.x with the Hadoop version you downloaded. We need to specify the disk locations where HDFS will store its metadata and data, which is why we created the name/data directories in steps 8 and 9. In step 10, we configured mapred-site.xml. If you can't locate the file in the directory, just create an XML file by copying all the content from the mapred-site.xml.template file, and then make the changes mentioned in step 10.

In step 13, we replaced the JAVA_HOME path variable with the actual Java home directory location. This was done to avoid certain ClassNotFoundException errors at runtime.

In step 18, make sure you download the Spark distribution that matches your Hadoop version. For example, if you have Hadoop 2.7.3, get the Spark package named like spark-x.x-bin-hadoop2.7. For the changes in step 19, if the spark-env.sh file isn't present, just create a new file named spark-env.sh by copying the content from the spark-env.sh.template file, and then make the changes mentioned in step 19. After completing all the steps in this recipe, you should be able to perform distributed neural network training via the spark-submit command.
