Developing Applications in a Distributed Environment

As the volume of data and the resource requirements for parallel computation increase, legacy approaches may not perform well. So far, we have seen how big data approaches have become popular and widely adopted by enterprises for precisely these reasons. DL4J supports neural network training, evaluation, and inference on distributed clusters.

Modern approaches distribute the effort of heavy training or inference tasks across multiple machines, which brings challenges of its own. Before we use Spark to perform distributed training, evaluation, or inference, we need to ensure that the following conditions are met:

  • Our data is large enough to justify the need for a distributed cluster. Running a small network or a small dataset on Spark gains no real performance improvement, and local machine execution may produce much better results in such scenarios.
  • We have more than a single machine on which to perform training, evaluation, or inference.

Let's say we have a single machine with multiple GPUs. In this case, we could simply use a parallel wrapper rather than Spark. A parallel wrapper enables parallel training on a single machine across multiple cores or GPUs. Parallel wrappers will be discussed in Chapter 12, Benchmarking and Neural Network Optimization, where you will find out how to configure them. Also, distributed training is only worth considering if a single iteration of the neural network takes more than 100 ms; otherwise, the distribution overhead is likely to outweigh the gains.
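To give a sense of what the single-machine alternative looks like, the following is a minimal sketch using DL4J's ParallelWrapper. It assumes a preconfigured MultiLayerNetwork (model) and a DataSetIterator (trainData); these names and the hyperparameter values are illustrative only:

```java
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.parallelism.ParallelWrapper;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

public class ParallelWrapperSketch {
    public static void train(MultiLayerNetwork model, DataSetIterator trainData) {
        // Wrap the network so that several workers (for example, one per GPU)
        // train on copies of the model and periodically average their parameters.
        ParallelWrapper wrapper = new ParallelWrapper.Builder<>(model)
                .prefetchBuffer(24)              // mini-batches to prefetch per worker (illustrative)
                .workers(4)                      // number of parallel workers, e.g. one per GPU
                .averagingFrequency(3)           // average parameters every 3 mini-batches
                .reportScoreAfterAveraging(true) // log the score after each averaging step
                .build();

        wrapper.fit(trainData);                  // runs parallel training on the local machine
    }
}
```

Note that no Spark context is involved here: all coordination happens in a single JVM, which is exactly why this approach is preferable when one machine has enough compute capacity.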

In this chapter, we will discuss how to configure DL4J for distributed training, evaluation, and inference, and we will develop a distributed neural network for the TinyImageNet classifier. We will cover the following recipes (a brief sketch of the Spark training API follows the list):

  • Setting up DL4J and the required dependencies
  • Creating an uber-JAR for training
  • CPU/GPU-specific configuration for training
  • Memory settings and garbage collection for Spark
  • Configuring encoding thresholds
  • Performing a distributed test set evaluation
  • Saving and loading trained neural network models
  • Performing distributed inference
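Before diving into the recipes, here is a minimal sketch of how distributed training is typically wired up with DL4J's Spark API. This is not the complete TinyImageNet example developed in this chapter; it assumes a preconfigured MultiLayerConfiguration (conf), a JavaSparkContext (sc), and a JavaRDD<DataSet> of training data (trainData), all of which are illustrative names. DL4J provides both parameter averaging and gradient sharing training masters; for brevity, this sketch uses ParameterAveragingTrainingMaster:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer;
import org.deeplearning4j.spark.impl.paramavg.ParameterAveragingTrainingMaster;
import org.nd4j.linalg.dataset.DataSet;

public class SparkTrainingSketch {
    public static void train(JavaSparkContext sc,
                             MultiLayerConfiguration conf,
                             JavaRDD<DataSet> trainData) {
        // The TrainingMaster controls how training is coordinated across workers.
        ParameterAveragingTrainingMaster trainingMaster =
                new ParameterAveragingTrainingMaster.Builder(32) // examples per DataSet object in the RDD
                        .batchSizePerWorker(32)        // mini-batch size on each worker (illustrative)
                        .averagingFrequency(5)         // average parameters every 5 mini-batches
                        .workerPrefetchNumBatches(2)   // mini-batches to prefetch asynchronously
                        .build();

        // Wraps an ordinary network configuration for Spark-based training
        SparkDl4jMultiLayer sparkNet = new SparkDl4jMultiLayer(sc, conf, trainingMaster);

        sparkNet.fit(trainData); // one call trains for a single epoch over the RDD
    }
}
```

The TrainingMaster is the key design choice here: it defines how parameters are coordinated across Spark workers, and a gradient sharing implementation (whose encoding thresholds are the subject of one of the recipes above) can be swapped in without changing the rest of the code.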