How it works...

In step 1, we discussed Spark-specific memory configurations, which can be set for both the master and worker nodes. Note that these memory configurations can also depend on the cluster resource manager in use.

Note that the --executor-memory 4g command-line argument applies to YARN. For other cluster resource managers, refer to their respective documentation to find the equivalent command-line arguments.
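As an illustrative sketch of how this looks on YARN (the application JAR name and main class here are hypothetical placeholders), the executor's on-heap memory and the JavaCPP off-heap limit might be passed together like so:

```shell
# Hypothetical spark-submit invocation on YARN.
# --executor-memory sets the on-heap memory per executor;
# org.bytedeco.javacpp.maxbytes caps JavaCPP's off-heap allocation.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 4g \
  --conf "spark.executor.extraJavaOptions=-Dorg.bytedeco.javacpp.maxbytes=4G" \
  --class com.example.TrainingApp \
  training-app.jar
```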

For Spark Standalone, use the following command-line options to configure the memory space:

  • The on-heap memory for the driver can be configured like so (8G -> 8 GB of memory):
SPARK_DRIVER_MEMORY=8G

  • The off-heap memory for the driver can be configured like so:
SPARK_DRIVER_OPTS=-Dorg.bytedeco.javacpp.maxbytes=8G
  • The on-heap memory for the worker can be configured like so:
SPARK_WORKER_MEMORY=8G
  • The off-heap memory for the worker can be configured like so:
SPARK_WORKER_OPTS=-Dorg.bytedeco.javacpp.maxbytes=8G 

In step 5, we discussed garbage collection for worker nodes. Generally speaking, there are two ways in which we can control the frequency of garbage collection. The following is the first approach:

Nd4j.getMemoryManager().setAutoGcWindow(frequencyIntervalInMs);

This will limit the frequency of garbage collector calls to the specified time interval, that is, frequencyIntervalInMs. The second approach is as follows:

Nd4j.getMemoryManager().togglePeriodicGc(false);

This will disable garbage collector calls entirely. However, these approaches will not alter the worker node's memory configuration. We can configure the worker node's memory using the builder methods that are available in SharedTrainingMaster.

We call workerTogglePeriodicGC() to disable/enable periodic garbage collector (GC) calls and workerPeriodicGCFrequency() to set the frequency at which GC needs to be called. 
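As a sketch of how these builder methods fit together (the batch size, port, and exact builder constructor used here are assumptions and should be checked against your DL4J version), the worker GC settings can be applied when constructing the SharedTrainingMaster:

```java
import org.deeplearning4j.spark.api.TrainingMaster;
import org.deeplearning4j.spark.parameterserver.training.SharedTrainingMaster;
import org.nd4j.parameterserver.distributed.conf.VoidConfiguration;

// Placeholder values; tune these for your cluster.
VoidConfiguration voidConfig = VoidConfiguration.builder()
        .unicastPort(40123)                // port used by the parameter server
        .build();

TrainingMaster trainingMaster = new SharedTrainingMaster.Builder(voidConfig, 32)
        .workerTogglePeriodicGC(true)      // enable periodic GC calls on workers
        .workerPeriodicGCFrequency(10000)  // invoke GC at most every 10,000 ms
        .build();
```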

In step 6, we added support for Kryo serialization in ND4J. Kryo is a fast and efficient Java serialization framework; using it with ND4J can improve serialization performance, and hence training throughput, in Spark.
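A minimal sketch of this configuration, assuming the nd4j-kryo module is on the classpath, sets Kryo as the Spark serializer and registers ND4J's Kryo registrator so that INDArrays are handled correctly:

```java
import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf();
// Use Kryo instead of default Java serialization.
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
// Register ND4J types with Kryo (provided by the nd4j-kryo module).
conf.set("spark.kryo.registrator", "org.nd4j.kryo.Nd4jRegistrator");
```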

For more information, refer to https://spark.apache.org/docs/latest/tuning.html. In step 8, locality configuration is an optional configuration that can be used to improve training performance. Data locality can have a major impact on the performance of Spark jobs. The idea is to ship the data and code together so that the computation can be performed really quickly. For more information, please refer to https://spark.apache.org/docs/latest/tuning.html#data-locality.
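For example, one locality-related setting is spark.locality.wait, which controls how long Spark waits for a data-local slot before launching a task at a less local level. The value below is only an illustration; the right setting depends on your workload:

```java
import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf();
// Do not wait for a data-local slot; start tasks immediately.
// This can help training jobs where network transfer of a minibatch
// is cheaper than leaving executors idle.
conf.set("spark.locality.wait", "0");
```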
