How it works...

In step 1, we configured the threshold algorithm in SharedTrainingMaster; the default algorithm is AdaptiveThresholdAlgorithm. The threshold algorithm determines the encoding threshold for distributed training, a hyperparameter specific to the gradient sharing implementation. Note that parameter updates below the threshold are not discarded: as mentioned earlier, they are accumulated in a separate residual vector and processed later. This is done to reduce the network traffic/load during training. AdaptiveThresholdAlgorithm is preferred in most cases for better performance.

In step 2, we used ResidualPostProcessor to post-process the residual vector. The residual vector is created internally by the gradient sharing implementation to accumulate parameter updates that fall below the specified threshold. Most implementations of ResidualPostProcessor clip or decay the residual vector so that its values do not grow too large relative to the threshold value; ResidualClippingPostProcessor is one such implementation. Keeping the residual values bounded matters because an overly large residual vector takes too long to communicate and can lead to stale-gradient issues.
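Putting the two steps together, a SharedTrainingMaster configuration might look like the following sketch. The VoidConfiguration settings, minibatch size, initial threshold, clip multiple, and frequency shown here are illustrative assumptions, not values prescribed by the recipe:

```java
import org.deeplearning4j.optimize.solvers.accumulation.encoding.residual.ResidualClippingPostProcessor;
import org.deeplearning4j.optimize.solvers.accumulation.encoding.threshold.AdaptiveThresholdAlgorithm;
import org.deeplearning4j.spark.parameterserver.training.SharedTrainingMaster;
import org.nd4j.parameterserver.distributed.conf.VoidConfiguration;

public class GradientSharingConfig {
    public static void main(String[] args) {
        // Assumption: basic parameter-server configuration for the cluster
        VoidConfiguration voidConfiguration = VoidConfiguration.builder()
                .networkMask("10.0.0.0/16")   // illustrative network mask
                .build();

        SharedTrainingMaster trainingMaster =
                new SharedTrainingMaster.Builder(voidConfiguration, 32)  // 32 = minibatch size (assumption)
                        // Step 1: threshold algorithm with an initial threshold of 1e-3 (illustrative)
                        .thresholdAlgorithm(new AdaptiveThresholdAlgorithm(1e-3))
                        // Step 2: clip residuals to 5x the current threshold, every 5 iterations (illustrative)
                        .residualPostProcessor(new ResidualClippingPostProcessor(5.0, 5))
                        .build();
    }
}
```

This requires a Spark cluster and the DL4J parameter-server dependencies on the classpath, so it is a configuration sketch rather than a standalone runnable program.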

In step 1, we called thresholdAlgorithm() to set the threshold algorithm. In step 2, we called residualPostProcessor() to post-process the residual vector for the gradient sharing implementation in DL4J. ResidualClippingPostProcessor accepts two arguments: clipValue and frequency. clipValue is the multiple of the current threshold used for clipping, while frequency controls how often the clipping is applied. For example, if the threshold is t and clipValue is c, then the residual vector is clipped to the range [-c*t, c*t].
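The clipping range can be illustrated in plain Java, independently of DL4J. The `clip` helper below is hypothetical (DL4J performs this internally); with a threshold of t = 1e-3 and clipValue of c = 5, residual entries are clamped to [-0.005, 0.005]:

```java
public class ResidualClipDemo {
    // Clamp a single residual entry to [-clipValue * threshold, clipValue * threshold]
    static double clip(double value, double clipValue, double threshold) {
        double limit = clipValue * threshold;
        return Math.max(-limit, Math.min(limit, value));
    }

    public static void main(String[] args) {
        double t = 1e-3;  // current encoding threshold (illustrative)
        double c = 5.0;   // clipValue: multiple of the threshold

        System.out.println(clip(0.02, c, t));   // above the range  -> clipped to 0.005
        System.out.println(clip(-0.02, c, t));  // below the range  -> clipped to -0.005
        System.out.println(clip(0.001, c, t));  // inside the range -> unchanged
    }
}
```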
