Boltzmann or softmax exploration

Boltzmann exploration is also called softmax exploration. Rather than either always taking the estimated optimal action or always taking a random action, this strategy blends the two through weighted probabilities. It does so by applying a softmax to the network's value estimates for each action. The action the agent estimates to be optimal is therefore the most likely to be chosen, although this is not guaranteed.
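A minimal sketch of this idea in plain Python follows. The Q-values here are hypothetical placeholders standing in for a network's estimates, and the function names are illustrative, not from the text:

```python
import math
import random

def softmax(q_values, temperature=1.0):
    """Convert action-value estimates into a probability distribution."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(q_values)
    exps = [math.exp((v - m) / temperature) for v in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def select_action(q_values, temperature=1.0):
    """Sample an action index in proportion to its softmax probability."""
    probs = softmax(q_values, temperature)
    return random.choices(range(len(q_values)), weights=probs, k=1)[0]

# Hypothetical value estimates for three actions.
q = [1.0, 2.0, 0.5]
probs = softmax(q)
# The highest-valued action (index 1) gets the largest probability,
# but the other actions retain a nonzero chance of being sampled.
```

Because sampling is probabilistic, the greedy action is merely favored, not forced, which is exactly the behavior described above.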

The biggest advantage Boltzmann exploration has over epsilon-greedy is that it uses information about the likely values of the other actions. For example, imagine that five actions are available to an agent. Under epsilon-greedy, the four actions estimated to be non-optimal are all treated equally when the agent explores. Under Boltzmann exploration, however, those four sub-optimal choices are weighted by their relative estimated values. This lets the agent largely ignore actions estimated to be far from optimal and give more attention to promising, though not necessarily ideal, actions.
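To make the contrast concrete, the sketch below compares the two distributions over five actions. The value estimates are hypothetical, and the epsilon-greedy variant shown (splitting epsilon evenly over the non-greedy actions) is one common formulation, assumed here for illustration:

```python
import math

q = [0.9, 0.4, 0.35, 0.05, 0.0]  # hypothetical value estimates, action 0 is greedy

# Epsilon-greedy: all non-greedy actions share the exploration mass equally.
EPSILON = 0.1
greedy = q.index(max(q))
eps_probs = [EPSILON / (len(q) - 1)] * len(q)
eps_probs[greedy] = 1.0 - EPSILON

# Boltzmann: each action is weighted by exp(Q / tau), so a decent
# sub-optimal action (index 1) receives more probability than a
# clearly bad one (index 3).
TAU = 0.5
exps = [math.exp(v / TAU) for v in q]
boltz_probs = [e / sum(exps) for e in exps]
```

Under epsilon-greedy the four non-greedy actions each get the same probability, while under Boltzmann exploration their probabilities track their estimated values.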

The temperature parameter (τ) controls the spread of the softmax distribution: at a high temperature all actions are considered nearly equally, while at a low temperature the probability mass concentrates on the highest-valued actions. The temperature is annealed over training, starting high so that the agent explores broadly, and ending low so that it acts mostly greedily by the end of training.
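One simple annealing schedule is a linear ramp from a high to a low temperature. The start and end values, step count, and function names below are illustrative assumptions, not prescribed by the text:

```python
import math

def annealed_temperature(step, tau_start=1.0, tau_end=0.05, total_steps=10_000):
    """Linearly interpolate tau from tau_start down to tau_end over total_steps."""
    frac = min(step / total_steps, 1.0)
    return tau_start + frac * (tau_end - tau_start)

def boltzmann_probs(q_values, tau):
    """Softmax over Q-value estimates at temperature tau."""
    m = max(q_values)
    exps = [math.exp((v - m) / tau) for v in q_values]
    total = sum(exps)
    return [e / total for e in exps]

q = [0.9, 0.4, 0.35, 0.05, 0.0]  # hypothetical value estimates
early = boltzmann_probs(q, annealed_temperature(0))       # spread-out distribution
late = boltzmann_probs(q, annealed_temperature(10_000))   # nearly all mass on the greedy action
```

Early in training the probabilities are relatively close together, encouraging exploration; by the end, almost all of the mass sits on the highest-valued action.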
