How it works...

In step 1, we defined the agent's action space by specifying a set of Malmo commands. For example, movenorth 1 moves the agent one block north. We passed a list of such command strings to MalmoActionSpaceDiscrete to indicate the discrete actions available to the agent in the Malmo environment.
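
A minimal sketch of this step is shown below. The action strings and the setRandomSeed call are illustrative, the exact API may vary between RL4J versions, and import statements for the rl4j-malmo and rl4j-core classes are omitted for brevity:

    // Discrete action space built from Malmo command strings;
    // each index of the space maps to one of these commands.
    MalmoActionSpaceDiscrete actionSpace = new MalmoActionSpaceDiscrete(
            "movenorth 1", "movesouth 1", "movewest 1", "moveeast 1");
    actionSpace.setRandomSeed(123);   // seed for random action sampling (assumed helper)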

In step 2, we created an observation space from the bitmap size (specified by xSize and ySize) of the input images coming from the Malmo environment. We also assumed three color channels (R, G, B). The agent needs to know about the observation space before it runs. We used MalmoObservationSpacePixels because we take observations directly from pixels.
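
A sketch of the observation space, assuming the constructor takes the bitmap width and height; the 320 x 240 size is only an example:

    // Pixel-based observation space: each observation is an xSize x ySize RGB frame
    // streamed from the Malmo client.
    int xSize = 320;
    int ySize = 240;
    MalmoObservationSpacePixels observationSpace =
            new MalmoObservationSpacePixels(xSize, ySize);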

In step 3, we created a Malmo consistency policy using MalmoDescretePositionPolicy to ensure that each upcoming observation is taken in a consistent state.
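
The policy object itself takes no arguments; the sketch below simply shows where it is created:

    // Consistency policy: ensures the agent's discrete position has settled before
    // the next observation is used, so frames are not sampled mid-move.
    MalmoDescretePositionPolicy obsPolicy = new MalmoDescretePositionPolicy();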

An MDP (Markov decision process) is the formalism that reinforcement learning uses to model environments such as grid worlds. Our mission has states laid out as a grid. An MDP requires a policy, and the objective of reinforcement learning is to find the optimal policy for that MDP. MalmoEnv is an MDP wrapper around the Malmo Java client.
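
As a reminder of the underlying formalism (standard notation, not specific to Malmo), the goal is to find the policy that maximizes the expected discounted return:

    \pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \right]

Here, gamma is the discount factor described later in the hyperparameter list, and R is the reward received at each step.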

In step 4, we created an MDP wrapper using the mission schema, action space, observation space, and observation policy. Note that the observation policy is not the same as the policy that the agent is trying to learn by the end of the training process.
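
Putting the previous pieces together, the wrapper can be built roughly as follows; the mission file name is illustrative and the constructor signature is assumed:

    // MDP wrapper around the Malmo Java client: mission schema, action space,
    // observation space, and observation (consistency) policy.
    MalmoEnv mdp = new MalmoEnv("cliff_walking_rl4j.xml",
            actionSpace, observationSpace, obsPolicy);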

In step 5, we used DQNFactoryStdConv to build the DQN by adding convolutional layers.
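
A hedged sketch of the network configuration, assuming DQNFactoryStdConv.Configuration takes a learning rate, L2 regularization, an updater, and training listeners (values, updater choice, and argument order are illustrative):

    // Configuration for the convolutional DQN factory.
    DQNFactoryStdConv.Configuration dqnConfig =
            new DQNFactoryStdConv.Configuration(
                    0.01,             // learning rate (illustrative)
                    0.00,             // L2 regularization
                    new Adam(0.001),  // updater (assumed choice)
                    null);            // training listeners (none here)
    DQNFactoryStdConv dqnFactory = new DQNFactoryStdConv(dqnConfig);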

In step 6, we configured HistoryProcessor to scale the input frames and crop away pixels that were not needed. The actual intent of HistoryProcessor is to support experience replay, where the agent's previous experience is taken into account when it decides on the action for the current state. With HistoryProcessor, we can turn a partial observation of the state into a fully observed state; that is, the current state becomes an accumulation of the previous states (frames).
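
A sketch of the history processor configuration, using Atari-style values purely for illustration (argument order assumed: history length, rescaled width/height, cropping width/height, x/y offsets, frame skip):

    HistoryProcessor.Configuration hpConfig =
            new HistoryProcessor.Configuration(
                    4,    // historyLength: number of past frames stacked into one state
                    84,   // rescaledWidth
                    84,   // rescaledHeight
                    84,   // croppingWidth
                    84,   // croppingHeight
                    0,    // offsetX
                    0,    // offsetY
                    4);   // numFrameSkip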

Here are the hyperparameters used in step 7 while creating the Q-learning configuration (a sketch that gathers them into a configuration object follows the list):

  • maxEpochStep: The maximum number of steps allowed per epoch.
  • maxStep: The maximum number of steps that are allowed. Training will finish when the iterations exceed the value specified for maxStep.
  • expRepMaxSize: The maximum size of experience replay. Experience replay refers to the number of past transitions based on which the agent can decide on the next step to take. 
  • doubleDQN: This decides whether double DQN is enabled in the configuration (true if enabled). 
  • targetDqnUpdateFreq: Regular Q-learning can overestimate action values under certain conditions, and double Q-learning adds stability to the learning. The main idea of double DQN is to freeze the target network, or smoothly average it, after every M updates. The value of M is referred to as targetDqnUpdateFreq.
  • updateStart: The number of no-operation (do nothing) moves at the beginning to ensure the Malmo mission starts with a random configuration. If the agent starts the game in the same way every time, then the agent will memorize the sequence of actions, rather than learning to take the next action based on the current state. 
  • gamma: This is also known as the discount factor. A discount factor is multiplied by future rewards to prevent the agent from being attracted to high rewards, rather than learning the actions. A discount factor close to 1 indicates that the rewards from the distant future are considered. On the other hand, a discount factor close to 0 indicates that the rewards from the immediate future are being considered.
  • rewardFactor: This is a reward-scaling factor to scale the reward for every single step of training.
  • errorClamp: This will clip the gradient of the loss function with respect to the output during backpropagation. For errorClamp = 1, the gradient component is clipped to the range (-1, 1).
  • minEpsilon: Epsilon here is the exploration rate of the epsilon-greedy policy, that is, the probability that the agent picks a random action instead of the action with the highest estimated value. minEpsilon is the lowest value that this exploration rate is annealed down to.
  • epsilonNbStep: The epsilon value is annealed to minEpsilon over epsilonNbStep steps.
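
A sketch that gathers these hyperparameters into the Q-learning configuration; the constructor argument order is assumed, all values are illustrative, and batchSize is an additional parameter not covered in the list above:

    QLearning.QLConfiguration qlConfig =
            new QLearning.QLConfiguration(
                    123,      // random seed
                    200,      // maxEpochStep
                    100000,   // maxStep
                    50000,    // expRepMaxSize
                    32,       // batchSize (assumed additional parameter)
                    500,      // targetDqnUpdateFreq
                    10,       // updateStart: initial no-op moves
                    0.01,     // rewardFactor
                    0.99,     // gamma (discount factor)
                    1.0,      // errorClamp
                    0.1f,     // minEpsilon: final exploration rate
                    10000,    // epsilonNbStep: annealing steps for epsilon
                    true);    // doubleDQN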