Cliff walking example of on-policy and off-policy TD control

A cliff walking grid-world example is used to compare SARSA and Q-learning and to highlight the difference between on-policy (SARSA) and off-policy (Q-learning) methods. This is a standard undiscounted, episodic task with a start state and a goal state, and with permitted movements in four directions (up, down, left, and right). A reward of -1 is given on all transitions except those into the region marked The Cliff; stepping into this region penalizes the agent with a reward of -100 and sends it instantly back to the start position.
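For reference, the tabular update rules (in the standard notation of Sutton and Barto) make the on-policy/off-policy difference explicit: SARSA bootstraps from the action $A_{t+1}$ actually selected by the ε-greedy behaviour policy, whereas Q-learning bootstraps from the greedy maximum over actions in the next state:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \bigl[ R_{t+1} + \gamma \, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \bigr] \qquad \text{(SARSA)}$$

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \bigl[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \bigr] \qquad \text{(Q-learning)}$$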

The following code snippets are adapted from Shangtong Zhang's Python code for reinforcement learning and are published in this book with permission from Shangtong Zhang, a student of Richard S. Sutton, the author of Reinforcement Learning: An Introduction (details are provided in the Further reading section):

# Cliff-Walking - TD learning - SARSA & Q-learning 
>>> from __future__ import print_function 
>>> import numpy as np 
>>> import matplotlib.pyplot as plt 
 
# Grid dimensions 
>>> GRID_HEIGHT = 4 
>>> GRID_WIDTH = 12 
 
# probability for exploration, step size, and discount factor (gamma)
>>> EPSILON = 0.1 
>>> ALPHA = 0.5 
>>> GAMMA = 1 
 
# all possible actions 
>>> ACTION_UP = 0
>>> ACTION_DOWN = 1
>>> ACTION_LEFT = 2
>>> ACTION_RIGHT = 3
>>> actions = [ACTION_UP, ACTION_DOWN, ACTION_LEFT, ACTION_RIGHT] 
 
# initial state action pair values 
>>> stateActionValues = np.zeros((GRID_HEIGHT, GRID_WIDTH, 4)) 
>>> startState = [3, 0] 
>>> goalState = [3, 11] 
 
# reward for each action in each state 
>>> actionRewards = np.zeros((GRID_HEIGHT, GRID_WIDTH, 4)) 
>>> actionRewards[:, :, :] = -1.0 
>>> actionRewards[2, 1:11, ACTION_DOWN] = -100.0 
>>> actionRewards[3, 0, ACTION_RIGHT] = -100.0 
 
# set up destinations for each action in each state 
>>> actionDestination = [] 
>>> for i in range(0, GRID_HEIGHT): 
...     actionDestination.append([]) 
...     for j in range(0, GRID_WIDTH): 
...         destination = dict()
...         destination[ACTION_UP] = [max(i - 1, 0), j]
...         destination[ACTION_LEFT] = [i, max(j - 1, 0)]
...         destination[ACTION_RIGHT] = [i, min(j + 1, GRID_WIDTH - 1)]
...         if i == 2 and 1 <= j <= 10:
...             destination[ACTION_DOWN] = startState
...         else:
...             destination[ACTION_DOWN] = [min(i + 1, GRID_HEIGHT - 1), j]
...         actionDestination[-1].append(destination)
>>> actionDestination[3][0][ACTION_RIGHT] = startState 
 
# choose an action based on epsilon greedy algorithm 
>>> def chooseAction(state, stateActionValues): 
...     if np.random.binomial(1, EPSILON) == 1: 
...         return np.random.choice(actions) 
...     else: 
...         return np.argmax(stateActionValues[state[0], state[1], :]) 
 
 
# SARSA update 
 
>>> def sarsa(stateActionValues, expected=False, stepSize=ALPHA): 
...     currentState = startState 
...     currentAction = chooseAction(currentState, stateActionValues) 
...     rewards = 0.0 
...     while currentState != goalState:
...         newState = actionDestination[currentState[0]][currentState[1]][currentAction]
...         newAction = chooseAction(newState, stateActionValues)
...         reward = actionRewards[currentState[0], currentState[1], currentAction]
...         rewards += reward 
...         if not expected: 
...             valueTarget = stateActionValues[newState[0], newState[1], newAction] 
...         else: 
...             valueTarget = 0.0 
...             actionValues = stateActionValues[newState[0], newState[1], :] 
...             bestActions = np.argwhere(actionValues == np.max(actionValues)) 
...             for action in actions:
...                 if action in bestActions:
...                     valueTarget += ((1.0 - EPSILON) / len(bestActions) + EPSILON / len(actions)) * stateActionValues[newState[0], newState[1], action]
...                 else:
...                     valueTarget += EPSILON / len(actions) * stateActionValues[newState[0], newState[1], action]
...         valueTarget *= GAMMA
...         stateActionValues[currentState[0], currentState[1], currentAction] += stepSize * (reward + valueTarget - stateActionValues[currentState[0], currentState[1], currentAction])
...         currentState = newState 
...         currentAction = newAction 
...     return rewards 
 
# Q-learning update 
>>> def qlearning(stateActionValues, stepSize=ALPHA): 
...     currentState = startState 
...     rewards = 0.0 
...     while currentState != goalState: 
...         currentAction = chooseAction(currentState, stateActionValues) 
...         reward = actionRewards[currentState[0], currentState[1], currentAction] 
...         rewards += reward 
...         newState = actionDestination[currentState[0]][currentState[1]][currentAction]
...         stateActionValues[currentState[0], currentState[1], currentAction] += stepSize * (reward + GAMMA * np.max(stateActionValues[newState[0], newState[1], :]) - 
...             stateActionValues[currentState[0], currentState[1], currentAction]) 
...         currentState = newState 
...     return rewards 
 
 
# print optimal policy 
>>> def printOptimalPolicy(stateActionValues): 
...     optimalPolicy = [] 
...     for i in range(0, GRID_HEIGHT): 
...         optimalPolicy.append([]) 
...         for j in range(0, GRID_WIDTH): 
...             if [i, j] == goalState: 
...                 optimalPolicy[-1].append('G') 
...                 continue 
...             bestAction = np.argmax(stateActionValues[i, j, :]) 
...             if bestAction == ACTION_UP: 
...                 optimalPolicy[-1].append('U') 
...             elif bestAction == ACTION_DOWN: 
...                 optimalPolicy[-1].append('D') 
...             elif bestAction == ACTION_LEFT: 
...                 optimalPolicy[-1].append('L') 
...             elif bestAction == ACTION_RIGHT: 
...                 optimalPolicy[-1].append('R') 
...     for row in optimalPolicy: 
...         print(row) 
 
>>> def SARSAnQLPlot():
...     # averaging the reward sums from 10 successive episodes
...     averageRange = 10

...     # episodes of each run
...     nEpisodes = 500

...     # perform 20 independent runs
...     runs = 20
 
...     rewardsSarsa = np.zeros(nEpisodes) 
...     rewardsQlearning = np.zeros(nEpisodes) 
...     for run in range(0, runs): 
...         stateActionValuesSarsa = np.copy(stateActionValues) 
...         stateActionValuesQlearning = np.copy(stateActionValues) 
...         for i in range(0, nEpisodes): 
...             # cut off the value by -100 to draw the figure more elegantly
...             rewardsSarsa[i] += max(sarsa(stateActionValuesSarsa), -100) 
...             rewardsQlearning[i] += max(qlearning(stateActionValuesQlearning), -100) 
 
...     # averaging over independent runs
...     rewardsSarsa /= runs 
...     rewardsQlearning /= runs 
 
...     # averaging over successive episodes
...     smoothedRewardsSarsa = np.copy(rewardsSarsa) 
...     smoothedRewardsQlearning = np.copy(rewardsQlearning) 
...     for i in range(averageRange, nEpisodes): 
...         smoothedRewardsSarsa[i] = np.mean(rewardsSarsa[i - averageRange: i + 1]) 
...         smoothedRewardsQlearning[i] = np.mean(rewardsQlearning[i - averageRange: i + 1]) 
 
...     # display optimal policy
...     print('Sarsa Optimal Policy:') 
...     printOptimalPolicy(stateActionValuesSarsa) 
...     print('Q-learning Optimal Policy:') 
...     printOptimalPolicy(stateActionValuesQlearning) 
 
...     # draw reward curves
...     plt.figure(1) 
...     plt.plot(smoothedRewardsSarsa, label='Sarsa') 
...     plt.plot(smoothedRewardsQlearning, label='Q-learning') 
...     plt.xlabel('Episodes') 
...     plt.ylabel('Sum of rewards during episode') 
...     plt.legend() 
 
 
# Sum of rewards for SARSA versus Q-learning
>>> SARSAnQLPlot() 
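
Note that SARSAnQLPlot() only draws onto the current matplotlib figure; depending on your backend (a non-interactive backend is assumed here), you may need to display it explicitly with plt.show(), or write it to a file with plt.savefig():

# display the reward curves (call only if the figure does not appear automatically)
>>> plt.show()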

After an initial transient, Q-learning learns the value of the optimal policy, which walks along the optimal path where the agent travels right along the edge of the cliff. Unfortunately, this occasionally results in falling off the cliff because of the ε-greedy action selection. SARSA, on the other hand, takes the action selection into account and learns the longer but safer path through the upper part of the grid. So although Q-learning learns the value of the optimal policy, its online performance is worse than that of SARSA, which learns the roundabout but safer policy. This is also visible in the following diagram of the sum of rewards: SARSA obtains a less negative sum of rewards per episode than Q-learning.
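A quick back-of-the-envelope calculation (illustrative only, not part of the original program) shows why the cliff-edge path is costly under ε-greedy behaviour: in a state directly above the cliff, the agent explores with probability EPSILON, and one of the four equally likely random actions steps off the cliff.

# expected immediate reward in a state just above the cliff under epsilon-greedy action selection
>>> EPSILON = 0.1
>>> p_fall = EPSILON / 4          # exploratory action chosen uniformly; one of four falls off
>>> round((1 - p_fall) * (-1.0) + p_fall * (-100.0), 3)
-3.475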
