In this section, we will code up the strategy we discussed earlier (the code file is available as Frozen_Lake_with_Q_Learning.ipynb on GitHub):
- Import the relevant packages:
import gym
from gym import envs
from gym.envs.registration import register
Gym is a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games such as Pong and Pinball.
More about Gym can be found at: https://gym.openai.com/.
- Register the environment:
register(
    id='FrozenLakeNotSlippery-v1',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name': '4x4', 'is_slippery': False},
    max_episode_steps=100,
    reward_threshold=0.8196)
- Create the environment:
env = gym.make('FrozenLakeNotSlippery-v1')
- Inspect the created environment:
env.render()
The preceding step renders (prints) the environment:
env.observation_space
The preceding code returns the observation space of the environment, that is, the number of states. In our case, given that it is a 4 x 4 grid, we have a total of 16 states.
env.action_space.n
The preceding code returns the number of actions that can be taken in a state of the environment:
env.action_space.sample()
The preceding code samples an action from the possible set of actions:
env.step(action)
The preceding code takes a chosen action and returns the new state, the reward for the action, a flag indicating whether the game is done, and additional diagnostic information for the step:
env.reset()
The preceding code resets the environment so that the agent is back to the starting state.
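The 16 states above correspond to the cells of the 4 x 4 grid, numbered row by row (FrozenLake maps the cell in a given row and column to the single integer state row * 4 + col). A minimal sketch of this mapping, with helper names chosen here just for illustration:

```python
def to_state(row, col, ncols=4):
    """Map a (row, col) grid cell to the single integer state the environment uses."""
    return row * ncols + col

def to_cell(state, ncols=4):
    """Inverse mapping: state -> (row, col)."""
    return divmod(state, ncols)

print(to_state(0, 0))   # -> 0   (the starting cell S, top-left)
print(to_state(3, 3))   # -> 15  (the goal cell G, bottom-right)
print(to_cell(5))       # -> (1, 1)
```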
- Initialize the q-table:
import numpy as np
qtable = np.zeros((16,4))
We have initialized it to a shape of (16, 4) as there are 16 states and 4 possible actions in each state.
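To see how the table is indexed: row s of the q-table holds the estimated value of each of the 4 actions in state s, and np.argmax over that row gives the greedy action for the state. A quick sketch (the value assigned below is hypothetical, purely for illustration):

```python
import numpy as np

qtable = np.zeros((16, 4))      # one row per state, one column per action

qtable[5, 3] = 0.5              # hypothetical value, purely for illustration
print(qtable.shape)             # -> (16, 4)
print(np.argmax(qtable[5, :]))  # -> 3 (the greedy action in state 5)
```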
- Run multiple iterations of playing a game:
Initialize hyper-parameters:
total_episodes = 15000   # number of training episodes
learning_rate = 0.8
max_steps = 99           # maximum steps per episode
gamma = 0.95             # discount factor
epsilon = 1.0            # exploration rate
max_epsilon = 1.0
min_epsilon = 0.01
decay_rate = 0.005       # exponential decay rate for epsilon
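With these hyperparameters, epsilon decays exponentially from max_epsilon towards min_epsilon (note the negative exponent, which makes epsilon shrink rather than grow), so early episodes mostly explore and later episodes mostly exploit. A sketch of the schedule:

```python
import numpy as np

min_epsilon, max_epsilon, decay_rate = 0.01, 1.0, 0.005

def epsilon_at(episode):
    # Standard exponential decay from max_epsilon towards min_epsilon.
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)

print(round(epsilon_at(0), 3))      # -> 1.0   (pure exploration at the start)
print(round(epsilon_at(1000), 3))   # -> 0.017
print(round(epsilon_at(15000), 3))  # -> 0.01  (almost pure exploitation)
```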
Play multiple episodes of the game:
rewards = []
for episode in range(total_episodes):
    state = env.reset()
    step = 0
    done = False
    total_rewards = 0
In the code below, we define the action to be taken. If exp_exp_tradeoff (a random number generated between 0 and 1) is greater than epsilon, we exploit (take the best action according to the q-table); otherwise, we explore (take a random action):
    for step in range(max_steps):
        exp_exp_tradeoff = np.random.uniform(0, 1)
        ## Exploitation:
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(qtable[state, :])
        else:
            ## Exploration
            action = env.action_space.sample()
In the code below, we fetch the new state, the reward, and a flag indicating whether the game is done, by taking the action in the given step:
        new_state, reward, done, _ = env.step(action)
In the code below, we update the q-table based on the action taken in a state. Additionally, we update the state to the new state obtained after taking the action in the current state:
        qtable[state, action] = qtable[state, action] + learning_rate * \
            (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])
        total_rewards += reward
        state = new_state
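The update is easier to see with concrete numbers. A sketch with made-up values: starting from a zero Q-value, receiving a reward of 1 for reaching the goal pushes qtable[state, action] towards 1 at the rate given by learning_rate (the state and action indices below are hypothetical):

```python
import numpy as np

learning_rate, gamma = 0.8, 0.95
qtable = np.zeros((16, 4))

# Suppose taking action 2 in state 14 reaches the goal: reward = 1,
# new_state = 15, and all Q-values of the new state are still 0.
state, action, reward, new_state = 14, 2, 1.0, 15
qtable[state, action] = qtable[state, action] + learning_rate * \
    (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])
print(qtable[14, 2])  # -> 0.8, i.e. 0 + 0.8 * (1 + 0.95 * 0 - 0)
```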
In the following code, once the game is over (done is True), we break out of the inner loop and proceed to a new episode. At the end of each episode, we also decay the exploration factor epsilon, which decides whether we go for exploration or exploitation:
        if done:
            break
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
    rewards.append(total_rewards)
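Once training finishes, a quick way to gauge the agent is the fraction of episodes that ended at the goal (in this environment, the total reward of an episode is 1 on success and 0 otherwise). A sketch with a dummy rewards list standing in for the one built above:

```python
# Stand-in for the rewards list accumulated during training.
dummy_rewards = [0, 0, 1, 1, 1, 0, 1, 1]

# Fraction of successful episodes.
score = sum(dummy_rewards) / len(dummy_rewards)
print("Score over time:", score)  # -> 0.625
```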
- Once we have built the q-table, we now deploy the agent to maneuver in line with the optimal actions suggested by the q-table:
env.reset()
for episode in range(1):
    state = env.reset()
    step = 0
    done = False
    print("-----------------------")
    print("Episode", episode)
    for step in range(max_steps):
        env.render()
        action = np.argmax(qtable[state, :])
        print(action)
        new_state, reward, done, info = env.step(action)
        if done:
            #env.render()
            print("Number of Steps", step + 1)
            break
        state = new_state
The preceding code prints the optimal path that the agent traverses to reach the end goal.
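The greedy policy-following above can be sketched without the environment at all. The snippet below is a self-contained illustration on the 4 x 4 grid, assuming FrozenLake's action encoding (0 = left, 1 = down, 2 = right, 3 = up); the hand-crafted Q-table and the simplified transition function (which ignores holes) are purely for illustration, not a trained result:

```python
import numpy as np

def step(state, action, n=4):
    # Deterministic grid movement, clipped at the edges (holes ignored).
    row, col = divmod(state, n)
    if action == 0:   col = max(col - 1, 0)      # left
    elif action == 1: row = min(row + 1, n - 1)  # down
    elif action == 2: col = min(col + 1, n - 1)  # right
    elif action == 3: row = max(row - 1, 0)      # up
    return row * n + col

qtable = np.zeros((16, 4))
for s in [0, 4, 8]:        # first column: best action is down
    qtable[s, 1] = 1.0
for s in [12, 13, 14]:     # bottom row: best action is right, towards the goal
    qtable[s, 2] = 1.0

state, path = 0, [0]
while state != 15:         # state 15 is the goal cell
    state = step(state, np.argmax(qtable[state, :]))
    path.append(state)
print(path)  # -> [0, 4, 8, 12, 13, 14, 15]
```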