Modeling Blackjack example of Monte Carlo methods using Python

The objective of the popular casino card game Blackjack is to obtain cards whose numerical values sum as close to 21 as possible without exceeding 21. All face cards (king, queen, and jack) count as 10, and an ace can count as either 1 or 11, depending on how the player wants to use it; only the ace has this flexibility. All other cards count at face value. The game begins with two cards dealt to both the dealer and the player. One of the dealer's cards is face up and the other is face down. If the player has a natural 21 from these first two cards (an ace and a 10-card), the player wins unless the dealer also has a natural, in which case the game is a draw. If the player does not have a natural, he can request additional cards one by one (hits) until he either stops (sticks) or exceeds 21 (goes bust). If the player goes bust, he loses; if he sticks, it becomes the dealer's turn. The dealer hits or sticks according to a fixed strategy without choice: the dealer sticks on any sum of 17 or greater and hits otherwise. If the dealer goes bust, the player automatically wins. If the dealer sticks, the outcome is a win, loss, or draw, determined by whose final sum is closer to 21.

The Blackjack problem can be formulated as an episodic finite MDP, in which each game of Blackjack is an episode. Rewards of +1, -1, and 0 are given at the terminal state for winning, losing, and drawing respectively; all rewards within a game are 0 and no discounting is applied (gamma = 1). Therefore, the terminal rewards are also the returns for this game. We draw cards from an infinite deck (that is, with replacement), so there is no advantage to tracking the cards already dealt. The entire game is modeled in Python in the following code.
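
Before building the simulator, it is worth noting how small the state space is: the player's decisions only matter for sums from 12 to 21, the dealer shows one of ten cards (ace through 10), and the player either holds a usable ace or not, which gives 200 states in total. The following sketch (not part of the original code) simply enumerates them:

>>> # Illustrative check of the state-space size (not part of the original code)
>>> # player sum 12 to 21, dealer's showing card 1 to 10 (ace counted as 1), usable ace or not
>>> states = [(s, d, ace) for s in range(12, 22) 
...           for d in range(1, 11) 
...           for ace in (True, False)] 
>>> len(states)   # 200 states in total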

The following snippets of code take inspiration from Shangtong Zhang's Python code for RL, and are published in this book with permission from the student of Richard S. Sutton, the famous author of Reinforcement Learning: An Introduction (details provided in the Further reading section).

The following packages are imported for array manipulation and visualization:

>>> from __future__ import print_function 
>>> import numpy as np 
>>> import matplotlib.pyplot as plt 
>>> from mpl_toolkits.mplot3d import Axes3D 

At each turn, the player or dealer can take one of two possible actions: hit or stand. These are the only two actions possible:

>>> ACTION_HIT = 0 
>>> ACTION_STAND = 1   
>>> actions = [ACTION_HIT, ACTION_STAND] 

The player's policy is modeled with an array of 22 values (indices 0 to 21), as the player goes bust once his sum exceeds 21:

>>> policyPlayer = np.zeros(22) 
 
>>> for i in range(12, 20): 
...     policyPlayer[i] = ACTION_HIT 

The player follows the policy of sticking if his sum is either 20 or 21; otherwise, he keeps hitting the deck to draw a new card:

>>> policyPlayer[20] = ACTION_STAND 
>>> policyPlayer[21] = ACTION_STAND 

The function form of the target policy of a player:

>>> def targetPolicyPlayer(usableAcePlayer, playerSum, dealerCard): 
...     return policyPlayer[playerSum] 

The function form of the behavior policy of a player (it chooses hit or stand with equal probability):

>>> def behaviorPolicyPlayer(usableAcePlayer, playerSum, dealerCard): 
...     if np.random.binomial(1, 0.5) == 1: 
...         return ACTION_STAND 
...     return ACTION_HIT 

The dealer's fixed policy is to keep hitting the deck until his sum reaches 17, and to stick on any sum from 17 to 21:

>>> policyDealer = np.zeros(22) 
>>> for i in range(12, 17): 
...     policyDealer[i] = ACTION_HIT 
>>> for i in range(17, 22): 
...     policyDealer[i] = ACTION_STAND 

The following function draws a new card from the deck with replacement (an ace is returned as 1, and all face cards are capped at 10):

>>> def getCard(): 
...     card = np.random.randint(1, 14) 
...     card = min(card, 10) 
...     return card 
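
As a quick sanity check (not part of the original listing), we can draw a large sample and confirm that roughly 4 out of every 13 cards come back as 10, since the jack, queen, and king are all capped at 10 by the min call above:

>>> # Illustrative sanity check (not part of the original code)
>>> np.mean([getCard() == 10 for _ in range(100000)])   # close to 4/13, roughly 0.31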

Let's play the game!

>>> def play(policyPlayerFn, initialState=None, initialAction=None): 
  1. The player's sum, the player's trajectory, and whether the player uses the ace as 11:
...     playerSum = 0 
...     playerTrajectory = [] 
...     usableAcePlayer = False 
  2. Dealer status of drawing cards:
...     dealerCard1 = 0 
...     dealerCard2 = 0 
...     usableAceDealer = False 
 
...     if initialState is None: 
  3. Generate a random initial state:
...         numOfAce = 0 
  4. Initializing the player's cards:
...         while playerSum < 12: 
  5. If the sum of the player's cards is less than 12, always hit the deck to draw a card:
...             card = getCard() 
...             if card == 1: 
...                 numOfAce += 1 
...                 card = 11 
...                 usableAcePlayer = True 
...             playerSum += card 

 

  6. If the player's sum is larger than 21, he must hold at least one ace (two aces are also possible). In that case, he uses an ace as 1 rather than 11. If the player had only one ace, he no longer has a usable ace:
...         if playerSum > 21: 
...             playerSum -= 10 
...             if numOfAce == 1: 
...                 usableAcePlayer = False 
  7. Initializing the dealer's cards:
...         dealerCard1 = getCard() 
...         dealerCard2 = getCard() 
 
...     else: 
...         usableAcePlayer = initialState[0] 
...         playerSum = initialState[1] 
...         dealerCard1 = initialState[2] 
...         dealerCard2 = getCard() 
  8. Initialize the game state:
...     state = [usableAcePlayer, playerSum, dealerCard1] 
  9. Initializing the dealer's sum:
...     dealerSum = 0 
...     if dealerCard1 == 1 and dealerCard2 != 1: 
...         dealerSum += 11 + dealerCard2 
...         usableAceDealer = True 
...     elif dealerCard1 != 1 and dealerCard2 == 1: 
...         dealerSum += dealerCard1 + 11 
...         usableAceDealer = True 
...     elif dealerCard1 == 1 and dealerCard2 == 1: 
...         dealerSum += 1 + 11 
...         usableAceDealer = True 
...     else: 
...         dealerSum += dealerCard1 + dealerCard2 

 

  10. The game starts here, as the player draws extra cards from this point onwards:
...     while True: 
...         if initialAction is not None: 
...             action = initialAction 
...             initialAction = None 
...         else: 
  11. Get the action based on the player's current sum:
...             action = policyPlayerFn(usableAcePlayer, playerSum, dealerCard1) 
  12. Tracking the player's trajectory for importance sampling:
...         playerTrajectory.append([action, (usableAcePlayer, playerSum, dealerCard1)]) 
 
...         if action == ACTION_STAND: 
...             break 
  13. Get a new card if the action is to hit the deck:
...         playerSum += getCard() 
  14. The player goes bust if his total sum is greater than 21; the game ends and he receives a reward of -1. However, if he has a usable ace at his disposal, he can use it to save the game; otherwise, he loses:
...         if playerSum > 21: 
...             if usableAcePlayer == True: 
...                 playerSum -= 10 
...                 usableAcePlayer = False 
...             else: 
...                 return state, -1, playerTrajectory 
  15. Now it's the dealer's turn. He draws cards based on his sum: once he reaches 17, he stops; otherwise, he keeps drawing. If the dealer exceeds 21 but holds a usable ace, he can use it to avoid going bust; otherwise, he goes bust and the player wins:
...     while True: 
...         action = policyDealer[dealerSum] 
...         if action == ACTION_STAND: 
...             break 
...         dealerSum += getCard() 
...         if dealerSum > 21: 
...             if usableAceDealer == True: 
...                 dealerSum -= 10 
...                 usableAceDealer = False 
...             else: 
...                 return state, 1, playerTrajectory 
  16. Now we compare the player's sum with the dealer's sum to decide who wins, given that neither has gone bust:
...     if playerSum > dealerSum: 
...         return state, 1, playerTrajectory 
...     elif playerSum == dealerSum: 
...         return state, 0, playerTrajectory 
...     else: 
...         return state, -1, playerTrajectory 
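
Before running the full Monte Carlo loops, it helps to simulate a single hand. The following illustrative call (not part of the original listing) shows what play returns: the initial state, the terminal reward, and the player's trajectory of action-state pairs:

>>> # Illustrative single hand (not part of the original code)
>>> state, reward, trajectory = play(targetPolicyPlayer) 
>>> print(state)        # e.g. [False, 15, 7]: usable ace, player sum, dealer's showing card
>>> print(reward)       # +1 for a win, 0 for a draw, -1 for a loss
>>> print(trajectory)   # list of [action, (usable ace, player sum, dealer card)] pairs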

The following code illustrates Monte Carlo sampling with the on-policy approach:

>>> def monteCarloOnPolicy(nEpisodes): 
...     statesUsableAce = np.zeros((10, 10)) 
...     statesUsableAceCount = np.ones((10, 10)) 
...     statesNoUsableAce = np.zeros((10, 10)) 
...     statesNoUsableAceCount = np.ones((10, 10)) 
...     for i in range(0, nEpisodes): 
...         state, reward, _ = play(targetPolicyPlayer) 
...         state[1] -= 12 
...         state[2] -= 1 
...         if state[0]: 
...             statesUsableAceCount[state[1], state[2]] += 1 
...             statesUsableAce[state[1], state[2]] += reward 
...         else: 
...             statesNoUsableAceCount[state[1], state[2]] += 1 
...             statesNoUsableAce[state[1], state[2]] += reward 
...     return statesUsableAce / statesUsableAceCount, statesNoUsableAce / statesNoUsableAceCount 
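
The two 10 x 10 arrays returned are indexed by (player sum - 12, dealer's showing card - 1). As an illustrative usage sketch (not part of the original listing), the estimated value of the state with no usable ace, a player sum of 20, and the dealer showing a 10 can be read off directly:

>>> # Illustrative lookup (not part of the original code)
>>> usableAce, noUsableAce = monteCarloOnPolicy(10000) 
>>> noUsableAce[20 - 12, 10 - 1]   # row: player sum - 12, column: dealer's card - 1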

The following code implements Monte Carlo with Exploring Starts, in which all the returns for each state-action pair are accumulated and averaged, irrespective of which policy was in force when they were observed:

>>> def monteCarloES(nEpisodes): 
...     stateActionValues = np.zeros((10, 10, 2, 2)) 
...     stateActionPairCount = np.ones((10, 10, 2, 2)) 

The behavior policy is greedy, taking the argmax of the average returns for each (s, a):

...     def behaviorPolicy(usableAce, playerSum, dealerCard): 
...         usableAce = int(usableAce) 
...         playerSum -= 12 
...         dealerCard -= 1 
...         return np.argmax(stateActionValues[playerSum, dealerCard, usableAce, :] 
...                       / stateActionPairCount[playerSum, dealerCard, usableAce, :]) 

Play continues for several episodes; in each episode, the state and action are randomly initialized, and the values of the state-action pairs are updated:

...     for episode in range(nEpisodes): 
...         if episode % 1000 == 0: 
...             print('episode:', episode) 
...         initialState = [bool(np.random.choice([0, 1])), 
...                        np.random.choice(range(12, 22)), 
...                        np.random.choice(range(1, 11))] 
...         initialAction = np.random.choice(actions) 
...         _, reward, trajectory = play(behaviorPolicy, initialState, initialAction) 
...         for action, (usableAce, playerSum, dealerCard) in trajectory: 
...             usableAce = int(usableAce) 
...             playerSum -= 12 
...             dealerCard -= 1 

Update values of state-action pairs:

...             stateActionValues[playerSum, dealerCard, usableAce, action] += reward 
...             stateActionPairCount[playerSum, dealerCard, usableAce, action] += 1 
...     return stateActionValues / stateActionPairCount 
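
The returned array has the shape (10, 10, 2, 2), indexed by (player sum - 12, dealer's card - 1, usable ace, action). As an illustrative sketch (not part of the original listing), the greedy action for a particular state is just the argmax over the last axis; a smaller episode count is used here only to keep the example quick:

>>> # Illustrative lookup (not part of the original code): greedy action for
>>> # a player sum of 16, the dealer showing a 10, and no usable ace
>>> Q = monteCarloES(100000) 
>>> int(np.argmax(Q[16 - 12, 10 - 1, 0, :]))   # 0 = hit, 1 = stand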

Plot the state values:

>>> figureIndex = 0 
>>> def prettyPrint(data, tile, zlabel='reward'): 
...     global figureIndex 
...     fig = plt.figure(figureIndex) 
...     figureIndex += 1 
...     fig.suptitle(tile) 
...     ax = fig.add_subplot(111, projection='3d') 
...     x_axis = [] 
...     y_axis = [] 
...     z_axis = [] 
...     for i in range(12, 22): 
...         for j in range(1, 11): 
...             x_axis.append(i) 
...             y_axis.append(j) 
...             z_axis.append(data[i - 12, j - 1]) 
...     ax.scatter(x_axis, y_axis, z_axis,c='red') 
...     ax.set_xlabel('player sum') 
...     ax.set_ylabel('dealer showing') 
...     ax.set_zlabel(zlabel) 

On-policy results with and without a usable ace for 10,000 and 500,000 episodes:

>>> def onPolicy(): 
...     statesUsableAce1, statesNoUsableAce1 = monteCarloOnPolicy(10000) 
...     statesUsableAce2, statesNoUsableAce2 = monteCarloOnPolicy(500000) 
...     prettyPrint(statesUsableAce1, 'Usable Ace & 10000 Episodes') 
...     prettyPrint(statesNoUsableAce1, 'No Usable Ace & 10000 Episodes') 
...     prettyPrint(statesUsableAce2, 'Usable Ace & 500000 Episodes') 
...     prettyPrint(statesNoUsableAce2, 'No Usable Ace & 500000 Episodes') 
...     plt.show() 
     
     

Optimal policy and state values from Monte Carlo control with Exploring Starts:

>>> def MC_ES_optimalPolicy(): 
...     stateActionValues = monteCarloES(500000) 
...     stateValueUsableAce = np.zeros((10, 10)) 
...     stateValueNoUsableAce = np.zeros((10, 10)) 
...     # get the optimal policy 
...     actionUsableAce = np.zeros((10, 10), dtype='int') 
...     actionNoUsableAce = np.zeros((10, 10), dtype='int') 
...     for i in range(10): 
...         for j in range(10): 
...             stateValueNoUsableAce[i, j] = np.max(stateActionValues[i, j, 0, :]) 
...             stateValueUsableAce[i, j] = np.max(stateActionValues[i, j, 1, :]) 
...             actionNoUsableAce[i, j] = np.argmax(stateActionValues[i, j, 0, :]) 
...             actionUsableAce[i, j] = np.argmax(stateActionValues[i, j, 1, :]) 
...     prettyPrint(stateValueUsableAce, 'Optimal state value with usable Ace') 
...     prettyPrint(stateValueNoUsableAce, 'Optimal state value with no usable Ace') 
...     prettyPrint(actionUsableAce, 'Optimal policy with usable Ace', 'Action (0 Hit, 1 Stick)') 
...     prettyPrint(actionNoUsableAce, 'Optimal policy with no usable Ace', 'Action (0 Hit, 1 Stick)') 
...     plt.show() 
 
# Run on-policy function 
>>> onPolicy()

From the preceding plots, we can conclude that a usable ace in a hand gives much higher rewards even at low player-sum combinations, whereas for a player without a usable ace, the values are markedly lower whenever the player's sum is less than 20.

# Run Monte Carlo control with Exploring Starts 
>>> MC_ES_optimalPolicy() 

From the optimal policies and state values, we can conclude that, with a usable ace at our disposal, we can afford to hit more often rather than stick, and the state values (rewards) are much higher than when there is no ace in hand. Though these conclusions are obvious, they show the magnitude of the impact of holding an ace in a hand.
