Solving the MABP with UCB and Thompson sampling algorithms

In this project, we will use the upper confidence bound (UCB) and Thompson sampling algorithms to solve the multi-armed bandit problem (MABP). We will compare their performance and strategies in three different situations: standard rewards, standard but more volatile rewards, and somewhat chaotic rewards. Let's first prepare the simulation data and then view it using the following code:

# loading the required packages
library(ggplot2)
library(reshape2)
# distribution of arms or actions having normally distributed
# rewards with small variance
# The data represents a standard, ideal situation, i.e.
# normally distributed rewards that are well separated from each other.
mean_reward = c(5, 7.5, 10, 12.5, 15, 17.5, 20, 22.5, 25, 26)
reward_dist = c(function(n) rnorm(n = n, mean = mean_reward[1], sd = 2.5),
                function(n) rnorm(n = n, mean = mean_reward[2], sd = 2.5),
                function(n) rnorm(n = n, mean = mean_reward[3], sd = 2.5),
                function(n) rnorm(n = n, mean = mean_reward[4], sd = 2.5),
                function(n) rnorm(n = n, mean = mean_reward[5], sd = 2.5),
                function(n) rnorm(n = n, mean = mean_reward[6], sd = 2.5),
                function(n) rnorm(n = n, mean = mean_reward[7], sd = 2.5),
                function(n) rnorm(n = n, mean = mean_reward[8], sd = 2.5),
                function(n) rnorm(n = n, mean = mean_reward[9], sd = 2.5),
                function(n) rnorm(n = n, mean = mean_reward[10], sd = 2.5))
#preparing simulation data
dataset = matrix(nrow = 10000, ncol = 10)
for(i in 1:10){
dataset[, i] = reward_dist[[i]](n = 10000)
}
# assigning column names
colnames(dataset) <- 1:10
# viewing the newly created dataset of simulated rewards
View(dataset)

This will give the following output:

Now, create a melted dataset with arm and reward columns, and then convert the arm column to a factor (nominal type) using the following code:

# creating a melted dataset with arm and reward combination
dataset_p = melt(dataset)[, 2:3]
colnames(dataset_p) <- c("Bandit", "Reward")
# converting the arms column in the dataset to nominal type
dataset_p$Bandit = as.factor(dataset_p$Bandit)
# viewing the melted dataset
View(dataset_p)

This will give us the following output:

Now, plot the distributions of rewards from bandits using the following code:

# plotting the distributions of rewards from bandits
ggplot(dataset_p, aes(x = Reward, col = Bandit, fill = Bandit)) +
geom_density(alpha = 0.3) +
labs(title = "Reward from different bandits")

This will give us the following output:

Now let's implement the UCB algorithm on the hypothesized arms with normal distributions using the following code:

# implementing the upper confidence bound (UCB) algorithm
UCB <- function(N = 1000, reward_data){
  d = ncol(reward_data)
  bandit_selected = integer(0)
  numbers_of_selections = integer(d)
  sums_of_rewards = integer(d)
  total_reward = 0
  for (n in 1:N) {
    max_upper_bound = 0
    for (i in 1:d) {
      if (numbers_of_selections[i] > 0){
        average_reward = sums_of_rewards[i] / numbers_of_selections[i]
        delta_i = sqrt(2 * log(1 + n * log(n)^2) /
                         numbers_of_selections[i])
        upper_bound = average_reward + delta_i
      } else {
        # an arm that has never been played gets an infinite bound,
        # so every arm is tried at least once
        upper_bound = Inf
      }
      if (upper_bound > max_upper_bound){
        max_upper_bound = upper_bound
        bandit = i
      }
    }
    bandit_selected = append(bandit_selected, bandit)
    numbers_of_selections[bandit] = numbers_of_selections[bandit] + 1
    reward = reward_data[n, bandit]
    sums_of_rewards[bandit] = sums_of_rewards[bandit] + reward
    total_reward = total_reward + reward
  }
  return(list(total_reward = total_reward,
              bandit_selected = bandit_selected,
              numbers_of_selections = numbers_of_selections,
              sums_of_rewards = sums_of_rewards))
}
# running the UCB algorithm on our
# hypothesized arms with normal distributions
UCB(N = 1000, reward_data = dataset)

You will get the following as the resultant output:

$total_reward
1
25836.91
$numbers_of_selections
[1] 1 1 1 1 1 1 2 1 23 968
$sums_of_rewards
[1] 4.149238 10.874230 5.998070 11.951624 18.151797 21.004781 44.266832 19.370479 563.001692
[10] 25138.139942
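Although the printed list omits the full bandit_selected vector, it is returned by the function, so we can use it to see how quickly UCB locks onto a single arm. The following is a minimal sketch that is not part of the original walkthrough (the ucb_result and selection_history names are just illustrative):

# storing the UCB result so that we can inspect the selection history
ucb_result = UCB(N = 1000, reward_data = dataset)
# plotting which bandit was selected at each round
selection_history = data.frame(Round = seq_along(ucb_result$bandit_selected),
                               Bandit = as.factor(ucb_result$bandit_selected))
ggplot(selection_history, aes(x = Round, y = Bandit)) +
  geom_point(alpha = 0.3) +
  labs(title = "Bandit selected by UCB at each round")

After the first exploratory pass over the arms, the points should sit almost entirely on the row of the 10th bandit.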

Next, we will implement the Thompson sampling algorithm, which uses a normal-gamma prior and a normal likelihood to estimate the posterior distributions, with the following code:

# Thompson sampling algorithm
# helper that draws n samples from a normal-gamma distribution
# with parameters (mu, lambda, alpha, beta)
rnormgamma <- function(n, mu, lambda, alpha, beta){
  if (length(n) > 1)
    n <- length(n)
  tau <- rgamma(n, alpha, beta)
  # given the precision tau, x ~ N(mu, 1 / (lambda * tau)),
  # so the standard deviation is 1 / sqrt(lambda * tau)
  x <- rnorm(n, mu, 1 / sqrt(lambda * tau))
  data.frame(tau = tau, x = x)
}
T.samp <- function(N = 500, reward_data, mu0 = 0, v = 1, alpha = 2,
                   beta = 6){
  d = ncol(reward_data)
  bandit_selected = integer(0)
  numbers_of_selections = integer(d)
  sums_of_rewards = integer(d)
  total_reward = 0
  reward_history = vector("list", d)
  for (n in 1:N){
    max_random = -Inf
    for (i in 1:d){
      if (numbers_of_selections[i] >= 1){
        # sampling a plausible mean reward from the normal-gamma
        # posterior of arm i, given its observed rewards
        n_i = numbers_of_selections[i]
        mean_i = mean(reward_history[[i]])
        rand = rnormgamma(1,
                          (v * mu0 + n_i * mean_i) / (v + n_i),
                          v + n_i,
                          alpha + n_i / 2,
                          beta + sum((reward_history[[i]] - mean_i) ^ 2) / 2 +
                            ((n_i * v) / (v + n_i)) * (mean_i - mu0) ^ 2 / 2)$x
      } else {
        # arms that have never been played are sampled from the prior
        rand = rnormgamma(1, mu0, v, alpha, beta)$x
      }
      if (rand > max_random){
        max_random = rand
        bandit = i
      }
    }
    bandit_selected = append(bandit_selected, bandit)
    numbers_of_selections[bandit] = numbers_of_selections[bandit] + 1
    reward = reward_data[n, bandit]
    sums_of_rewards[bandit] = sums_of_rewards[bandit] + reward
    total_reward = total_reward + reward
    reward_history[[bandit]] = append(reward_history[[bandit]], reward)
  }
  return(list(total_reward = total_reward,
              bandit_selected = bandit_selected,
              numbers_of_selections = numbers_of_selections,
              sums_of_rewards = sums_of_rewards))
}
# Applying Thompson sampling with a normal-gamma prior and normal likelihood
# to estimate the posterior distributions
T.samp(N = 1000, reward_data = dataset, mu0 = 40)

You will get the following as the resultant output:

$total_reward
10
24434.24
$numbers_of_selections
[1] 16 15 15 14 14 17 16 19 29 845
$sums_of_rewards
[1] 80.22713 110.09657 141.14346 171.41301 212.86899 293.30138 311.12230 423.93256 713.54105 21976.59855

From the results, we can infer that the UCB algorithm quickly identified the 10th arm as the one yielding the highest reward, selecting it in 968 of the 1,000 rounds. We also observe that Thompson sampling tried each of the suboptimal arms many more times before settling on the best one.
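To put a number on this difference, a small sketch (again not part of the original code; Thompson sampling is stochastic, so the exact figures will differ from run to run) is to compute the fraction of rounds each algorithm spends on the 10th bandit:

# fraction of the 1,000 rounds spent on the best (10th) bandit
ucb_run = UCB(N = 1000, reward_data = dataset)
ts_run = T.samp(N = 1000, reward_data = dataset, mu0 = 40)
c(UCB = mean(ucb_run$bandit_selected == 10),
  Thompson = mean(ts_run$bandit_selected == 10))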

Now, let's simulate data for bandits whose rewards are normally distributed with a large variance and plot the reward distributions using the following code:

# Distribution of bandits / actions having normally distributed rewards with large variance
# This data represents an ideal but more unstable situation: normally distributed rewards
# with much larger variance, and thus not well separated from each other.
mean_reward = c(5, 7.5, 10, 12.5, 15, 17.5, 20, 22.5, 25, 26)
reward_dist = c(function(n) rnorm(n = n, mean = mean_reward[1], sd = 20),
                function(n) rnorm(n = n, mean = mean_reward[2], sd = 20),
                function(n) rnorm(n = n, mean = mean_reward[3], sd = 20),
                function(n) rnorm(n = n, mean = mean_reward[4], sd = 20),
                function(n) rnorm(n = n, mean = mean_reward[5], sd = 20),
                function(n) rnorm(n = n, mean = mean_reward[6], sd = 20),
                function(n) rnorm(n = n, mean = mean_reward[7], sd = 20),
                function(n) rnorm(n = n, mean = mean_reward[8], sd = 20),
                function(n) rnorm(n = n, mean = mean_reward[9], sd = 20),
                function(n) rnorm(n = n, mean = mean_reward[10], sd = 20))
#preparing simulation data
dataset = matrix(nrow = 10000, ncol = 10)
for(i in 1:10){
dataset[, i] = reward_dist[[i]](n = 10000)
}
colnames(dataset) <- 1:10
dataset_p = melt(dataset)[, 2:3]
colnames(dataset_p) <- c("Bandit", "Reward")
dataset_p$Bandit = as.factor(dataset_p$Bandit)
#plotting the distributions of rewards from bandits
ggplot(dataset_p, aes(x = Reward, col = Bandit, fill = Bandit)) +
geom_density(alpha = 0.3) +
labs(title = "Reward from different bandits")

You will get the following graph as the resultant output:

Apply UCB on rewards with higher variance using the following code:

# Applying UCB on rewards with higher variance
UCB(N = 1000, reward_data = dataset)

You will get the following output:

$total_reward
1
25321.39
$numbers_of_selections
[1] 1 1 1 3 1 1 2 6 903 81
$sums_of_rewards
[1] 2.309649 -6.982907 -24.654597 49.186498 8.367174 -16.211632 31.243270 104.190075 23559.216706 1614.725305

Next, apply Thompson sampling on rewards with higher variance by using the following code:

# Applying Thompson sampling on rewards with higher variance
T.samp(N = 1000, reward_data = dataset, mu0 = 40)

You will get the following output:

$total_reward
2
24120.94
$numbers_of_selections
[1] 16 15 14 15 15 17 20 21 849 18
$sums_of_rewards
[1] 94.27878 81.42390 212.00717 181.46489 140.43908 249.82014 368.52864 397.07629 22090.20740 305.69191

From the results, we can infer that when rewards fluctuate more strongly, the UCB algorithm is more susceptible to getting stuck at a suboptimal choice and may never find the optimal bandit: here it committed almost entirely to the 9th bandit, whose mean reward (25) is barely distinguishable from the best bandit's (26) under such a large variance. In this particular run, Thompson sampling also concentrated on the 9th bandit, but it kept sampling the other arms throughout; because it continues to explore in proportion to its uncertainty, Thompson sampling is generally the more robust of the two at eventually finding the optimal bandit.
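Because the sampling step is random, one way to check how often Thompson sampling actually separates the two closely matched top arms on this high-variance data is to repeat the run several times and tabulate which arm ends up being pulled the most. This is a rough sketch rather than part of the original code, and the resulting table will vary between executions:

# which arm does Thompson sampling favour across repeated runs on the same data?
best_arm = replicate(20, which.max(T.samp(N = 1000, reward_data = dataset,
                                          mu0 = 40)$numbers_of_selections))
table(best_arm)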

Now let's simulate bandit data with more chaotic reward distributions and plot the rewards from the bandits using the following code:

# Distribution of bandits / actions with rewards of different distributions
# This data represents a more chaotic (and possibly more realistic) situation:
# rewards with different distributions and different variances.
mean_reward = c(5, 7.5, 10, 12.5, 15, 17.5, 20, 22.5, 25, 26)
reward_dist = c(function(n) rnorm(n = n, mean = mean_reward[1], sd = 20),
                function(n) rgamma(n = n, shape = mean_reward[2] / 2, rate = 0.5),
                function(n) rpois(n = n, lambda = mean_reward[3]),
                function(n) runif(n = n, min = mean_reward[4] - 20, max = mean_reward[4] + 20),
                function(n) rlnorm(n = n, meanlog = log(mean_reward[5]) - 0.25, sdlog = 0.5),
                function(n) rnorm(n = n, mean = mean_reward[6], sd = 20),
                function(n) rexp(n = n, rate = 1 / mean_reward[7]),
                function(n) rbinom(n = n, size = mean_reward[8] / 0.5, prob = 0.5),
                function(n) rnorm(n = n, mean = mean_reward[9], sd = 20),
                function(n) rnorm(n = n, mean = mean_reward[10], sd = 20))
#preparing simulation data
dataset = matrix(nrow = 10000, ncol = 10)
for(i in 1:10){
dataset[, i] = reward_dist[[i]](n = 10000)
}
colnames(dataset) <- 1:10
dataset_p = melt(dataset)[, 2:3]
colnames(dataset_p) <- c("Bandit", "Reward")
dataset_p$Bandit = as.factor(dataset_p$Bandit)
#plotting the distributions of rewards from bandits
ggplot(dataset_p, aes(x = Reward, col = Bandit, fill = Bandit)) +
geom_density(alpha = 0.3) +
labs(title = "Reward from different bandits")

You will get the following graph as the resultant output:

Apply UCB on rewards with different distributions by using the following code:

# Applying UCB on rewards with different distributions
UCB(N = 1000, reward_data = dataset)

You will get the following output:

$total_reward
1
22254.18
$numbers_of_selections
[1] 1 1 1 1 1 1 1 926 61 6
$sums_of_rewards
[1] 6.810026 3.373098 8.000000 12.783859 12.858791 11.835287 1.616978 20755.000000 1324.564987 117.335467

Next, apply Thompson sampling on rewards with different distributions by using the following code:

# Applying Thompson sampling on rewards with different distributions
T.samp(N = 1000, reward_data = dataset, mu0 = 40)

You will get the following as the resultant output:

$total_reward
2
24014.36
$numbers_of_selections
[1] 16 14 14 14 14 15 14 51 214 634
$sums_of_rewards
[1] 44.37095 127.57153 128.00000 142.66207 191.44695 169.10430 150.19486 1168.00000 5201.69130 16691.32118

From the preceding results, we see that the total rewards of the two algorithms are broadly similar, although in this run Thompson sampling eventually concentrated on the optimal 10th bandit while UCB settled on the 8th. A major reason the Thompson sampling algorithm tries all bandits several times before choosing the one it considers best is that we chose a prior distribution with a relatively high mean in this project. With a larger prior mean, the algorithm favors exploration over exploitation at the beginning. Only when it becomes very confident that it has found the best choice does it value exploitation over exploration. If we decrease the mean of the prior, exploitation is valued more highly and the algorithm stops exploring sooner. By altering the prior distribution, you can adjust the relative importance of exploration versus exploitation to suit the specific problem at hand. This is further evidence of the flexibility of the Thompson sampling algorithm.
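To see the effect of the prior mean directly, you can rerun the sampler with a smaller mu0 and compare the selection counts. This is only an illustrative sketch, not part of the original code, and since the sampling step is random the exact counts will vary from run to run:

# with a lower prior mean, exploitation becomes attractive sooner, so the
# algorithm typically stops exploring earlier (at the risk of settling on a
# suboptimal arm); compare these counts with the mu0 = 40 run above
T.samp(N = 1000, reward_data = dataset, mu0 = 10)$numbers_of_selections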
