ContextualWheelBandit {contextual} R Documentation

## Bandit: ContextualWheelBandit

### Description

Samples from the Wheel bandit game.

### Details

The Wheel bandit game offers an artificial problem in which the need for exploration is smoothly parameterized through the exploration parameter `delta`.

In the game, contexts are sampled uniformly at random from a unit circle divided into one central and four edge areas, for a total of `k = 5` possible actions. The central area offers a normally distributed reward that is independent of the context, whereas the outer areas offer a normally distributed reward that depends on the `d = 2` dimensional context.
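The sampling scheme described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the helper names `sample_wheel_context` and `sample_wheel_rewards` are hypothetical, contexts are assumed to be uniform over the unit disk, and the mapping of quadrants to arms 1 to 4 (with arm 5 as the central action) is an assumption.

```r
# Hypothetical helper: draw one 2-dimensional context uniformly from the
# unit disk by rejection sampling from the enclosing square.
sample_wheel_context <- function() {
  repeat {
    x <- runif(2, min = -1, max = 1)
    if (sqrt(sum(x^2)) <= 1) return(x)
  }
}

# Hypothetical helper: draw one reward per arm for context x.
sample_wheel_rewards <- function(x, delta, mean_v, std_v, mu_large, std_large) {
  # Baseline rewards that do not depend on the context.
  rewards <- rnorm(length(mean_v), mean = mean_v, sd = std_v)
  if (sqrt(sum(x^2)) >= delta) {
    # Assumed quadrant-to-arm mapping: the sign pattern of x picks one of arms 1..4.
    arm <- if (x[1] > 0) {
      if (x[2] > 0) 1L else 2L
    } else {
      if (x[2] > 0) 3L else 4L
    }
    rewards[arm] <- rnorm(1, mean = mu_large, sd = std_large)
  }
  rewards
}

set.seed(42)
x <- sample_wheel_context()
r <- sample_wheel_rewards(x, delta = 0.95, mean_v = c(1, 1, 1, 1, 1.2),
                          std_v = rep(0.05, 5), mu_large = 50, std_large = 0.01)
```

With these settings, a context near the center leaves arm 5 with the highest expected reward, while a context near the edge makes exactly one of the outer arms far more rewarding.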

### Usage

```
bandit <- ContextualWheelBandit$new(delta, mean_v, std_v, mu_large, std_large)
```

### Arguments

`delta`

numeric; exploration parameter. One of the outer regions offers a high reward if the context norm is above `delta`.

`mean_v`

numeric vector; mean reward for each action if context norm is below delta.

`std_v`

numeric vector; standard deviation of the Gaussian reward for each action if context norm is below delta.

`mu_large`

numeric; mean reward for optimal action if context norm is above delta.

`std_large`

numeric; standard deviation of the reward for optimal action if context norm is above delta.
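To get a feel for how `delta` controls the need for exploration, the following sketch computes the share of contexts that fall in the high-reward outer band. It assumes contexts are uniform over the unit disk, so the share with norm at least `delta` is the area ratio `1 - delta^2`; the function name `high_reward_share` is hypothetical.

```r
# Share of contexts whose norm is at least delta, assuming contexts
# are uniformly distributed over the unit disk (area ratio).
high_reward_share <- function(delta) 1 - delta^2

deltas <- c(0.5, 0.7, 0.9, 0.95, 0.99)
shares <- data.frame(delta = deltas, share = high_reward_share(deltas))
print(shares)
```

As `delta` approaches 1 the high-reward band shrinks, so a policy must explore more before it encounters the large reward.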

### Methods

`new(delta, mean_v, std_v, mu_large, std_large)`

creates and initializes a new `ContextualWheelBandit` instance.

`get_context(t)`

argument:

• `t`: integer, time step `t`.

returns a named `list` containing the current `d x k` dimensional matrix `context$X`, the number of arms `context$k` and the number of features `context$d`.

`get_reward(t, context, action)`

arguments:

• `t`: integer, time step `t`.

• `context`: list, containing the current `context$X` (d x k context matrix), `context$k` (number of arms) and `context$d` (number of context features) (as set by `bandit`).

• `action`: list, containing `action$choice` (as set by `policy`).

returns a named `list` containing `reward$reward` and, where computable, `reward$optimal` (used by "oracle" policies and to calculate regret).
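The methods above are normally called by a `Simulator`, but they can also be called by hand for a single time step. The following is a minimal sketch, assuming the `contextual` package is installed and that the methods accept the arguments listed above; the parameter values are illustrative only, and the block is skipped when the package is unavailable.

```r
# Runs only if the contextual package is available.
if (requireNamespace("contextual", quietly = TRUE)) {
  library(contextual)

  bandit  <- ContextualWheelBandit$new(delta = 0.95,
                                       mean_v = c(1, 1, 1, 1, 1.2),
                                       std_v = rep(0.05, 5),
                                       mu_large = 50,
                                       std_large = 0.01)

  # Ask the bandit for a context, then play arm 5 by hand.
  context <- bandit$get_context(t = 1)
  action  <- list(choice = 5)
  reward  <- bandit$get_reward(t = 1, context, action)

  str(context)          # k, d and the sampled context X
  print(reward$reward)  # reward obtained for the chosen arm
}
```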

### References

Riquelme, C., Tucker, G., & Snoek, J. (2018). Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling. arXiv preprint arXiv:1802.09127.

Implementation follows https://github.com/tensorflow/models/tree/master/research/deep_contextual_bandits

### See Also

Core contextual classes: `Bandit`, `Policy`, `Simulator`, `Agent`, `History`, `Plot`

Bandit subclass examples: `BasicBernoulliBandit`, `ContextualLogitBandit`, `OfflineReplayEvaluatorBandit`

Policy subclass examples: `EpsilonGreedyPolicy`, `ContextualLinTSPolicy`

### Examples

```
## Not run:

library(contextual)

horizon       <- 1000L
simulations   <- 10L

delta         <- 0.95
num_actions   <- 5
context_dim   <- 2
mean_v        <- c(1.0, 1.0, 1.0, 1.0, 1.2)
std_v         <- c(0.05, 0.05, 0.05, 0.05, 0.05)
mu_large      <- 50
std_large     <- 0.01

bandit        <- ContextualWheelBandit$new(delta, mean_v, std_v, mu_large, std_large)

# Compare a context-free policy with a contextual one on the same bandit
agents        <- list(Agent$new(UCB1Policy$new(), bandit),
                      Agent$new(LinUCBDisjointOptimizedPolicy$new(0.6), bandit))

simulation    <- Simulator$new(agents, horizon, simulations)
history       <- simulation$run()

plot(history, type = "cumulative", regret = FALSE, rate = TRUE, legend_position = "bottomright")

## End(Not run)
```

[Package contextual version 0.9.8.4 Index]