GittinsBrezziLaiPolicy {contextual} R Documentation

## Policy: Gittins Approximation algorithm for choosing arms in a MAB problem.

### Description

`GittinsBrezziLaiPolicy` Algorithm based on Brezzi and Lai (2002) "Optimal learning and experimentation in bandit problems."

### Details

The algorithm provides an approximation of the Gittins index, by specifying a closed-form expression, which is a function of the discount factor, and the number of successes and failures associated with each arm.

### Usage

```  policy <- GittinsBrezziLaiPolicy\$new(discount=0.95, prior=NULL)
```

### Arguments

`discount`

numeric; discount factor

`prior`

numeric matrix; prior beliefs over Bernoulli parameters governing each arm. Beliefs are specified by Beta distribution with two parameters (alpha,beta) where alpha = number of success, beta = number of failures. Matrix is of arms times two (alpha / beta) dimensions

### Methods

`new(discount=0.95, prior=NULL)`

Generates and initializes a new `Policy` object.

`get_action(t, context)`

arguments:

• `t`: integer, time step `t`.

• `context`: list, containing the current `context\$X` (d x k context matrix), `context\$k` (number of arms) and `context\$d` (number of context features)

computes which arm to play based on the current values in named list `theta` and the current `context`. Returns a named list containing `action\$choice`, which holds the index of the arm to play.

`set_reward(t, context, action, reward)`

arguments:

• `t`: integer, time step `t`.

• `context`: list, containing the current `context\$X` (d x k context matrix), `context\$k` (number of arms) and `context\$d` (number of context features) (as set by `bandit`).

• `action`: list, containing `action\$choice` (as set by `policy`).

• `reward`: list, containing `reward\$reward` and, if available, `reward\$optimal` (as set by `bandit`).

utilizes the above arguments to update and return the set of parameters in list `theta`.

`set_parameters()`

Helper function, called during a Policy's initialisation, assigns the values it finds in list `self\$theta_to_arms` to each of the Policy's k arms. The parameters defined here can then be accessed by arm index in the following way: `theta[[index_of_arm]]\$parameter_name`.

### References

Brezzi, M., & Lai, T. L. (2002). Optimal learning and experimentation in bandit problems. Journal of Economic Dynamics and Control, 27(1), 87-108.

Implementation follows https://github.com/elarry/bandit-algorithms-simulated

Core contextual classes: `Bandit`, `Policy`, `Simulator`, `Agent`, `History`, `Plot`
Bandit subclass examples: `BasicBernoulliBandit`, `ContextualLogitBandit`, `OfflineReplayEvaluatorBandit`
Policy subclass examples: `EpsilonGreedyPolicy`, `ContextualLinTSPolicy`