BootstrapTSPolicy {contextual} | R Documentation
Bootstrap Thompson Sampling
Bootstrap Thompson Sampling (BTS) is a heuristic method for solving bandit problems. It modifies Thompson Sampling (see ThompsonSamplingPolicy) by replacing the posterior distribution with a bootstrap distribution.
policy <- BootstrapTSPolicy$new(J = 100, a = 1, b = 1)
policy <- BootstrapTSPolicy$new(1000)
new(J = 100, a = 1, b = 1)
Generates a new BootstrapTSPolicy object. Arguments are defined in the Arguments section above.
set_parameters()
Each policy assigns the parameters it wants to keep track of to the list self$theta_to_arms, which has to be defined in the body of set_parameters(). The parameters defined here can later be accessed by arm index in the following way: theta[[index_of_arm]]$parameter_name
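The access pattern above can be illustrated with plain R lists. This is a minimal sketch, not the package's internal code: it assumes that J is the number of bootstrap replicates, that a and b are the initial per-replicate success and failure counts, and that one copy of the parameter template is kept per arm. The variables k, theta and index_of_arm are names chosen for illustration.

```r
# Hypothetical illustration; the variable names are assumptions,
# not the package's internals.
k <- 3    # number of arms
J <- 100  # assumed: number of bootstrap replicates per arm
a <- 1    # assumed: initial success count per replicate
b <- 1    # assumed: initial failure count per replicate

# Template of the parameters to keep track of for a single arm.
theta_to_arms <- list(alpha = rep(a, J), beta = rep(b, J))

# One copy of the template per arm, so parameters are indexed by arm.
theta <- lapply(seq_len(k), function(arm) theta_to_arms)

# Access a parameter by arm index.
index_of_arm <- 2
n_replicates <- length(theta[[index_of_arm]]$alpha)
```

Keeping one list per arm means an update to one arm's replicates never touches the others.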
get_action(context)
Here, a policy decides which arm to choose, based on the current values of its parameters and, potentially, the current context.
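For a bootstrap policy, the choice of arm could look roughly like the sketch below. It follows the description in the Eckles and Kaptein reference: for each arm, one bootstrap replicate is drawn at random and its point estimate of the mean reward is used, and the arm with the highest estimate is played. The helper choose_arm() and the layout of theta (one list with alpha and beta vectors per arm) are assumptions made for illustration, not the package's own method.

```r
# Sketch only: choose_arm() is a hypothetical helper, not part of the package.
choose_arm <- function(theta) {
  # For each arm, draw one bootstrap replicate at random and use its
  # point estimate alpha / (alpha + beta) of the mean reward.
  estimates <- vapply(theta, function(arm) {
    j <- sample.int(length(arm$alpha), 1)
    arm$alpha[j] / (arm$alpha[j] + arm$beta[j])
  }, numeric(1))
  # Break ties between equally good arms uniformly at random.
  best <- which(estimates == max(estimates))
  best[sample.int(length(best), 1)]
}

# Two arms with 5 replicates each; every replicate of arm 2 has seen
# more successes than any replicate of arm 1.
theta <- list(
  list(alpha = rep(1, 5), beta = rep(4, 5)),
  list(alpha = rep(4, 5), beta = rep(1, 5))
)
choice <- choose_arm(theta)
```

The randomness that drives exploration comes from which replicate is drawn, so arms whose replicates disagree are still tried occasionally.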
set_reward(reward, context)
In set_reward(reward, context), a policy updates its parameter values based on the reward received and, potentially, the current context.
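A reward update for a bootstrap policy might be sketched as follows. It assumes binary rewards (0 or 1) and the double-or-nothing online bootstrap described in the Eckles and Kaptein reference, where each replicate of the played arm incorporates the new observation with probability 1/2. The helper update_arm() and the layout of theta are hypothetical and serve only to illustrate the idea.

```r
# Sketch only: update_arm() is a hypothetical helper, not part of the package.
update_arm <- function(theta, arm, reward) {
  J <- length(theta[[arm]]$alpha)
  # Double-or-nothing bootstrap: each replicate sees the observation
  # with probability 1/2.
  hit <- which(rbinom(J, 1, 0.5) == 1)
  # Assumed binary reward: a success raises alpha, a failure raises beta.
  theta[[arm]]$alpha[hit] <- theta[[arm]]$alpha[hit] + reward
  theta[[arm]]$beta[hit]  <- theta[[arm]]$beta[hit] + 1 - reward
  theta
}

set.seed(1)
theta <- list(
  list(alpha = rep(1, 100), beta = rep(1, 100)),
  list(alpha = rep(1, 100), beta = rep(1, 100))
)
# Arm 1 was played and returned a reward of 1.
theta <- update_arm(theta, arm = 1, reward = 1)
```

Only the replicates of the played arm change. The other arms keep their counts until they are chosen.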
Eckles, D., & Kaptein, M. (2014). Thompson sampling with the online bootstrap. arXiv preprint arXiv:1410.4009.
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4), 285-294.
Core contextual classes: Bandit, Policy, Simulator, Agent, History, Plot
Bandit subclass examples: BasicBernoulliBandit, ContextualLogitBandit, OfflineReplayEvaluatorBandit
Policy subclass examples: EpsilonGreedyPolicy, ContextualLinTSPolicy