slpMBMF {catlearn}
MB/MF reinforcement learning model
Description
Gillan et al.'s (2015) model-free / model-based hybrid Reinforcement Learning model (see Note 1).
Usage
slpMBMF(st, tr, xtdo = FALSE)
Arguments
st - List of model parameters.
tr - Matrix of training items.
xtdo - Boolean. When TRUE, extended output is provided; see below.
Details
The contents of this help file are relatively brief; a more extensive discussion of this model can be found in the supplementary materials of Gillan et al. (2015).
The function operates as a stateful list processor (slp; see Wills et al., 2017). Specifically, it takes a matrix (tr) as an argument, where each row represents a single training trial and each column represents one type of information required by the model. It returns a matrix of predicted response probabilities for each stage 1 action on each trial, along with the model's final Q values.
The current implementation of slpMBMF deals only with relatively simple Reinforcement Learning experiments, of which Gillan et al. (2015, Exp. 2) is one example. Specifically, each trial has two stages. In the first stage of the trial, there is a single state, and the participant can emit one of x actions. In the second stage, there are y states. A reward follows (or doesn't) without a further action from the participant.
A hybrid MB/MF model thus has 2x Q-values at stage 1 (x for the model-based system, and x for the model-free system), and y Q-values at stage 2 (one for each state; there are no actions at stage 2, and the MB and MF systems evaluate stage 2 Q-values the same way in this model). See Note 3.
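As a rough illustration of this valuation scheme (a sketch of the standard hybrid computation, not code taken from the package source, with arbitrary illustrative values), the stage 1 values can be thought of as follows:
## Sketch only: illustrative values, not fitted estimates
q2    <- c(0.6, 0.4)                    # stage 2 Q values (one per state, y = 2)
tprob <- matrix(c(0.7, 0.3,
                  0.3, 0.7), nrow = 2, byrow = TRUE)
q1.mf <- c(0.5, 0.5)                    # model-free stage 1 Q values (x = 2)
w     <- 0.4                            # model-based weight
q1.mb <- as.vector(tprob %*% q2)        # model-based stage 1 Q values:
                                        # transition-weighted average of stage 2 values
q1.h  <- w * q1.mb + (1 - w) * q1.mf    # hybrid stage 1 Q values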
Argument st must be a list containing the following items:
alpha - the model-free learning rate (range: 0-1).
lambda - the eligibility trace parameter (range: 0-1).
w - a number between 0 and 1, representing the relative contribution of the model-based and model-free parts of the model to the response (0 = pure model-free, 1 = pure model-based).
beta - decision stochasticity parameter.
p - decision perseveration (p > 0) or switching (p < 0) parameter.
tprob - a 2 x 2 matrix of transition probabilities, used by the model-based system. The rows are the actions at stage 1. The columns are the states at stage 2. The cells are transition probabilities (e.g. tprob[2,1] is the probability of arriving at stage 2 state #1, given action #2 at stage 1).
q1.mf - a vector of initial model-free Q values for the actions at stage 1.
q1.mb - a vector of initial model-based Q values for the actions at stage 1.
q2 - a vector of initial Q values for the states at stage 2 (the MB and MF systems share common Q values at stage 2).
If you are unsure what initial Q values to use, set all to 0.5.
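For example, a parameter list of the required form (arbitrary illustrative values, not fitted estimates) could be set up like this:
st <- list(alpha  = 0.3,
           lambda = 0.6,
           w      = 0.4,
           beta   = 5,
           p      = 0.1,
           tprob  = matrix(c(0.7, 0.3,
                             0.3, 0.7), nrow = 2, byrow = TRUE),
           q1.mf  = c(0.5, 0.5),
           q1.mb  = c(0.5, 0.5),
           q2     = c(0.5, 0.5))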
Argument tr must be a matrix, where each row is one trial. Trials are always presented to the model in the order specified. The matrix must contain the following named columns (other columns will be ignored):
s1.act - The action made by the participant at stage 1, for each trial; must be an integer in the range 1-x.
s2.state - State of the environment at stage 2, for each trial; must be an integer in the range 1-y.
t - Reward signal for the trial; must be a real number. If you're unsure what to use here, use 1 = rewarded, 0 = not rewarded.
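For example, a minimal three-trial training matrix of the required form (made-up responses, for illustration only) could be built like this:
tr <- cbind(s1.act   = c(1, 2, 1),
            s2.state = c(1, 2, 2),
            t        = c(1, 0, 1))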
Value
When xtdo = FALSE, returns a list containing these components:
out - Matrix of response probabilities, for each stage 1 action on each trial.
q1.mf - A vector of final model-free Q values for the actions at stage 1.
q1.mb - A vector of final model-based Q values for the actions at stage 1.
q2 - A vector of final Q values for the states at stage 2 (the MB and MF systems share common Q values at stage 2).
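Assuming the st and tr objects sketched above, a basic call and inspection of the output looks like this:
library(catlearn)
res <- slpMBMF(st, tr)
res$out     # response probabilities for each stage 1 action, one row per trial
res$q1.mf   # final model-free stage 1 Q values
res$q1.mb   # final model-based stage 1 Q values
res$q2      # final stage 2 Q values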
When xtdo = TRUE, the list also contains the following model-state information:
xout - A matrix containing the state of the model at the end of each trial. Each row is one trial. It has the following columns:
q1.mb.1, q1.mb.2, ... - One column for each model-based Q value at stage 1.
q1.mf.1, q1.mf.2, ... - One column for each model-free Q value at stage 1.
q2.1, q2.2, ... - One column for each Q value at stage 2.
q1.h.1, q1.h.2, ... - One column for each hybrid Q value at stage 1.
s1.d.mf - Model-free delta at stage 2, with respect to the stage 1 action.
s2.d.mf - Model-free delta at outcome.
In addition, when xtdo = TRUE, the list contains the following information, which is not used by the model but might be handy as potential neural regressors:
s1.d.mb - Model-based delta at stage 2, with respect to the stage 1 action.
s1.d.h - Hybrid delta (based on stage 1 hybrid Q values) at stage 2, with respect to the stage 1 action.
s1.d.diff - The difference s1.d.mf - s1.d.mb.
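For example, trial-by-trial prediction errors can be extracted from xout for use as regressors (a sketch, assuming the st and tr objects above and the column names listed here):
res.x <- slpMBMF(st, tr, xtdo = TRUE)
mf.pe <- res.x$xout[, "s2.d.mf"]    # model-free prediction error at outcome, per trial
mb.d  <- res.x$xout[, "s1.d.mb"]    # model-based delta at stage 2, per trial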
Note
1. Gillan et al.'s (2015) choice rule, at least as stated in their supplementary materials, would lead to the response probabilities being infinite on switch trials, which is presumably an error. The current implementation uses Daw et al. (2011, suppl. mat., Eq. 2).
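For reference, the Daw et al. (2011) rule is a softmax over the hybrid stage 1 Q values with a perseveration bonus; a minimal sketch (not the package source) is:
## q1.h: hybrid stage 1 Q values; beta: stochasticity; p: perseveration;
## rep.a: indicator vector, 1 for the action chosen on the previous trial, else 0
choice.prob <- function(q1.h, beta, p, rep.a) {
  num <- exp(beta * (q1.h + p * rep.a))
  num / sum(num)
}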
2. Gillan et al. (2015) decay Q values for unselected actions by (1-alpha). This is not part of the current implementation.
3. In the current implementation of the model, x must be 2 and y must be 2; otherwise the model will fail or behave unpredictably. If you'd like to develop a more general version of this implementation, contact the author.
Author(s)
Andy Wills ( andy@willslab.co.uk ), Tom Sambrook
References
Daw, N.D., Gershman, S.J., Seymour, B., Dayan, P., & Dolan, R.J. (2011). Model-based influences on humans' choices and striatal prediction errors. Neuron, 69, 1204-1215.
Gillan, C.M., Otto, A.R., Phelps, E.A. & Daw, N.D. (2015). Model-based learning protects against forming habits. Cogn. Affect. Behav. Neurosci., 15, 523-536.
Wills, A.J., O'Connell, G., Edmunds, C.E.R., & Inkster, A.B. (2017). Progress in modeling through distributed collaboration: Concepts, tools, and category-learning examples. Psychology of Learning and Motivation, 66, 79-115.