simulate_MDP {pomdp}                R Documentation

Simulate Trajectories in an MDP

Description

Simulate trajectories through an MDP. The start state for each trajectory is drawn randomly from the specified start-state distribution. Actions are chosen following an epsilon-greedy policy, and the state is then updated according to the model's transition probabilities.

Usage

simulate_MDP(
  model,
  n = 100,
  start = NULL,
  horizon = NULL,
  epsilon = NULL,
  delta_horizon = 0.001,
  return_trajectories = FALSE,
  engine = "cpp",
  verbose = FALSE,
  ...
)

Arguments

model

an MDP model.

n

number of trajectories.

start

probability distribution over the states for choosing the starting states for the trajectories. Defaults to "uniform".

horizon

epochs end once an absorbing state is reached or after the maximal number of epochs specified via horizon. If NULL then the horizon for the model is used.

epsilon

the probability of taking a random action (epsilon-greedy policy). The default is 0 for solved models and 1 for unsolved models.

delta_horizon

precision used to determine the horizon for infinite-horizon problems.

return_trajectories

logical; return the complete trajectories.

engine

'cpp' or 'r' to perform simulation using a faster C++ or a native R implementation.

verbose

report used parameters.

...

further arguments are ignored.

Details

A native R implementation is available (engine = 'r') and the default is a faster C++ implementation (engine = 'cpp').

Both implementations support parallel execution using the package foreach. To enable parallel execution, a parallel backend like doParallel needs to be registered (see doParallel::registerDoParallel()). Note that small simulations are slower when parallelized; therefore, C++ simulations with n * horizon less than 100,000 are always executed using a single worker.
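
A minimal sketch of a parallel run is shown below. It assumes the doParallel package is installed; the Maze model and the values of n and horizon are chosen only so that n * horizon exceeds the single-worker threshold.

library("pomdp")
data(Maze)
sol <- solve_MDP(Maze, discount = 1)

# register a parallel backend with two workers
doParallel::registerDoParallel(cores = 2)

# n * horizon >= 100,000, so the C++ engine distributes the
# trajectories across the registered workers
sim <- simulate_MDP(sol, n = 2000, horizon = 100, verbose = TRUE)

# switch back to sequential execution
foreach::registerDoSEQ()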

Value

A list with elements including:

reward

the reward for each simulated trajectory.

action_cnt

counts of how often each action was chosen during the simulation.

trajectories

a data.frame with the simulated trajectories (only included if return_trajectories = TRUE).

Author(s)

Michael Hahsler

See Also

Other MDP: MDP(), MDP2POMDP, MDP_policy_functions, accessors, actions(), add_policy(), gridworld, reachable_and_absorbing, regret(), solve_MDP(), transition_graph(), value_function()

Examples

# enable parallel simulation 
# doParallel::registerDoParallel()

data(Maze)

# solve the MDP with no discounting
sol <- solve_MDP(Maze, discount = 1)
sol

# U in the policy is an estimate of the utility of being in a state when using the optimal policy.
policy(sol)
gridworld_matrix(sol, what = "action")

## Example 1: simulate 100 trajectories following the policy;
#             only summary statistics are returned
sim <- simulate_MDP(sol, n = 100, horizon = 10, verbose = TRUE)
sim

# Note that all simulations start at s_1 and that the simulated avg. reward
# is therefore an estimate of the U value for the start state s_1.
policy(sol)[1,]

# Calculate proportion of actions taken in the simulation
round_stochastic(sim$action_cnt / sum(sim$action_cnt), 2)

# reward distribution
hist(sim$reward)

## Example 2: simulate trajectories starting from a uniform distribution
#             over all states and return the complete trajectories
sim <- simulate_MDP(sol, n = 100, start = "uniform", horizon = 10, 
  return_trajectories = TRUE)
head(sim$trajectories)   
  
# how often was each state visited?
table(sim$trajectories$s)
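
## Example 3: a sketch of simulating the unsolved model. Without a policy,
#             epsilon defaults to 1, so actions are chosen uniformly at
#             random; this gives a baseline to compare with Example 1.
sim_rand <- simulate_MDP(Maze, n = 100, horizon = 10, verbose = TRUE)

# reward distribution under random actions
hist(sim_rand$reward)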
