simulate_MDP {pomdp}    R Documentation
Simulate Trajectories in an MDP
Description
Simulate trajectories through an MDP. The start state for each trajectory is randomly chosen using the specified start state distribution. Actions are chosen following an epsilon-greedy policy and the state is then updated using the transition model.
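Conceptually, each simulation step chooses an action epsilon-greedily and then samples the next state from the transition model. The following is a minimal sketch of one such step in plain R; it is illustrative only and does not mirror the package internals (the helper function and the policy_action/trans_prob arguments are hypothetical).

# Illustrative sketch of one epsilon-greedy simulation step
# (hypothetical helper, not part of the pomdp package).
epsilon_greedy_step <- function(s, policy_action, actions, trans_prob, epsilon) {
  # with probability epsilon choose a random action, otherwise follow the policy
  a <- if (runif(1) < epsilon) sample(actions, 1) else policy_action[[s]]
  # sample the next state from the transition distribution P(s' | s, a)
  p <- trans_prob[[a]][s, ]
  s_prime <- sample(names(p), 1, prob = p)
  list(a = a, s_prime = s_prime)
}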
Usage
simulate_MDP(
model,
n = 100,
start = NULL,
horizon = NULL,
epsilon = NULL,
delta_horizon = 0.001,
return_trajectories = FALSE,
engine = "cpp",
verbose = FALSE,
...
)
Arguments
model: an MDP model.
n: number of trajectories.
start: probability distribution over the states used to choose the starting state for each trajectory. Defaults to "uniform".
horizon: epochs end once an absorbing state is reached or after the maximal number of epochs specified via horizon.
epsilon: the probability of choosing a random action for the epsilon-greedy policy. Defaults to 0 for solved models and to 1 for unsolved models (see the sketch after this list).
delta_horizon: precision used to determine the horizon for infinite-horizon problems.
return_trajectories: logical; return the complete trajectories.
engine: "cpp" or "r" to run the faster C++ implementation or the native R implementation.
verbose: report the used parameters.
...: further arguments are ignored.
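For example, an unsolved model can be passed directly; epsilon then defaults to 1 and actions are chosen uniformly at random. A minimal sketch using the Maze data shipped with the package:

data(Maze)
# unsolved model: epsilon defaults to 1, i.e., a purely random policy
sim_rand <- simulate_MDP(Maze, n = 10, horizon = 10, verbose = TRUE)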
Details
A native R implementation is available (engine = 'r') and the default is a faster C++ implementation (engine = 'cpp').
Both implementations support parallel execution using the package foreach. To enable parallel execution, a parallel backend like doParallel needs to be available and registered (see doParallel::registerDoParallel()).
Note that small simulations are slower when parallelized. Therefore, C++ simulations with n * horizon less than 100,000 are always executed using a single worker.
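A sketch of registering a parallel backend before running a larger simulation (assumes the doParallel package is installed; the worker count and simulation size are arbitrary choices for illustration):

library(doParallel)
registerDoParallel(cores = 2)

data(Maze)
sol <- solve_MDP(Maze, discount = 1)

# n * horizon = 200,000 >= 100,000, so the C++ engine distributes work over the workers
sim <- simulate_MDP(sol, n = 20000, horizon = 10)

stopImplicitCluster()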
Value
A list with elements:
- avg_reward: The average discounted reward.
- reward: Reward for each trajectory.
- action_cnt: Action counts.
- state_cnt: State counts.
- trajectories: A data.frame with the trajectories. Each row contains the episode id, the time step, the state s, the chosen action a, the reward r, and the next state s_prime. Trajectories are only returned for return_trajectories = TRUE.
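A small sketch of inspecting the returned list (the comparison of avg_reward with the mean of the per-trajectory rewards is an expectation, since avg_reward averages those rewards):

data(Maze)
sol <- solve_MDP(Maze, discount = 1)
sim <- simulate_MDP(sol, n = 100, horizon = 10)

sim$avg_reward        # average discounted reward
mean(sim$reward)      # per-trajectory rewards; their mean should match avg_reward
sim$action_cnt        # how often each action was chosen
sim$state_cnt         # how often each state was visited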
Author(s)
Michael Hahsler
See Also
Other MDP: MDP(), MDP2POMDP, MDP_policy_functions, accessors, actions(), add_policy(), gridworld, reachable_and_absorbing, regret(), solve_MDP(), transition_graph(), value_function()
Examples
# enable parallel simulation
# doParallel::registerDoParallel()
data(Maze)
# solve the MDP with no discounting
sol <- solve_MDP(Maze, discount = 1)
sol
# U in the policy is an estimate of the utility of being in a state when following the optimal policy.
policy(sol)
gridworld_matrix(sol, what = "action")
## Example 1: simulate 100 trajectories following the policy;
# without return_trajectories, only summary statistics are returned
sim <- simulate_MDP(sol, n = 100, horizon = 10, verbose = TRUE)
sim
# Note that all simulations start at s_1 and that the simulated avg. reward
# is therefore an estimate of the U value for the start state s_1.
policy(sol)[1,]
# Calculate proportion of actions taken in the simulation
round_stochastic(sim$action_cnt / sum(sim$action_cnt), 2)
# reward distribution
hist(sim$reward)
## Example 2: simulate trajectories starting from a uniform distribution over
# all states and return the complete trajectories
sim <- simulate_MDP(sol, n = 100, start = "uniform", horizon = 10,
return_trajectories = TRUE)
head(sim$trajectories)
# how often was each state visited?
table(sim$trajectories$s)
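# The aggregated state counts returned by the simulation give the same
# information without storing the trajectories (a small addition to the
# example above):
round_stochastic(sim$state_cnt / sum(sim$state_cnt), 2)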