simulate_POMDP {pomdp}    R Documentation
Simulate Trajectories in a POMDP
Description
Simulate trajectories through a POMDP. The start state for each trajectory is randomly chosen using the specified belief. The belief is used to choose actions from an epsilon-greedy policy and is then updated using the observations.
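For intuition, the following is a minimal sketch of an epsilon-greedy action choice (illustrative only; the action names and the greedy action are assumed placeholders, not the package's implementation):

# with probability epsilon pick a random action, otherwise the greedy action
epsilon <- 0.1
actions <- c("listen", "open-left", "open-right")  # e.g., the Tiger problem's actions
greedy_action <- "listen"                          # assumed best action for the current belief
if (runif(1) < epsilon) sample(actions, 1) else greedy_action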
Usage
simulate_POMDP(
model,
n = 1000,
belief = NULL,
horizon = NULL,
epsilon = NULL,
delta_horizon = 0.001,
digits = 7L,
return_beliefs = FALSE,
return_trajectories = FALSE,
engine = "cpp",
verbose = FALSE,
...
)
Arguments
model: a POMDP model.

n: number of trajectories.

belief: probability distribution over the states used to choose the start states for the trajectories. Defaults to the start belief state specified in the model or "uniform".

horizon: number of epochs for the simulation. If NULL, the horizon of finite-horizon problems is used; for infinite-horizon problems a horizon is calculated using delta_horizon (see Details).

epsilon: the probability of random actions used for the epsilon-greedy policy. Defaults to 0 for solved models and to 1 for unsolved models.

delta_horizon: precision used to determine the simulation horizon for infinite-horizon problems.

digits: number of digits used to round the probabilities of the belief points.

return_beliefs: logical; return all visited belief states? This requires memory proportional to n x horizon.

return_trajectories: logical; return the simulated trajectories as a data.frame?

engine: "cpp" or "r" to select the faster C++ implementation or the native R implementation (see Details).

verbose: logical; report the used parameters?

...: further arguments are ignored.
Details
Simulates n trajectories.

If no simulation horizon is specified, the horizon of finite-horizon problems is used. For infinite-horizon problems with \gamma < 1, the simulation horizon T is chosen such that the worst-case error is no more than \delta_\text{horizon}. That is

\gamma^T \frac{R_\text{max}}{1-\gamma} \le \delta_\text{horizon},

where R_\text{max} is the largest possible absolute reward value, used as a perpetuity starting after T.
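For illustration, this bound can be solved for T directly (a sketch with arbitrary values for the discount factor, R_\text{max}, and \delta_\text{horizon}; not the package's internal code):

# solve gamma^T * R_max / (1 - gamma) <= delta_horizon for the horizon T
gamma <- 0.95            # assumed discount factor
R_max <- 10              # assumed largest absolute reward
delta_horizon <- 0.001
T_sim <- ceiling(log(delta_horizon * (1 - gamma) / R_max) / log(gamma))
T_sim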
A native R implementation (engine = 'r') and a faster C++ implementation (engine = 'cpp') are available. Currently, only the R implementation supports multi-episode problems.

Both implementations support simulating trajectories in parallel using the package foreach. To enable parallel execution, a parallel backend like doParallel needs to be registered (see doParallel::registerDoParallel()). Note that small simulations are slower with parallelization; C++ simulations with n * horizon less than 100,000 are always executed using a single worker.
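For illustration, a minimal sketch of registering and later releasing a parallel backend (the number of cores is arbitrary, and sol stands for a solved POMDP as in the Examples below):

# doParallel::registerDoParallel(cores = 2)
# foreach::getDoParWorkers()              # check the number of registered workers
# sim <- simulate_POMDP(sol, n = 10000)   # large simulations may now run in parallel
# doParallel::stopImplicitCluster()       # release the workers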
Value
A list with elements:

- avg_reward: the average discounted reward.
- action_cnt: action counts.
- state_cnt: state counts.
- reward: the reward for each trajectory.
- belief_states: a matrix with the visited belief states as rows.
- trajectories: a data.frame with the episode id, the time, the state of the simulation (simulation_state), the id of the used alpha vector given the current belief (see belief_states above), the action a, and the reward r.
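For orientation, a minimal sketch of accessing these elements (assuming sim is a result obtained with return_beliefs = TRUE and return_trajectories = TRUE):

# sim$avg_reward           # average discounted reward
# sim$action_cnt           # how often each action was chosen
# head(sim$belief_states)  # visited belief points as rows
# head(sim$trajectories)   # per-step records: episode, time, simulation_state, a, r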
Author(s)
Michael Hahsler
See Also
Other POMDP: MDP2POMDP, POMDP(), accessors, actions(), add_policy(), plot_belief_space(), projection(), reachable_and_absorbing, regret(), sample_belief_space(), solve_POMDP(), solve_SARSOP(), transition_graph(), update_belief(), value_function(), write_POMDP()
Examples
data(Tiger)
# solve the POMDP for 5 epochs and no discounting
sol <- solve_POMDP(Tiger, horizon = 5, discount = 1, method = "enum")
sol
policy(sol)
# uncomment the following lines to register a parallel backend for simulation
# (needs the package doParallel installed)
# doParallel::registerDoParallel()
# foreach::getDoParWorkers()
## Example 1: simulate 100 trajectories
sim <- simulate_POMDP(sol, n = 100, verbose = TRUE)
sim
# calculate the percentage that each action is used in the simulation
round_stochastic(sim$action_cnt / sum(sim$action_cnt), 2)
# reward distribution
hist(sim$reward)
## Example 2: look at the belief states and the trajectories starting with
# an initial start belief.
sim <- simulate_POMDP(sol, n = 100, belief = c(.5, .5),
return_beliefs = TRUE, return_trajectories = TRUE)
head(sim$belief_states)
head(sim$trajectories)
# plot with added density (the x-axis is the probability of the second belief state)
plot_belief_space(sol, sample = sim$belief_states, jitter = 2, ylim = c(0, 6))
lines(density(sim$belief_states[, 2], bw = .02)); axis(2); title(ylab = "Density")
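# a small additional sketch: tabulate the simulated actions using the
# trajectories returned above (column a, see the Value section)
table(sim$trajectories$a)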
## Example 3: simulate trajectories for an unsolved POMDP which uses an epsilon of 1
# (i.e., all actions are randomized). The simulation horizon for the
# infinite-horizon Tiger problem is calculated using delta_horizon.
sim <- simulate_POMDP(Tiger, return_beliefs = TRUE, verbose = TRUE)
sim$avg_reward
hist(sim$reward, breaks = 20)
plot_belief_space(sol, sample = sim$belief_states, jitter = 2, ylim = c(0, 6))
lines(density(sim$belief_states[, 1], bw = .05)); axis(2); title(ylab = "Density")