simulate_MDP {pomdp}    R Documentation
Simulate Trajectories in an MDP
Description
Simulate trajectories through an MDP. The start state for each trajectory is randomly chosen using the specified start state distribution. Actions are chosen following an epsilon-greedy policy and the state is then updated using the transition model.
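Conceptually, each simulation step chooses an action epsilon-greedily and then samples the next state from the transition model. The following is a minimal sketch of one such step in plain R; it is illustrative only and does not mirror the package internals (the helper function and the policy_action/trans_prob arguments are hypothetical).

# Illustrative sketch of one epsilon-greedy simulation step
# (hypothetical helper, not part of the pomdp package).
epsilon_greedy_step <- function(s, policy_action, actions, trans_prob, epsilon) {
  # with probability epsilon choose a random action, otherwise follow the policy
  a <- if (runif(1) < epsilon) sample(actions, 1) else policy_action[[s]]
  # sample the next state from the transition distribution P(s' | s, a)
  p <- trans_prob[[a]][s, ]
  s_prime <- sample(names(p), 1, prob = p)
  list(a = a, s_prime = s_prime)
}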
Usage
simulate_MDP(
model,
n = 100,
start = NULL,
horizon = NULL,
epsilon = NULL,
delta_horizon = 0.001,
return_trajectories = FALSE,
engine = "cpp",
verbose = FALSE,
...
)
Arguments
model: an MDP model.
n: number of trajectories.
start: probability distribution over the states used to choose the starting state for each trajectory. Defaults to "uniform".
horizon: epochs end once an absorbing state is reached or after the maximal number of epochs specified via horizon.
epsilon: the probability of choosing a random action for the epsilon-greedy policy. Defaults to 0 for solved models and to 1 for unsolved models (see the sketch after this list).
delta_horizon: precision used to determine the horizon for infinite-horizon problems.
return_trajectories: logical; return the complete trajectories.
engine: "cpp" or "r" to run the faster C++ implementation or the native R implementation.
verbose: report the used parameters.
...: further arguments are ignored.
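For example, an unsolved model can be passed directly; epsilon then defaults to 1 and actions are chosen uniformly at random. A minimal sketch using the Maze data shipped with the package:

data(Maze)
# unsolved model: epsilon defaults to 1, i.e., a purely random policy
sim_rand <- simulate_MDP(Maze, n = 10, horizon = 10, verbose = TRUE)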
Details
A native R implementation is available (engine = 'r') and the default is a faster C++ implementation (engine = 'cpp').
Both implementations support parallel execution using the package foreach. To enable parallel execution, a parallel backend like doParallel needs to be available and registered (see doParallel::registerDoParallel()).
Note that small simulations are slower when parallelized. Therefore, C++ simulations with n * horizon less than 100,000 are always executed using a single worker.
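A sketch of registering a parallel backend before running a larger simulation (assumes the doParallel package is installed; the worker count and simulation size are arbitrary choices for illustration):

library(doParallel)
registerDoParallel(cores = 2)

data(Maze)
sol <- solve_MDP(Maze, discount = 1)

# n * horizon = 200,000 >= 100,000, so the C++ engine distributes work over the workers
sim <- simulate_MDP(sol, n = 20000, horizon = 10)

stopImplicitCluster()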
Value
A list with elements:
- avg_reward: The average discounted reward.
- reward: Reward for each trajectory.
- action_cnt: Action counts.
- state_cnt: State counts.
- trajectories: A data.frame with the trajectories. Each row contains the episode id, the time step, the state s, the chosen action a, the reward r, and the next state s_prime. Trajectories are only returned for return_trajectories = TRUE.
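A small sketch of inspecting the returned list (the comparison of avg_reward with the mean of the per-trajectory rewards is an expectation, since avg_reward averages those rewards):

data(Maze)
sol <- solve_MDP(Maze, discount = 1)
sim <- simulate_MDP(sol, n = 100, horizon = 10)

sim$avg_reward        # average discounted reward
mean(sim$reward)      # per-trajectory rewards; their mean should match avg_reward
sim$action_cnt        # how often each action was chosen
sim$state_cnt         # how often each state was visited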
Author(s)
Michael Hahsler
See Also
Other MDP: MDP(), MDP2POMDP, MDP_policy_functions, accessors, actions(), add_policy(), gridworld, reachable_and_absorbing, regret(), solve_MDP(), transition_graph(), value_function()
Examples
# enable parallel simulation
# doParallel::registerDoParallel()
data(Maze)
# solve the MDP with no discounting
sol <- solve_MDP(Maze, discount = 1)
sol
# U in the policy is an estimate of the utility of being in a state when following the optimal policy.
policy(sol)
gridworld_matrix(sol, what = "action")
## Example 1: simulate 100 trajectories following the policy;
# without return_trajectories, only summary statistics are returned
sim <- simulate_MDP(sol, n = 100, horizon = 10, verbose = TRUE)
sim
# Note that all simulations start at s_1 and that the simulated avg. reward
# is therefore an estimate of the U value for the start state s_1.
policy(sol)[1,]
# Calculate proportion of actions taken in the simulation
round_stochastic(sim$action_cnt / sum(sim$action_cnt), 2)
# reward distribution
hist(sim$reward)
## Example 2: simulate trajectories starting from a uniform distribution over
# all states and return the complete trajectories
sim <- simulate_MDP(sol, n = 100, start = "uniform", horizon = 10,
return_trajectories = TRUE)
head(sim$trajectories)
# how often was each state visited?
table(sim$trajectories$s)
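# The aggregated state counts returned by the simulation give the same
# information without storing the trajectories (a small addition to the
# example above):
round_stochastic(sim$state_cnt / sum(sim$state_cnt), 2)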