solve_POMDP {pomdp} | R Documentation
Solve a POMDP Problem using pomdp-solver
Description
This function utilizes the C implementation of 'pomdp-solve' by Cassandra (2015) to solve problems that are formulated as partially observable Markov decision processes (POMDPs). The result is an optimal or approximately optimal policy.
Usage
solve_POMDP(
model,
horizon = NULL,
discount = NULL,
initial_belief = NULL,
terminal_values = NULL,
method = "grid",
digits = 7,
parameter = NULL,
timeout = Inf,
verbose = FALSE
)
solve_POMDP_parameter()
Arguments
model
a POMDP problem specification created with POMDP(). Alternatively, a POMDP file or the URL for a POMDP file can be specified.
horizon
an integer with the number of epochs for problems with a finite planning horizon. If set to Inf, the algorithm runs iterations until it converges to the infinite-horizon solution. If NULL, then the horizon specified in model will be used. For time-dependent POMDPs, a vector of horizons can be specified (see Details section).
discount
discount factor in range [0, 1]. If NULL, then the discount factor specified in model will be used.
initial_belief
An initial belief vector. If NULL, then the initial belief specified in model (as start) is used.
terminal_values
a vector with the terminal utility values for each state or a matrix specifying the terminal rewards via a terminal value function (e.g., the alpha components produced by solve_POMDP()). If NULL, then, if available, the terminal values specified in model will be used, or a vector of all 0s otherwise.
method
string; one of the following solution methods: "grid", "enum", "twopass", "witness", or "incprune". The default is "grid" implementing the finite grid method.
digits
precision used when writing POMDP files (see write_POMDP()).
parameter
a list with parameters passed on to the pomdp-solve program.
timeout
number of seconds for the solver to run.
verbose
logical; if set to TRUE, the function provides the output of the pomdp-solve program in the R console.
Details
Parameters: solve_POMDP_parameter() displays available solver parameter options.
Horizon: Infinite-horizon POMDPs (horizon = Inf) converge to a single policy graph. Finite-horizon POMDPs result in a policy tree of a depth equal to the smaller of the horizon or the number of epochs to convergence. The policy (and the associated value function) are stored in a list by epoch, with the policy for the first epoch stored as the first element. The horizon can also be used to limit the number of epochs used for value iteration.
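As an informal sketch of the two settings (using the Tiger model that ships with the package; see the Examples section for complete runs):
library("pomdp")
data("Tiger")
# infinite horizon: value iteration runs until it converges to a single policy graph
sol_inf <- solve_POMDP(Tiger, horizon = Inf)
# finite horizon: a policy tree with at most 3 epochs; the solution stores
# the policy and value function in a list by epoch (first epoch first)
sol_fin <- solve_POMDP(Tiger, horizon = 3, discount = 1, method = "incprune")
policy(sol_fin)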
Precision: The POMDP solver uses various epsilon values to control the precision for comparing alpha vectors to check for convergence and for solving LPs. The overall precision can be changed using parameter = list(epsilon = 1e-3).
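For example, a brief sketch that relaxes the overall solver precision for a quicker, coarser solution:
library("pomdp")
data("Tiger")
# pass epsilon through to pomdp-solve via the parameter list
sol <- solve_POMDP(Tiger, parameter = list(epsilon = 1e-3))
sol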
Methods: Several algorithms using exact value iteration are available:
- Enumeration (Sondik 1971).
- Two pass (Sondik 1971).
- Witness (Littman, Cassandra, and Kaelbling 1996).
- Incremental pruning (Zhang and Liu 1996; Cassandra et al. 1997).
In addition, the following approximate value iteration method is available:
- Grid: implements a variation of point-based value iteration (PBVI; see Pineau et al. 2003) to solve larger POMDPs without dynamic belief set expansion.
Details can be found in Cassandra (2015).
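The algorithm is chosen via the method argument; a brief sketch (method strings as listed under Arguments):
library("pomdp")
data("Tiger")
# exact value iteration with incremental pruning
sol_exact <- solve_POMDP(Tiger, method = "incprune")
# approximate point-based value iteration on a finite grid (the default method)
sol_grid  <- solve_POMDP(Tiger, method = "grid")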
Note on POMDP problem size: Finding optimal policies for POMDPs is known to be a prohibitively difficult problem because the belief space grows exponentially with the number of states. Therefore, exact algorithms can only be used for extremely small problems with only a few states. Typically, the researcher needs to simplify the problem description (fewer states, actions, and observations) and choose an approximate algorithm with an acceptable level of approximation to make the problem tractable.
Note on method grid: The finite grid method implements a version of Point-Based Value Iteration (PBVI). By default, the belief points are created from points that are reachable from the initial belief (start) by following all combinations of actions and observations. The grid size defaults to 10,000 points and can be set via parameter = list(fg_points = 100). Alternatively, different strategies can be chosen using the parameter fg_type. In this implementation, the user can also manually specify a grid of belief states by providing a matrix with belief states, as produced by sample_belief_space(), as the parameter grid.
To guarantee convergence in point-based (finite grid) value iteration, the initial value function must be a lower bound on the optimal value function. If all rewards are strictly non-negative, an initial value function with an all-zero vector can be used, and the results will be similar to those of the other methods. However, if there are negative rewards, a lower bound can be guaranteed by setting a single vector with the values min(reward)/(1 - discount). The value function is then guaranteed to converge to the true value function, but finite-horizon value functions will not be as expected. solve_POMDP() produces a warning in this case.
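The following sketch illustrates the lower-bound rule and a smaller grid; it assumes, as in the Tiger example, that the reward is stored as a data frame with a value column and that the model carries its own discount factor:
library("pomdp")
data("Tiger")
# lower bound for the initial value function: min(reward) / (1 - discount)
min_reward  <- min(Tiger$reward$value)          # assumes a 'value' column in the reward data frame
lower_bound <- min_reward / (1 - Tiger$discount)
lower_bound
# grid (PBVI) solution with a reduced number of belief points
sol <- solve_POMDP(Tiger, method = "grid", parameter = list(fg_points = 100))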
Time-dependent POMDPs: Time dependence of transition probabilities, observation probabilities, and the reward structure can be modeled by considering a set of episodes representing epochs with the same settings. In the scared tiger example (see Examples section), the tiger has the normal behavior for the first three epochs (episode 1) and then becomes scared with different transition probabilities for the next three epochs (episode 2). The episodes can be solved in reverse order, where the value function is used as the terminal values of the preceding episode. This can be done by specifying a vector of horizons (one horizon for each episode) and then lists with transition matrices, observation matrices, and rewards. If the horizon vector has names, then the lists also need to be named; otherwise, they have to be in the same order (the numeric index is used). Only the time-varying matrices need to be specified. An example can be found in Example 5 in the Examples section. The procedure can also be done by calling the solver multiple times (see Example 6).
Solution
Policy:
Each policy is a data frame where each row represents a policy graph node with an associated optimal action and a list of node IDs to go to depending on the observation (specified as the column names). For the finite-horizon case, the observation-specific node IDs refer to nodes in the next epoch, creating a policy tree. Impossible observations have NA as the next state.
Value function: The value function specifies the expected total reward over the belief space. The dimensionality of the belief space is n - 1, where n is the number of states. The value function is stored as a matrix. Each row is associated with a node (row) in the policy graph and represents the coefficients (alpha or V vector) of a hyperplane. It contains one value per state, which is the value for the belief state that has a probability of 1 for that state and 0s for all others.
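To make the hyperplane representation concrete, here is a sketch that evaluates the value function manually at a belief point (assuming the solution stores its alpha matrices in a list, indexed as in Example 6 below):
library("pomdp")
data("Tiger")
sol <- solve_POMDP(Tiger)
# alpha vectors: one row per policy graph node, one column per state
alpha <- sol$solution$alpha[[1]]
# the value of a belief is the maximum over all hyperplanes
b <- c(0.7, 0.3)
max(alpha %*% b)
# this should agree with the value reported by reward()
reward(sol, belief = b)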
Temporary Files
All temporary solver files are stored in the directory returned by tempdir().
Value
The solver returns an object of class POMDP which is a list with the model specification. Solved POMDPs also have an element called solution, which is a list, and the solver output (solver_output). The solution is a list that contains elements like:
- method: the used solver method.
- solver_output: the output of the solver program.
- converged: did the solution converge?
- initial_belief: the initial belief used.
- total_expected_reward: the total expected reward starting from the initial belief.
- pg, initial_pg_node: the policy graph (see Details section).
- alpha: the value function as hyperplanes representing the nodes in the policy graph (see Details section).
- belief_points_solver: optional; the belief points used by the solver.
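A short sketch of accessing some of these elements after solving (element names as listed above):
library("pomdp")
data("Tiger")
sol <- solve_POMDP(Tiger)
sol$solution$method                  # solver method used
sol$solution$converged               # did the solution converge?
sol$solution$total_expected_reward   # expected reward from the initial belief
sol$solution$pg                      # policy graph (see Details section)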
Author(s)
Hossein Kamalzadeh, Michael Hahsler
References
Cassandra, A. (2015). pomdp-solve: POMDP Solver Software, http://www.pomdp.org.
Sondik, E. (1971). The Optimal Control of Partially Observable Markov Processes. Ph.D. Dissertation, Stanford University.
Cassandra, A., Littman, M.L., Zhang, L. (1997). Incremental Pruning: A Simple, Fast, Exact Algorithm for Partially Observable Markov Decision Processes. UAI'97: Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, August 1997, pp. 54-61.
Monahan, G. E. (1982). A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science 28(1):1-16.
Littman, M. L.; Cassandra, A. R.; and Kaelbling, L. P. (1996). Efficient dynamic-programming updates in partially observable Markov decision processes. Technical Report CS-95-19, Brown University, Providence, RI.
Zhang, N. L., and Liu, W. (1996). Planning in stochastic domains: Problem characteristics and approximation. Technical Report HKUST-CS96-31, Department of Computer Science, Hong Kong University of Science and Technology.
Pineau, J., Gordon, G.J., Thrun, S.B. (2003). Point-based value iteration: an anytime algorithm for POMDPs. IJCAI'03: Proceedings of the 18th International Joint Conference on Artificial Intelligence, pp. 1025-1030.
See Also
Other policy: estimate_belief_for_nodes(), optimal_action(), plot_belief_space(), plot_policy_graph(), policy(), policy_graph(), projection(), reward(), solve_SARSOP(), value_function()
Other solver: solve_MDP(), solve_SARSOP()
Other POMDP: MDP2POMDP, POMDP(), accessors, actions(), add_policy(), plot_belief_space(), projection(), reachable_and_absorbing, regret(), sample_belief_space(), simulate_POMDP(), solve_SARSOP(), transition_graph(), update_belief(), value_function(), write_POMDP()
Examples
# display available solver options which can be passed on to pomdp-solve as parameters.
solve_POMDP_parameter()
################################################################
# Example 1: Solving the simple infinite-horizon Tiger problem
data("Tiger")
Tiger
# look at the model as a list
unclass(Tiger)
# inspect an individual field of the model (e.g., the transition probabilities and the reward)
Tiger$transition_prob
Tiger$reward
sol <- solve_POMDP(model = Tiger)
sol
# look at the solution
sol$solution
# policy (value function (alpha vectors), optimal action and observation dependent transitions)
policy(sol)
# plot the policy graph of the infinite-horizon POMDP
plot_policy_graph(sol)
# value function
plot_value_function(sol, ylim = c(0,20))
################################################################
# Example 2: Solve a problem specified as a POMDP file
# using a grid of size 20
sol <- solve_POMDP("http://www.pomdp.org/examples/cheese.95.POMDP",
method = "grid", parameter = list(fg_points = 20))
sol
policy(sol)
plot_policy_graph(sol)
# Example 3: Solving a finite-horizon POMDP using the incremental
# pruning method (without discounting)
sol <- solve_POMDP(model = Tiger,
horizon = 3, discount = 1, method = "incprune")
sol
# look at the policy tree
policy(sol)
plot_policy_graph(sol)
# Note: only open a door in epoch 3 if you got the same observation twice.
# Expected reward starting from the model's initial belief (uniform):
# listen twice and then open the door or listen 3 times
reward(sol)
# Expected reward for listening twice (-2) and then opening the right door (-1 + -1 + 10 = 8)
reward(sol, belief = c(1,0))
# Expected reward for just opening the right door (10)
reward(sol, belief = c(1,0), epoch = 3)
# Expected reward for just opening the right door (0.95 * 10 + 0.05 * -100 = 4.5)
reward(sol, belief = c(.95,.05), epoch = 3)
################################################################
# Example 4: Using terminal values (state-dependent utilities after the final epoch)
#
# Specify a terminal value of 1000 if the tiger is behind the right door after the 3 (horizon) epochs
sol <- solve_POMDP(model = Tiger,
horizon = 3, discount = 1, method = "incprune",
terminal_values = c(0, 1000))
sol
policy(sol)
# Note: The optimal strategy is to never open the left door. If we think the
# tiger is behind the right door, then we just wait for the final payout. If
# we think the tiger might be behind the left door, then we open the right
# door, are likely to get a small reward, and the tiger has a 50% chance of
# moving behind the right door. The second epoch is used to gather more
# information for the more important final action.
################################################################
# Example 5: Model time-dependent transition probabilities
# The tiger reacts normally for 3 epochs (goes randomly to one
# of the two doors when a door was opened). After 3 epochs it gets
# scared and, when a door is opened, it always goes to the other door.
# specify the horizon for each of the two different episodes
Tiger_time_dependent <- Tiger
Tiger_time_dependent$name <- "Scared Tiger Problem"
Tiger_time_dependent$horizon <- c(normal_tiger = 3, scared_tiger = 3)
Tiger_time_dependent$transition_prob <- list(
normal_tiger = list(
"listen" = "identity",
"open-left" = "uniform",
"open-right" = "uniform"),
scared_tiger = list(
"listen" = "identity",
"open-left" = rbind(c(0, 1), c(0, 1)),
"open-right" = rbind(c(1, 0), c(1, 0))
)
)
# Tiger_time_dependent (a higher value for verbose will show more messages)
sol <- solve_POMDP(model = Tiger_time_dependent, discount = 1,
method = "incprune", verbose = 1)
sol
policy(sol)
# Note: the default method to estimate the belief for nodes follows a trajectory,
# which uses only the first belief reached for each node. Random sampling
# can find a better estimate of the central belief of the segment (see nodes 4-1 to 6-3
# in the plots below).
plot_policy_graph(sol)
plot_policy_graph(sol, method = "random_sample")
################################################################
# Example 6: Alternative method to solve time-dependent POMDPs
# 1) create the scared tiger model
Tiger_scared <- Tiger
Tiger_scared$transition_prob <- list(
"listen" = "identity",
"open-left" = rbind(c(0, 1), c(0, 1)),
"open-right" = rbind(c(1, 0), c(1, 0))
)
# 2) Solve in reverse order. Scared tiger without terminal values first.
sol_scared <- solve_POMDP(model = Tiger_scared,
horizon = 3, discount = 1, method = "incprune")
sol_scared
policy(sol_scared)
# 3) Solve the regular tiger with the value function of the scared tiger as terminal values
sol <- solve_POMDP(model = Tiger,
horizon = 3, discount = 1, method = "incprune",
terminal_values = sol_scared$solution$alpha[[1]])
sol
policy(sol)
# Note: it is optimal to mostly listen until the tiger gets into the scared mood. Opening
# a door is only optimal if we are extremely sure in the first epoch.
################################################################
# Example 7: PBVI with a custom grid
# Create a search grid by sampling from the belief space in
# 10 regular intervals
custom_grid <- sample_belief_space(Tiger, n = 10, method = "regular")
head(custom_grid)
# Visualize the search grid
plot_belief_space(sol, sample = custom_grid)
# Solve the POMDP using the grid for approximation
sol <- solve_POMDP(Tiger, method = "grid", parameter = list(grid = custom_grid))
policy(sol)
plot_policy_graph(sol)
# Note that plot_policy_graph() automatically removes nodes that are unreachable from the
# initial node. This behavior can be switched off.
plot_policy_graph(sol, remove_unreachable_nodes = FALSE)