MDP_policy_functions {pomdp}    R Documentation
Functions for MDP Policies
Description
Implements several functions useful for working with MDP policies.
Usage
q_values_MDP(model, U = NULL)
MDP_policy_evaluation(
pi,
model,
U = NULL,
k_backups = 1000,
theta = 0.001,
verbose = FALSE
)
greedy_MDP_action(s, Q, epsilon = 0, prob = FALSE)
random_MDP_policy(model, prob = NULL)
manual_MDP_policy(model, actions)
greedy_MDP_policy(Q)
Arguments
model: an MDP problem specification.

U: a vector with the value function representing the state utilities (expected sum of discounted rewards from that point on). If model is a solved model, the state utilities are taken from the solution.

pi: a policy as a data.frame with at least columns for the state and the action.

k_backups: number of look-ahead steps used for approximate policy evaluation (also used by the policy iteration method). Set k_backups to Inf to use only theta as the stopping criterion.

theta: stop when the largest change in a state value is less than theta.

verbose: logical; should progress and approximation errors be printed?

s: a state.

Q: an action value function with Q-values as a state by action matrix.

epsilon: probability of choosing a random action instead of the greedy action (epsilon-greedy action selection); epsilon = 0 always returns the greedy action.

prob: for random_MDP_policy(), a probability vector used to sample the random actions; for greedy_MDP_action(), a logical indicating whether action probabilities should be returned instead of a single action.

actions: a vector with the action (either the action label or the numeric id) for each state.
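To make the pi and actions arguments concrete, here is a minimal sketch of the expected shapes; the state and action labels below are made up for illustration and are not part of the package API:

# a policy data.frame needs (at least) one row per state with the
# columns 'state' and 'action' (labels below are made up)
pi_example <- data.frame(
  state  = c("s(1,1)", "s(1,2)", "s(1,3)"),
  action = c("up", "right", "right")
)
# the 'actions' argument of manual_MDP_policy() is a vector with one
# action per state, here named by state as in the Examples below
acts_example <- setNames(pi_example$action, pi_example$state)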
Details
Implemented functions are:
- q_values_MDP() calculates (approximates) Q-values for a given model using the Bellman optimality equation:

  q(s,a) = \sum_{s'} T(s'|s,a) [R(s,a) + \gamma U(s')]

  Q-values can be used as the input for several other functions.

- MDP_policy_evaluation() evaluates a policy \pi for a model and returns (approximate) state values by applying the Bellman equation as an update rule for each state and iteration k:

  U_{k+1}(s) = \sum_a \pi(a|s) \sum_{s'} T(s'|s,a) [R(s,a) + \gamma U_k(s')]

  In each iteration, all states are updated. Updating is stopped after k_backups iterations or after the largest update ||U_{k+1} - U_k||_\infty < \theta (both update rules are illustrated in the sketch after this list).

- greedy_MDP_action() returns the action with the largest Q-value given a state.

- random_MDP_policy(), manual_MDP_policy(), and greedy_MDP_policy() generate different policies. These policies can be added to a problem using add_policy().
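The two update equations above can be written directly in base R. The following is a minimal sketch for a tiny, hand-made MDP; the toy transition array Tr, reward matrix R, discount gamma, and helper names q_from_U() and evaluate_pi() are illustrative assumptions, not part of the package API:

# toy MDP with 2 states and 2 actions (all numbers are made up)
S <- c("s1", "s2"); A <- c("a1", "a2")
gamma <- 0.9
# Tr[to, from, action]: transition probabilities (each column sums to 1)
Tr <- array(c(0.8, 0.2,  0.1, 0.9,   # action a1: from s1, from s2
              0.5, 0.5,  0.0, 1.0),  # action a2: from s1, from s2
            dim = c(2, 2, 2), dimnames = list(S, S, A))
# R[state, action]: immediate reward
R <- matrix(c(1, 0, 0, 2), nrow = 2, dimnames = list(S, A))

# Q(s,a) = sum_{s'} Tr(s'|s,a) * (R(s,a) + gamma * U(s'))
q_from_U <- function(U)
  sapply(A, function(a) sapply(S, function(s)
    sum(Tr[, s, a] * (R[s, a] + gamma * U))))

# iterative policy evaluation for a deterministic policy
# given as a named vector (state -> action)
evaluate_pi <- function(pi, k_backups = 1000, theta = 1e-3) {
  U <- setNames(numeric(length(S)), S)
  for (k in seq_len(k_backups)) {
    U_new <- sapply(S, function(s)
      sum(Tr[, s, pi[s]] * (R[s, pi[s]] + gamma * U)))
    converged <- max(abs(U_new - U)) < theta
    U <- U_new
    if (converged) break
  }
  U
}

pi <- c(s1 = "a1", s2 = "a2")
U  <- evaluate_pi(pi)
q_from_U(U)  # a state by action matrix; its row-wise argmax is the greedy policy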
Value
q_values_MDP() returns a state by action matrix specifying the Q-function, i.e., the action value for executing each action in each state. The Q-values are calculated from the value function (U) and the transition model.

MDP_policy_evaluation() returns a vector with (approximate) state values (U).
greedy_MDP_action() returns the action with the highest Q-value for state s. If prob = TRUE, then a vector with the probability for each action is returned.
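The probabilities returned with prob = TRUE follow epsilon-greedy action selection: with epsilon = 0 all mass is on the greedy action; with epsilon > 0 the other actions also receive positive probability. A minimal sketch of the standard calculation (the helper eps_greedy_probs() is made up for illustration; the package's exact handling of ties may differ):

# epsilon-greedy probabilities for one row of a Q-matrix:
# every action gets epsilon/|A|, the greedy action gets the remaining 1 - epsilon
eps_greedy_probs <- function(q_row, epsilon = 0) {
  p <- rep(epsilon / length(q_row), length(q_row))
  greedy <- which.max(q_row)
  p[greedy] <- p[greedy] + (1 - epsilon)
  setNames(p, names(q_row))
}

eps_greedy_probs(c(up = 0.2, right = 0.7, down = 0.1, left = 0.0), epsilon = 0.1)
# right: 0.925, all other actions: 0.025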
random_MDP_policy() and manual_MDP_policy() return a data.frame with the columns state and action to define a policy.

greedy_MDP_policy() returns the greedy policy given Q.
Author(s)
Michael Hahsler
References
Sutton, R. S., Barto, A. G. (2020). Reinforcement Learning: An Introduction. Second edition. The MIT Press.
See Also
Other MDP: MDP(), MDP2POMDP, accessors, actions(), add_policy(), gridworld, reachable_and_absorbing, regret(), simulate_MDP(), solve_MDP(), transition_graph(), value_function()
Examples
data(Maze)
Maze
# create several policies:
# 1. optimal policy using value iteration
maze_solved <- solve_MDP(Maze, method = "value_iteration")
maze_solved
pi_opt <- policy(maze_solved)
pi_opt
gridworld_plot_policy(add_policy(Maze, pi_opt), main = "Optimal Policy")
# 2. a manual policy (go up and in some squares to the right)
acts <- rep("up", times = length(Maze$states))
names(acts) <- Maze$states
acts[c("s(1,1)", "s(1,2)", "s(1,3)")] <- "right"
pi_manual <- manual_MDP_policy(Maze, acts)
pi_manual
gridworld_plot_policy(add_policy(Maze, pi_manual), main = "Manual Policy")
# 3. a random policy
set.seed(1234)
pi_random <- random_MDP_policy(Maze)
pi_random
gridworld_plot_policy(add_policy(Maze, pi_random), main = "Random Policy")
# 4. an improved policy based on one policy evaluation and
# policy improvement step.
u <- MDP_policy_evaluation(pi_random, Maze)
q <- q_values_MDP(Maze, U = u)
pi_greedy <- greedy_MDP_policy(q)
pi_greedy
gridworld_plot_policy(add_policy(Maze, pi_greedy), main = "Greedy Policy")
# compare the approx. value functions for the policies (we restrict
# the number of backups for the random policy since it may not converge)
rbind(
random = MDP_policy_evaluation(pi_random, Maze, k_backups = 100),
manual = MDP_policy_evaluation(pi_manual, Maze),
greedy = MDP_policy_evaluation(pi_greedy, Maze),
optimal = MDP_policy_evaluation(pi_opt, Maze)
)
# For many functions, we first add the policy to the problem description
# to create a "solved" MDP
maze_random <- add_policy(Maze, pi_random)
maze_random
# plotting
plot_value_function(maze_random)
gridworld_plot_policy(maze_random)
# compare to a benchmark
regret(maze_random, benchmark = maze_solved)
# calculate greedy actions for state 1
q <- q_values_MDP(maze_random)
q
greedy_MDP_action(1, q, epsilon = 0, prob = FALSE)
greedy_MDP_action(1, q, epsilon = 0, prob = TRUE)
greedy_MDP_action(1, q, epsilon = .1, prob = TRUE)