solve_MDP {pomdp}    R Documentation

Solve an MDP Problem

Description

Implementation of value iteration, modified policy iteration, and other methods based on reinforcement learning techniques to solve MDPs with a finite state space.

Usage

solve_MDP(model, method = "value_iteration", ...)

solve_MDP_DP(
  model,
  method = "value_iteration",
  horizon = NULL,
  discount = NULL,
  N_max = 1000,
  error = 0.01,
  k_backups = 10,
  U = NULL,
  verbose = FALSE
)

solve_MDP_TD(
  model,
  method = "q_learning",
  horizon = NULL,
  discount = NULL,
  alpha = 0.5,
  epsilon = 0.1,
  N = 100,
  U = NULL,
  verbose = FALSE
)

Arguments

model

an MDP problem specification.

method

string; one of the following solution methods: 'value_iteration', 'policy_iteration', 'q_learning', 'sarsa', or 'expected_sarsa'.

...

further parameters are passed on to the solver function.

horizon

an integer with the number of epochs for problems with a finite planning horizon. If set to Inf, the algorithm continues running iterations until it converges to the infinite-horizon solution. If NULL, then the horizon specified in model will be used.

discount

discount factor in range (0, 1]. If NULL, then the discount factor specified in model will be used.

N_max

maximum number of iterations allowed for convergence. If the maximum is reached, then the non-converged solution is returned with a warning.

error

value iteration: maximum error allowed in the utility of any state (i.e., the maximum policy loss) used as the termination criterion.

k_backups

policy iteration: number of look-ahead steps used for approximate policy evaluation.

U

a vector with initial utilities used for each state. If NULL, then the default of a vector of all 0s is used.

verbose

logical; if TRUE, the function displays the solver output in the R console.

alpha

step size in (0, 1].

epsilon

exploration parameter used for epsilon-greedy policies.

N

number of episodes used for learning.

Details

Implemented are the following dynamic programming methods (following Russell and Norvig, 2021):

- Value Iteration
- Modified Policy Iteration

Note that the policy converges earlier than the value function.
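
A minimal sketch of how the termination parameters interact (using the Maze example data also used in the Examples below; the error and N_max values are chosen purely for illustration):

library("pomdp")
data(Maze)

# value iteration with a tighter error bound and a cap on the number of iterations
maze_vi <- solve_MDP_DP(Maze, method = "value_iteration", error = 1e-6, N_max = 5000)
policy(maze_vi)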

Implemented are the following temporal difference control methods described in Sutton and Barto (2020):

- Q-Learning
- Sarsa
- Expected Sarsa

Note that the MDP transition and reward models are only used to simulate the environment for these reinforcement learning methods. The algorithms use a step size parameter alpha (learning rate) for the updates and the exploration parameter epsilon for the epsilon-greedy policy.
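
A minimal sketch of passing the learning rate and exploration parameter to the TD solver (again using the Maze data; the alpha, epsilon, and N values are illustrative, not recommended settings):

library("pomdp")
data(Maze)

# Sarsa with a smaller step size and more exploration
maze_sarsa <- solve_MDP_TD(Maze, method = "sarsa", alpha = 0.2, epsilon = 0.2, N = 200)
policy(maze_sarsa)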

If the model has absorbing states to terminate episodes, then no maximal episode length (horizon) needs to be specified. To make sure that the algorithm finishes in a reasonable amount of time, episodes are stopped after 10,000 actions with a warning. For models without absorbing states, an episode length has to be specified via horizon.
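
A short sketch of capping the episode length explicitly (Maze does have absorbing states, so the horizon here merely limits the length of each simulated episode):

library("pomdp")
data(Maze)

# stop each learning episode after at most 100 actions
maze_ql <- solve_MDP(Maze, method = "q_learning", horizon = 100, N = 100)
policy(maze_ql)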

Value

solve_MDP() returns an object of class POMDP, which is a list with the model specification (model) and the solution (solution). The solution is a list; the policy it contains can be extracted with policy() (see the examples).
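
A short sketch of inspecting the returned object (mirroring the examples below):

library("pomdp")
data(Maze)

maze_solved <- solve_MDP(Maze)
str(maze_solved$solution, max.level = 1)  # components of the solution list
policy(maze_solved)                       # extract the policy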

Author(s)

Michael Hahsler

References

Russell, S., Norvig, P. (2021). Artificial Intelligence: A Modern Approach. Fourth edition. Prentice Hall.

Sutton, R. S., Barto, A. G. (2020). Reinforcement Learning: An Introduction. Second edition. The MIT Press.

See Also

Other solver: solve_POMDP(), solve_SARSOP()

Other MDP: MDP(), MDP2POMDP, MDP_policy_functions, accessors, actions(), add_policy(), gridworld, reachable_and_absorbing, regret(), simulate_MDP(), transition_graph(), value_function()

Examples

data(Maze)
Maze

# use value iteration
maze_solved <- solve_MDP(Maze, method = "value_iteration")
maze_solved
policy(maze_solved)

# plot the value function U
plot_value_function(maze_solved)

# Maze solutions can be visualized
gridworld_plot_policy(maze_solved)

# use modified policy iteration
maze_solved <- solve_MDP(Maze, method = "policy_iteration")
policy(maze_solved)

# finite horizon
maze_solved <- solve_MDP(Maze, method = "value_iteration", horizon = 3)
policy(maze_solved)
gridworld_plot_policy(maze_solved, epoch = 1)
gridworld_plot_policy(maze_solved, epoch = 2)
gridworld_plot_policy(maze_solved, epoch = 3)

# create a random policy where action n is very likely and approximate
#  the value function. We change the discount factor to .9 for this.
Maze_discounted <- Maze
Maze_discounted$discount <- .9
pi <- random_MDP_policy(Maze_discounted, 
        prob = c(n = .7, e = .1, s = .1, w = 0.1))
pi

# compare the utility function for the random policy with the function for the optimal
#  policy found by the solver.
maze_solved <- solve_MDP(Maze)

MDP_policy_evaluation(pi, Maze, k_backup = 100)
MDP_policy_evaluation(policy(maze_solved), Maze, k_backup = 100)

# Note that the solver already calculates the utility function and returns it with the policy
policy(maze_solved)

# Learn a Policy using Q-Learning
maze_learned <- solve_MDP(Maze, method = "q_learning", N = 100)
maze_learned

maze_learned$solution
policy(maze_learned)
plot_value_function(maze_learned)
gridworld_plot_policy(maze_learned)
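
# Sarsa and Expected Sarsa use the same interface as Q-Learning.
# Note: the alpha, epsilon, and N values below are illustrative, not tuned settings.
maze_sarsa <- solve_MDP(Maze, method = "sarsa", alpha = 0.3, epsilon = 0.2, N = 100)
policy(maze_sarsa)
gridworld_plot_policy(maze_sarsa)

maze_esarsa <- solve_MDP(Maze, method = "expected_sarsa", alpha = 0.3, epsilon = 0.2, N = 100)
policy(maze_esarsa)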

[Package pomdp version 1.2.3 Index]