ttt_qlearn {tictactoe}    R Documentation
Q-Learning for Training Tic-Tac-Toe AI
Description
Train a tic-tac-toe AI through Q-learning
Usage
ttt_qlearn(player, N = 1000L, epsilon = 0.1, alpha = 0.8, gamma = 0.99,
simulate = TRUE, sim_every = 250L, N_sim = 1000L, verbose = TRUE)
Arguments
player: AI player to train
N: number of episodes, i.e., training games
epsilon: fraction of random exploration moves
alpha: learning rate
gamma: discount factor
simulate: if TRUE, conduct simulations during training
sim_every: conduct a simulation after this many training games
N_sim: number of simulation games
verbose: if TRUE, a progress report is shown
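For illustration only, a call that spells out every default from Usage (equivalent to simply ttt_qlearn(q)) might look like:

q <- ttt_ai()
ttt_qlearn(q, N = 1000L, epsilon = 0.1, alpha = 0.8, gamma = 0.99,
           simulate = TRUE, sim_every = 250L, N_sim = 1000L, verbose = TRUE)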
Details
This function implements Q-learning to train a tic-tac-toe AI player. It is designed to train one AI player, which plays against itself to update its value and policy functions.
The employed algorithm is Q-learning with epsilon-greedy exploration.
For each state s, the player updates its value evaluation by

V(s) = (1 - \alpha) V(s) + \alpha \gamma \max_{s'} V(s')

if it is the first player's turn; if it is the other player's turn, \max is replaced by \min. Here s' spans all states reachable from s. The policy function is updated analogously: it records the set of actions that reach the s' maximizing (or minimizing) V(s').
The parameter \alpha controls the learning rate, and \gamma is the discount factor (an earlier win is valued more highly than a later one).
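As an illustration only, the backup above could be sketched in R roughly as follows; the value table V (a named numeric vector of state values) and the helper next_states() are hypothetical stand-ins, not objects provided by the package:

update_value <- function(V, s, alpha = 0.8, gamma = 0.99, first_turn = TRUE) {
  succ <- V[next_states(s)]                           # values of all states reachable from s
  best <- if (first_turn) max(succ) else min(succ)    # max on player 1's turn, min on player 2's
  V[s] <- (1 - alpha) * V[s] + alpha * gamma * best   # the update rule given above
  V
}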
The player then chooses its next action by the \epsilon-greedy method: it follows its policy with probability 1 - \epsilon and makes a random move with probability \epsilon. The parameter \epsilon thus controls the fraction of exploratory moves.
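A minimal sketch of this selection rule, assuming hypothetical vectors policy_moves (moves recommended by the current policy) and legal_moves (all legal moves in the current position):

choose_move <- function(policy_moves, legal_moves, epsilon = 0.1) {
  pool <- if (runif(1) < epsilon) legal_moves else policy_moves   # explore vs. exploit
  pool[sample(length(pool), 1L)]                                  # one move, uniformly at random
}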
At the end of a game, the player sets the value of the final state to 100 (if the first player wins), -100 (if the second player wins), or 0 (if the game is a draw).
This learning process is repeated for N training games.
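The terminal assignment can be written, for illustration, as a small helper; the result coding ("win1", "win2", "draw") is an assumption of this sketch, not part of the package:

terminal_value <- function(result) {
  switch(result,
         win1 = 100,    # first player wins
         win2 = -100,   # second player wins
         draw = 0)      # game ends in a draw
}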
When simulate is set to TRUE, a simulation of N_sim games is conducted after every sim_every training games. This is useful for observing the progress of training: in general, as the AI gets smarter, games tend to end in a draw more often.
See Sutton and Barto (1998) for more about Q-learning.
Value
A data.frame of simulation outcomes, if any.
References
Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. The MIT Press (1998).
Examples
p <- ttt_ai()                  # create a fresh AI player
o <- ttt_qlearn(p, N = 200)    # train it with 200 games of Q-learning
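## The object returned above holds the simulation outcomes described under
## Value; its exact columns are not documented here, so simply inspect it:
head(o)
str(o)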