RL - Implementation of n-step SARSA, n-step TreeBackup and n-step Q-sigma in a simple 10x10 grid world
Start in S
, move V
steps at a time (where V
is the velocity) in some direction.
V is between 0 and 3.
If you hit a wall X
, go back to S
and V
is reset to 0
If you go through the check point #
, V can be between 0 and 5.
If you hit the goal G
, the episode is over and you receive +1000 reward.
_ | _ | _ | _ | _ | _ | _ | _ | _ | G |
_ | X | X | X | X | X | X | X | X | X |
_ | _ | _ | _ | _ | _ | _ | _ | _ | _ |
_ | _ | _ | _ | _ | _ | _ | _ | _ | _ |
_ | _ | _ | _ | _ | _ | _ | _ | _ | _ |
X | _ | _ | _ | _ | _ | _ | _ | _ | _ |
_ | _ | _ | # | _ | _ | _ | _ | _ | _ |
_ | _ | _ | _ | _ | _ | _ | _ | _ | _ |
_ | _ | _ | _ | _ | _ | _ | _ | _ | _ |
S | _ | _ | _ | _ | _ | _ | _ | _ | _ |
- move RIGHT & Velocity -1 -- 0
- move RIGHT & Velocity +0 -- 1
- move RIGHT & Velocity +1 -- 2
- move UP & Velocity -1 -- 3
- move UP & Velocity +0 -- 4
- move UP & Velocity +1 -- 5
- move LEFT & Velocity -1 -- 6
- move LEFT & Velocity +0 -- 7
- move LEFT & Velocity +1 -- 8
- GOAL = +1000
- WALL = -10
- STEP = -1
- CHECK POINT = 0
python main.py <algo> <flags>
Algo:
- "SARSA" (n-step)
- "TreeBackup" (n-step)
- "Qsigma" (n-step)
- q_mode = "rnd": at each time step, chose a sigma randomly:
p = 0.5
- q_mode = "alt": alternate sigmas for each time step: 1, 0, 1, 0, 1, 0, ...
- q_mode = "inc": P(sigma=1) increases at each time step t:
p = 1 - e^{-log_base_(1+t)}
- q_mode = "dec": P(sigma=1) decreases at each time step t:
p = e^{-log_base_(1+t)}
- q_mode = "rnd": at each time step, chose a sigma randomly:
Flags:
-ne #
number of episodes to train for (default=1000)-ns #
number of steps to look forward: it's the n in n-step algorithms (default=5)-g #
gamma, discount factor (default=0.99)-a #
alpha, learning rate (default=0.10)-e #
epsilon, policy stochasticity, proba of chosing a random non-greddy action (default=0.10)-b #
beta, environement stochasticity, proba of not updatingV
no matter the action (default=0.00)-q_mode <'rnd' or 'alt' or 'inc' or 'dec'>
mode of Qsigma (default=alt)base #
log base used to calculate p in Qsigma-inc or Qsigma-dec