From 51734fe7567cce7e27c58c4ff399b4ef81302755 Mon Sep 17 00:00:00 2001
From: James MacGlashan
Beware: I’m entering into controversial territory here! There are methods that use knowledge of the transition dynamics that some people will still call “RL.” Like most things, lines get blurry. For example, the authors of AlphaGo called their method “RL” even though they used MCTS and the dynamics of the game were known and provided to the algorithm. Usually, when people call “planning” methods “RL” methods, they do so because the method makes limited use of its knowledge of the transition dynamics, typically for computational reasons. Nevertheless, methods like MCTS require the agent to have knowledge of how the environment is affected by its actions, and they exploit this knowledge. For this reason, I still call them “planning” methods. Call me a curmudgeon if you must. ↩︎
Training the transition function to predict $T(s_2 | s_1, a_1)$ can be a little complicated, because $T(s_2 | s_1, a_1)$ is meant to represent the probability of reaching state $s_2$. If we can assume the environment is deterministic with small levels of noise, we can train a function to predict $s_2$ given $s_1$ and $a_1$, and assign its output probability 1 (a sketch of this deterministic case follows these notes). If the environment is more stochastic, things get harder. However, generative ML keeps getting better, and while generative models often don’t provide exact probabilities, there are many planning methods that only require a generative model and do not need a full probability distribution. ↩︎
If your data is generated by some other policy, be it an exploration policy, older versions of your policy, or maybe even some other expert, then you will need an off-policy method. Since the experience replay buffer is dominated by data generated by earlier versions of the agent’s policy, you will need an off-policy method to do policy evaluation/improvement from it.
You may have noticed I keep naming two cases where on-policy/off-policy is relevant: policy evaluation and policy improvement. For most algorithms, both the evaluation and the improvement will be on-policy or off-policy. However, evaluation and improvement are two distinct steps, and you could have one be on-policy while the other is off-policy. PPO is an example where the policy evaluation is on-policy while the improvement is off-policy. Although PPO does not use an experience replay buffer, its policy improvement requires an off-policy method for the same reason you need an off-policy method when using data from an experience replay buffer: it’s improving the actor policy using data from an earlier policy.
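To make the deterministic case in the note above concrete, here is a minimal sketch of fitting a next-state predictor by regression. It is my own illustration, not code from the post: the PyTorch dependency, network sizes, and the random placeholder transitions are all assumptions.

```python
# Minimal sketch (not from the post): fit a deterministic transition model
# s_next ≈ f(s, a) by regression, assuming low-noise dynamics as described above.
# Shapes, architecture, and the random "logged transitions" are illustrative placeholders.
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2
model = nn.Sequential(
    nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
    nn.Linear(64, state_dim),   # predicted next state, treated as a probability-1 outcome
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Placeholder batch of transitions (s, a, s'); in practice these come from the agent's experience.
s = torch.randn(256, state_dim)
a = torch.randn(256, action_dim)
s_next = torch.randn(256, state_dim)

for _ in range(100):
    pred = model(torch.cat([s, a], dim=-1))
    loss = nn.functional.mse_loss(pred, s_next)
    opt.zero_grad()
    loss.backward()
    opt.step()
```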
First, let’s give some definitions for policy evaluation/improvement. These terms come from the steps of policy iteration, the foundation for many RL methods. In policy iteration, you repeat two steps until the policy stops improving (a short tabular sketch appears below).
A simple heuristic
Off-policy vs on-policy
Evaluation vs improvement and the strange case of PPO
Evaluation vs improvement
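Returning to the policy-iteration definition above: here is a minimal tabular sketch of the two repeated steps, policy evaluation and policy improvement, on a made-up two-state MDP. It is purely illustrative and assumes NumPy; nothing about it comes from the posts.

```python
# Minimal sketch (not from the posts) of policy iteration's two repeated steps.
import numpy as np

n_s, n_a, gamma = 2, 2, 0.9
P = np.random.dirichlet(np.ones(n_s), size=(n_s, n_a))  # P[s, a] = distribution over next states
R = np.random.rand(n_s, n_a)                             # R[s, a] = expected reward
pi = np.zeros(n_s, dtype=int)                            # deterministic policy: state -> action

while True:
    # Step 1: policy evaluation -- solve V = R_pi + gamma * P_pi V for the current policy.
    P_pi = P[np.arange(n_s), pi]
    R_pi = R[np.arange(n_s), pi]
    V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)
    # Step 2: policy improvement -- act greedily with respect to the evaluated values.
    Q = R + gamma * P @ V
    new_pi = Q.argmax(axis=1)
    if np.array_equal(new_pi, pi):   # stop once the policy no longer improves
        break
    pi = new_pi
```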
diff --git a/public/posts/q_learning_doesnt_need_importance_sampling/index.html b/public/posts/q_learning_doesnt_need_importance_sampling/index.html
index 502f151..d1534cb 100644
--- a/public/posts/q_learning_doesnt_need_importance_sampling/index.html
+++ b/public/posts/q_learning_doesnt_need_importance_sampling/index.html
@@ -183,10 +183,10 @@
Why off-pol
$$
V^\pi(s) = \sum_a \pi(a | s) Q(s, a)
$$
If the algorithm estimates the Q-function, it can recover V from Q, allowing it to be off-policy much like Q-learning is. And many off-policy actor-critic algorithms do just that. Examples include SAC, TD3, and DDPG.
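As a quick illustration of the identity above, here is a minimal sketch (with made-up numbers) of recovering $V^\pi(s)$ from Q-values by averaging under the evaluated policy’s action probabilities:

```python
# Minimal sketch (not from the post): recover V from Q for one state s.
# `q` and `policy_probs` are placeholders standing in for learned estimates.
import numpy as np

q = np.array([1.0, 2.5, 0.5])             # Q(s, a) for each action in state s
policy_probs = np.array([0.2, 0.7, 0.1])  # pi(a | s) for the policy we want to evaluate

v = np.dot(policy_probs, q)               # V^pi(s) = sum_a pi(a|s) Q(s, a)
print(v)                                  # 2.0
```

Because the weighting uses the evaluated policy’s probabilities rather than the behavior policy’s, no importance sampling correction is needed.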
-However, other off-policy actor-critic algorithms, like IMPALA4 do not. Instead, they directly use sampled returns to estimate $V$. Consequently, they need to correct for the sample distribution difference with importance sampling, much like we did in our bandit example.
+However, other off-policy actor-critic algorithms, like IMPALA, do not.4 Instead, they directly use sampled returns to estimate $V$. Consequently, they need to correct for the sample distribution difference with importance sampling, much like we did in our bandit example.
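To show the kind of correction being described, here is a minimal sketch using plain importance sampling over sampled returns (not IMPALA’s alternative weighting discussed in the footnote below); the bandit-style policies and returns are made-up numbers:

```python
# Minimal sketch (not from the post): estimate V^pi from returns sampled under a
# different behavior policy mu by reweighting each sample with pi(a|s) / mu(a|s).
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, 0.5])          # behavior policy that generated the data
pi = np.array([0.9, 0.1])          # target policy we want to evaluate
true_mean = np.array([1.0, 3.0])   # expected return per action (unknown to the agent)

actions = rng.choice(2, size=10000, p=mu)
returns = true_mean[actions] + rng.normal(0, 0.1, size=10000)

weights = pi[actions] / mu[actions]      # importance sampling ratios
v_est = np.mean(weights * returns)       # ordinary importance sampling estimate of V^pi
print(v_est)                             # close to 0.9*1.0 + 0.1*3.0 = 1.2
```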
You might wonder why we don’t always estimate Q-values if we want to do off-policy learning. After all, it was probably the simpler approach you first imagined when I described the simple bandit problem. It also has the nice property that you don’t have to know what the probabilities of the behavior policy were (e.g., Alice’s policy $\mu$ in our bandit example). You only have to know the probabilities of the policy you want to evaluate.
-However, there are still some nice things about using state value estimates. First, if your action space is very large, maintaining separate estimates for each action can become problematic. If you’re using function approximation, you might try to avoid that problem by generalizing over the actions. That is in fact what SAC, TD3, and DDPG do. But if you’re introducing function approximation across your actions, now you’ve opened the door for more biased estimates for each action. Furthermore, you can only really do one-step updates where you bootstrap from the next state’s Q-values5 and that adds another source of bias. These sources of bias are non-trivial – very often if algorithms like SAC fall apart, it’s because of bias in the Q-function estimate. For these reasons, estimating the state value function and using importance sampling may be preferable.
+However, there are still some nice things about using state value estimates. First, if your action space is very large, maintaining separate estimates for each action can become problematic. If you’re using function approximation, you might try to avoid that problem by generalizing over the actions. That is in fact what SAC, TD3, and DDPG do. But if you’re introducing function approximation across your actions, now you’ve opened the door for more biased estimates for each action. Furthermore, you can only really do one-step updates where you bootstrap from the next state’s Q-values5 and that adds another source of bias. These sources of bias are non-trivial – very often if algorithms like SAC fall apart, it’s because of bias in the Q-function estimate. For these reasons, estimating the state value function and using importance sampling may be preferable.
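For reference, this is roughly what the one-step bootstrapped target mentioned above looks like; the numbers are placeholders, and any bias in the next state’s Q-estimates flows directly into the target:

```python
# Minimal sketch (placeholder numbers): a one-step target that bootstraps from the
# next state's Q-values under the current policy.
import numpy as np

gamma = 0.99
r = 1.0                                        # reward observed after taking a in s
q_next = np.array([0.4, 1.2])                  # estimated Q(s', a') for each next action
pi_next = np.array([0.3, 0.7])                 # pi(a' | s') for the current policy

target = r + gamma * np.dot(pi_next, q_next)   # bias in q_next leaks into the target
print(target)
```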
Or if there are ties for the highest Q-value, an optimal policy is any division of the probability between the actions that tie for the highest value. ↩︎
-IMPALA doesn’t actually use the standard importance sampling correction ratios we discussed. They use an alternative sample weighting that tends to mitigate problems with compounding probabilities. ↩︎
+IMPALA doesn’t actually use the standard importance sampling correction ratios we discussed. They use an alternative sample weighting that tends to mitigate problems with compounding probabilities when you have to correct the distribution mismatch of larger trajectories. ↩︎
Many works, including my group’s work on GT Sophy, use $n$-step Q-learning updates in practice. This is technically wrong, because it doesn’t correct for any off-policy differences within the $n$ steps, only at the end of them. Nevertheless, this approach tends to be useful in practice for reasons I won’t get into here. As long as $n$ is small, the off-policy error tends not to be too bad. ↩︎
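A minimal sketch of the uncorrected $n$-step target the footnote describes, with made-up rewards and bootstrap value; note that nothing reweights the intermediate rewards for off-policy differences:

```python
# Minimal sketch (placeholder numbers): n-step Q-learning target with no per-step
# off-policy correction; the bootstrap happens only at the end of the n steps.
gamma, n = 0.99, 3
rewards = [0.5, 0.0, 1.0]   # r_t, ..., r_{t+n-1} collected under the behavior policy
q_boot = 2.0                # max_a Q(s_{t+n}, a), the bootstrap at the end

target = sum((gamma ** k) * r for k, r in enumerate(rewards)) + (gamma ** n) * q_boot
print(target)               # n-step target for Q(s_t, a_t)
```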
diff --git a/public/posts/why_does_the_policy_gradient_include_log_prob/index.html b/public/posts/why_does_the_policy_gradient_include_log_prob/index.html
index c9fa948..f557d08 100644
--- a/public/posts/why_does_the_policy_gradient_include_log_prob/index.html
+++ b/public/posts/why_does_the_policy_gradient_include_log_prob/index.html
@@ -130,7 +130,7 @@
-In this answer, we showed how REINFORCE comes to our rescue and allows us to use action samples to estimate the policy gradient. But there are other ways to address this problem. In particular, a method called reparameterization is a strong alternative. The soft actor-critic space of algorithms is perhaps the most well known setting where it is employed to estimate policy gradients. However, using reparameterization for policy gradients requires a different set of assumptions and trade offs relative to REINFORCE, so it’s not always the right pick. It particular, it often requires you to learn a Q-function to optimize the policy, and the learned Q-function introduces bias.
+In this answer, we showed how REINFORCE comes to our rescue and allows us to use action samples to estimate the policy gradient. But there are other ways to address this problem. In particular, a method called reparameterization is a strong alternative. The soft actor-critic space of algorithms is perhaps the most well-known setting where it is employed to estimate policy gradients. However, using reparameterization for policy gradients requires a different set of assumptions and trade-offs relative to REINFORCE, so it’s not always the right pick. Notably, it often requires you to learn a Q-function to optimize the policy, and the learned Q-function introduces bias.
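To make the comparison concrete, here is a minimal sketch of the reparameterization alternative, assuming a Gaussian policy and a stand-in differentiable Q-function (both invented for illustration; this is not SAC itself):

```python
# Minimal sketch (not from the post): reparameterized policy gradient through a
# sampled action, in contrast to REINFORCE's log-prob weighting.
import torch

state = torch.randn(8, 4)                    # batch of states (placeholder)
mean_net = torch.nn.Linear(4, 2)             # policy mean for a 2-D action
log_std = torch.zeros(2, requires_grad=True)

def q_fn(s, a):                              # stand-in for a learned, differentiable critic
    return -(a ** 2).sum(dim=-1) + s.sum(dim=-1)

mean = mean_net(state)
eps = torch.randn_like(mean)                 # noise drawn independently of the parameters
action = mean + eps * log_std.exp()          # reparameterized sample: a = mu + sigma * eps
loss = -q_fn(state, action).mean()           # ascend Q by descending its negative
loss.backward()                              # gradients flow through the action to mean_net, log_std
```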