From 51734fe7567cce7e27c58c4ff399b4ef81302755 Mon Sep 17 00:00:00 2001
From: James MacGlashan
Beware: I’m entering into controversial territory here! There are methods that use knowledge of the transition dynamics that some people will still call “RL.” Like most things, lines get blurry. For example, the authors of AlphaGo called their method “RL” even though they used MCTS and the dynamics of the game were known and provided to the algorithm. Usually, when people call “planning” methods “RL” methods, they do so because the method makes limited use of its knowledge of the transition dynamics, typically for computational reasons. Nevertheless, methods like MCTS require the agent to have knowledge of how the environment is affected by its actions, and they exploit this knowledge. For this reason, I still call them “planning” methods. Call me a curmudgeon if you must. ↩︎
Training the transition function to predict $T(s_2 | s_1, a_1)$ can be a little complicated, because $T(s_2 | s_1, a_1)$ is meant to represent the probability of reaching state $s_2$. If we can assume the environment is deterministic with small levels of noise, we can train a function to predict $s_2$ given $s_1$ and $a_1$, and assign its output probability 1 (a sketch of this deterministic case follows these notes). If the environment is more stochastic, things get harder. However, generative ML keeps getting better, and while generative models often don’t provide exact probabilities, there are many planning methods that only require a generative model and do not need a full probability distribution. ↩︎
If your data is generated by some other policy, be it an exploration policy, older versions of your policy, or maybe even some other expert, then you will need an off-policy method. Since the experience replay buffer is dominated by data generated by earlier versions of the agent’s policy, you will need an off-policy method to do policy evaluation/improvement from it.
You may have noticed I keep naming two cases where on-policy/off-policy is relevant: policy evaluation and policy improvement. For most algorithms, both the evaluation and the improvement will be on-policy or off-policy. However, evaluation and improvement are two distinct steps, and you could have one be on-policy while the other is off-policy. PPO is an example where the policy evaluation is on-policy while the improvement is off-policy. Although PPO does not use an experience replay buffer, its policy improvement requires an off-policy method for the same reason you need an off-policy method when using data from an experience replay buffer: it’s improving the actor policy using data from an earlier policy.
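To make the deterministic case in the note above concrete, here is a minimal sketch of fitting a next-state predictor by regression. It is my own illustration, not code from the post: the PyTorch dependency, network sizes, and the random placeholder transitions are all assumptions.

```python
# Minimal sketch (not from the post): fit a deterministic transition model
# s_next ≈ f(s, a) by regression, assuming low-noise dynamics as described above.
# Shapes, architecture, and the random "logged transitions" are illustrative placeholders.
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2
model = nn.Sequential(
    nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
    nn.Linear(64, state_dim),   # predicted next state, treated as a probability-1 outcome
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Placeholder batch of transitions (s, a, s'); in practice these come from the agent's experience.
s = torch.randn(256, state_dim)
a = torch.randn(256, action_dim)
s_next = torch.randn(256, state_dim)

for _ in range(100):
    pred = model(torch.cat([s, a], dim=-1))
    loss = nn.functional.mse_loss(pred, s_next)
    opt.zero_grad()
    loss.backward()
    opt.step()
```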
First, let’s give some definitions for policy evaluation/improvement. These terms come from the steps of policy iteration, the foundation for many RL methods. In policy iteration, you repeat two steps until the policy stops improving (a short tabular sketch appears below).
A simple heuristic
Off-policy vs on-policy
Evaluation vs improvement and the strange case of PPO
Evaluation vs improvement
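Returning to the policy-iteration definition above: here is a minimal tabular sketch of the two repeated steps, policy evaluation and policy improvement, on a made-up two-state MDP. It is purely illustrative and assumes NumPy; nothing about it comes from the posts.

```python
# Minimal sketch (not from the posts) of policy iteration's two repeated steps.
import numpy as np

n_s, n_a, gamma = 2, 2, 0.9
P = np.random.dirichlet(np.ones(n_s), size=(n_s, n_a))  # P[s, a] = distribution over next states
R = np.random.rand(n_s, n_a)                             # R[s, a] = expected reward
pi = np.zeros(n_s, dtype=int)                            # deterministic policy: state -> action

while True:
    # Step 1: policy evaluation -- solve V = R_pi + gamma * P_pi V for the current policy.
    P_pi = P[np.arange(n_s), pi]
    R_pi = R[np.arange(n_s), pi]
    V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)
    # Step 2: policy improvement -- act greedily with respect to the evaluated values.
    Q = R + gamma * P @ V
    new_pi = Q.argmax(axis=1)
    if np.array_equal(new_pi, pi):   # stop once the policy no longer improves
        break
    pi = new_pi
```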
diff --git a/public/posts/q_learning_doesnt_need_importance_sampling/index.html b/public/posts/q_learning_doesnt_need_importance_sampling/index.html
index 502f151..d1534cb 100644
--- a/public/posts/q_learning_doesnt_need_importance_sampling/index.html
+++ b/public/posts/q_learning_doesnt_need_importance_sampling/index.html
@@ -183,10 +183,10 @@
Why off-pol
$$
V^\pi(s) = \sum_a \pi(a | s) Q(s, a)
$$
If the algorithm estimates the Q-function, it can recover V from Q, allowing it to be off-policy much like Q-learning is. And many off-policy actor-critic algorithms do just that. Examples include SAC, TD3, and DDPG.
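As a quick illustration of the identity above, here is a minimal sketch (with made-up numbers) of recovering $V^\pi(s)$ from Q-values by averaging under the evaluated policy’s action probabilities:

```python
# Minimal sketch (not from the post): recover V from Q for one state s.
# `q` and `policy_probs` are placeholders standing in for learned estimates.
import numpy as np

q = np.array([1.0, 2.5, 0.5])             # Q(s, a) for each action in state s
policy_probs = np.array([0.2, 0.7, 0.1])  # pi(a | s) for the policy we want to evaluate

v = np.dot(policy_probs, q)               # V^pi(s) = sum_a pi(a|s) Q(s, a)
print(v)                                  # 2.0
```

Because the weighting uses the evaluated policy’s probabilities rather than the behavior policy’s, no importance sampling correction is needed.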
-However, other off-policy actor-critic algorithms, like IMPALA4 do not. Instead, they directly use sampled returns to estimate $V$. Consequently, they need to correct for the sample distribution difference with importance sampling, much like we did in our bandit example.
+However, other off-policy actor-critic algorithms, like IMPALA, do not.4 Instead, they directly use sampled returns to estimate $V$. Consequently, they need to correct for the sample distribution difference with importance sampling, much like we did in our bandit example.
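To show the kind of correction being described, here is a minimal sketch using plain importance sampling over sampled returns (not IMPALA’s alternative weighting discussed in the footnote below); the bandit-style policies and returns are made-up numbers:

```python
# Minimal sketch (not from the post): estimate V^pi from returns sampled under a
# different behavior policy mu by reweighting each sample with pi(a|s) / mu(a|s).
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, 0.5])          # behavior policy that generated the data
pi = np.array([0.9, 0.1])          # target policy we want to evaluate
true_mean = np.array([1.0, 3.0])   # expected return per action (unknown to the agent)

actions = rng.choice(2, size=10000, p=mu)
returns = true_mean[actions] + rng.normal(0, 0.1, size=10000)

weights = pi[actions] / mu[actions]      # importance sampling ratios
v_est = np.mean(weights * returns)       # ordinary importance sampling estimate of V^pi
print(v_est)                             # close to 0.9*1.0 + 0.1*3.0 = 1.2
```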
You might wonder why we don’t always estimate Q-values if we want to do off-policy learning. After all, it was probably the simpler approach you first imagined when I described the simple bandit problem. It also has the nice property that you don’t have to know what the probabilities of the behavior policy were (e.g., Alice’s policy $\mu$ in our bandit example). You only have to know the probabilities of the policy you want to evaluate.
-However, there are still some nice things about using state value estimates. First, if your action space is very large, maintaining separate estimates for each action can become problematic. If you’re using function approximation, you might try to avoid that problem by generalizing over the actions. That is in fact what SAC, TD3, and DDPG do. But if you’re introducing function approximation across your actions, now you’ve opened the door for more biased estimates for each action. Furthermore, you can only really do one-step updates where you bootstrap from the next state’s Q-values5 and that adds another source of bias. These sources of bias are non-trivial – very often if algorithms like SAC fall apart, it’s because of bias in the Q-function estimate. For these reasons, estimating the state value function and using importance sampling may be preferable.
+However, there are still some nice things about using state value estimates. First, if your action space is very large, maintaining separate estimates for each action can become problematic. If you’re using function approximation, you might try to avoid that problem by generalizing over the actions. That is in fact what SAC, TD3, and DDPG do. But if you’re introducing function approximation across your actions, now you’ve opened the door for more biased estimates for each action. Furthermore, you can only really do one-step updates where you bootstrap from the next state’s Q-values5 and that adds another source of bias. These sources of bias are non-trivial – very often if algorithms like SAC fall apart, it’s because of bias in the Q-function estimate. For these reasons, estimating the state value function and using importance sampling may be preferable.
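For reference, this is roughly what the one-step bootstrapped target mentioned above looks like; the numbers are placeholders, and any bias in the next state’s Q-estimates flows directly into the target:

```python
# Minimal sketch (placeholder numbers): a one-step target that bootstraps from the
# next state's Q-values under the current policy.
import numpy as np

gamma = 0.99
r = 1.0                                        # reward observed after taking a in s
q_next = np.array([0.4, 1.2])                  # estimated Q(s', a') for each next action
pi_next = np.array([0.3, 0.7])                 # pi(a' | s') for the current policy

target = r + gamma * np.dot(pi_next, q_next)   # bias in q_next leaks into the target
print(target)
```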
Or if there are ties for the highest Q-value, an optimal policy is any division of the probability between the actions that tie for the highest value. ↩︎
-IMPALA doesn’t actually use the standard importance sampling correction ratios we discussed. They use an alternative sample weighting that tends to mitigate problems with compounding probabilities. ↩︎
+IMPALA doesn’t actually use the standard importance sampling correction ratios we discussed. They use an alternative sample weighting that tends to mitigate problems with compounding probabilities when you have to correct the distribution mismatch of larger trajectories. ↩︎
Many works, including my group’s work on GT Sophy, use $n$-step Q-learning updates in practice. This is technically wrong, because it doesn’t correct for any off-policy differences within the $n$ steps, only at the end of them. Nevertheless, this approach tends to be useful in practice for reasons I won’t get into here. As long as $n$ is small, the off-policy error tends not to be too bad. ↩︎
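A minimal sketch of the uncorrected $n$-step target the footnote describes, with made-up rewards and bootstrap value; note that nothing reweights the intermediate rewards for off-policy differences:

```python
# Minimal sketch (placeholder numbers): n-step Q-learning target with no per-step
# off-policy correction; the bootstrap happens only at the end of the n steps.
gamma, n = 0.99, 3
rewards = [0.5, 0.0, 1.0]   # r_t, ..., r_{t+n-1} collected under the behavior policy
q_boot = 2.0                # max_a Q(s_{t+n}, a), the bootstrap at the end

target = sum((gamma ** k) * r for k, r in enumerate(rewards)) + (gamma ** n) * q_boot
print(target)               # n-step target for Q(s_t, a_t)
```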
diff --git a/public/posts/why_does_the_policy_gradient_include_log_prob/index.html b/public/posts/why_does_the_policy_gradient_include_log_prob/index.html
index c9fa948..f557d08 100644
--- a/public/posts/why_does_the_policy_gradient_include_log_prob/index.html
+++ b/public/posts/why_does_the_policy_gradient_include_log_prob/index.html
@@ -130,7 +130,7 @@
-In this answer, we showed how REINFORCE comes to our rescue and allows us to use action samples to estimate the policy gradient. But there are other ways to address this problem. In particular, a method called reparameterization is a strong alternative. The soft actor-critic space of algorithms is perhaps the most well known setting where it is employed to estimate policy gradients. However, using reparameterization for policy gradients requires a different set of assumptions and trade offs relative to REINFORCE, so it’s not always the right pick. It particular, it often requires you to learn a Q-function to optimize the policy, and the learned Q-function introduces bias.
+In this answer, we showed how REINFORCE comes to our rescue and allows us to use action samples to estimate the policy gradient. But there are other ways to address this problem. In particular, a method called reparameterization is a strong alternative. The soft actor-critic space of algorithms is perhaps the most well-known setting where it is employed to estimate policy gradients. However, using reparameterization for policy gradients requires a different set of assumptions and trade-offs relative to REINFORCE, so it’s not always the right pick. Notably, it often requires you to learn a Q-function to optimize the policy, and the learned Q-function introduces bias.
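To make the comparison concrete, here is a minimal sketch of the reparameterization alternative, assuming a Gaussian policy and a stand-in differentiable Q-function (both invented for illustration; this is not SAC itself):

```python
# Minimal sketch (not from the post): reparameterized policy gradient through a
# sampled action, in contrast to REINFORCE's log-prob weighting.
import torch

state = torch.randn(8, 4)                    # batch of states (placeholder)
mean_net = torch.nn.Linear(4, 2)             # policy mean for a 2-D action
log_std = torch.zeros(2, requires_grad=True)

def q_fn(s, a):                              # stand-in for a learned, differentiable critic
    return -(a ** 2).sum(dim=-1) + s.sum(dim=-1)

mean = mean_net(state)
eps = torch.randn_like(mean)                 # noise drawn independently of the parameters
action = mean + eps * log_std.exp()          # reparameterized sample: a = mu + sigma * eps
loss = -q_fn(state, action).mean()           # ascend Q by descending its negative
loss.backward()                              # gradients flow through the action to mean_net, log_std
```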