Typos
Co-authored-by: Phillip Lippe <[email protected]>
KiaraGrouwstra and phlippe authored Jun 29, 2022
1 parent 77466c3 commit c42eab4
Showing 7 changed files with 32 additions and 31 deletions.
4 changes: 2 additions & 2 deletions Reinforcement_Learning/rl_appendix.tex
@@ -12,7 +12,7 @@ \section{Deep RL in practice}
\item Highly complex tasks show whether a method scales with having lots of compute and training data available
\item Toy examples can point out difference between methods, so it is often good to have both a toy example, and a more complex, practical one
\end{itemize}
\item[Which parameters and architectures to test?] RL have been shown to be very sensitive to the selection of hyperparameters. Hence, you should also spend similar tuning efforts an \underline{all} your experiments, including the baseline, to ensure a fair comparison.
\item[Does a random seed affect my experiments?] Due to the high variance of the RL methods, we need to average all runs over sufficient amount of seeds. Furthermore, if we perform a gridsearch, we should always keep the seeds fixed for all hyperparameter settings, but in the final test, use a different set of random seeds to prevent overfitting on seeds.
\item[Which parameters and architectures to test?] RL have been shown to be very sensitive to the selection of hyperparameters. Hence, you should also spend similar tuning efforts on \underline{all} your experiments, including the baseline, to ensure a fair comparison.
\item[Does a random seed affect my experiments?] Due to the high variance of the RL methods, we need to average all runs over a sufficient amount of seeds. Furthermore, if we perform a gridsearch, we should always keep the seeds fixed for all hyperparameter settings, but in the final test, use a different set of random seeds to prevent overfitting on seeds.
\item[What to report?] Next to the mean and/or median performance, the spread of the result should be shown as well. Furthermore, it should represent what you want to show. For example, if we want to underline that a new method learns faster, we should show a plot over learning iterations instead of just the final performance.
\end{description}
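A minimal Python sketch of the seeding protocol described in the hunk above: the gridsearch reuses one fixed set of seeds for every hyperparameter setting, the final comparison uses fresh seeds, and mean plus spread are reported over learning iterations. The train_agent routine, environment name, and hyperparameter grid are hypothetical placeholders, not part of the notes.

```python
import numpy as np

# Hypothetical training routine: returns a learning curve of evaluation returns,
# one entry per training iteration (here just a placeholder curve).
def train_agent(env_name, hparams, seed):
    rng = np.random.default_rng(seed)
    return rng.normal(loc=100 * hparams["lr"], scale=10, size=50)

GRIDSEARCH_SEEDS = [0, 1, 2, 3, 4]        # kept fixed for every hyperparameter setting
FINAL_SEEDS = [100, 101, 102, 103, 104]   # fresh seeds to avoid overfitting on seeds

def evaluate(hparams, seeds):
    curves = np.stack([train_agent("CartPole-v1", hparams, s) for s in seeds])
    return curves.mean(axis=0), curves.std(axis=0)   # mean and spread per iteration

candidates = [{"lr": 1e-3}, {"lr": 3e-4}]
best = max(candidates, key=lambda h: evaluate(h, GRIDSEARCH_SEEDS)[0][-1])
mean_curve, std_curve = evaluate(best, FINAL_SEEDS)  # plot over iterations, not only the final value
```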
2 changes: 1 addition & 1 deletion Reinforcement_Learning/rl_introduction.tex
@@ -139,4 +139,4 @@ \subsection{Outline}
\item Section 4 (lecture slides 7 to 10) deals with policy-based RL introducing different techniques for approximating the optimal gradients in policy learning.
\item Model-based RL is discussed in section 5 (lecture slides 11 and 12), but in less details than the previous two.
\item The final chapter deals with partially-observable environments (Section 7, lecture 13), and how to take uncertainty into account.
\end{itemize}
\end{itemize}
8 changes: 4 additions & 4 deletions Reinforcement_Learning/rl_learning_with_approx.tex
@@ -25,8 +25,8 @@ \subsection{Types of function approximations}
\caption{Simple types of aggregations on a 2D state space. Note that we can combine aggregations, meaning that we use both a vertical and a horizontal aggregation.}
\label{fig:rl_approximate_value_based_aggregation}
\end{figure}
\item \textit{Radial basis functions} that takes the distance to a mean in the state space, e.g. $||\mu_i-s||$, as input features. We can model this by having multiple Gaussian, and weight their influence by $p(s)$. It enables us to have smoother transitions between close-by states, but might be problematic for far-away states. This is why it is often problematic in high-dimensional state spaces.
\item \textit{Fourier basis} where we take different frequencies to model $s$. This can provide a quite flexible feature set.
\item \textit{Radial basis functions} that take the distance to a mean in the state space, e.g. $||\mu_i-s||$, as input features. We can model this by having multiple Gaussians, and weight their influence by $p(s)$. It enables us to have smoother transitions between close-by states, but might be problematic for far-away states. This is why it is often problematic in high-dimensional state spaces.
\item \textit{Fourier basis} where we take different frequencies to model $s$. This can provide quite a flexible feature set.
\end{itemize}
Note that tabular RL can also be expressed by linear function approximation where we simply use $\bm{x}(s)=\left[\delta(s=s_1), \delta(s=s_2),...\right]$, and $\bm{w}$ therefore contains one parameter per state.
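To make the feature constructions from this hunk concrete (tabular one-hot, radial basis functions, Fourier basis), here is a small sketch for a one-dimensional state normalised to [0, 1]; the centres, width, and order are arbitrary illustrative choices.

```python
import numpy as np

def one_hot_features(s_index, n_states):
    # Tabular RL as linear function approximation: x(s) = [delta(s=s_1), delta(s=s_2), ...]
    x = np.zeros(n_states)
    x[s_index] = 1.0
    return x

def rbf_features(s, centers, sigma=0.25):
    # Gaussian bumps based on the distance ||mu_i - s||: smooth between nearby states,
    # but coverage degrades for far-away states and in high-dimensional state spaces.
    return np.exp(-(centers - s) ** 2 / (2 * sigma ** 2))

def fourier_features(s, order=4):
    # Cosines of increasing frequency over the normalised state.
    return np.cos(np.pi * np.arange(order + 1) * s)

s = 0.3
x = np.concatenate([rbf_features(s, np.linspace(0, 1, 5)), fourier_features(s)])
w = np.zeros_like(x)
v_hat = w @ x   # linear value estimate v_w(s) = w^T x(s)
```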

@@ -37,7 +37,7 @@ \subsection{Types of function approximations}
\end{itemize}
\subsection{Prediction objective for on-policy prediction}
\begin{itemize}
\item In the case that we perform a on-policy prediction (i.e. policy evaluation for a fixed policy), the state importance is based on the visit frequency of $\pi$. To arrive at $\mu$, we also have to distinguish between the tasks:
\item In the case that we perform an on-policy prediction (i.e. policy evaluation for a fixed policy), the state importance is based on the visit frequency of $\pi$. To arrive at $\mu$, we also have to distinguish between the tasks:
\begin{itemize}
\item If we have a continuing task (never ending), we get a stationary distribution at the point:
$$\mu_{\pi}(s)=\sum_{s'}\sum_{a}p(s|s',a)\pi(a|s')\mu_{\pi}(s')$$
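A quick numerical check of this fixed-point equation on a randomly generated toy MDP (the dynamics, policy, and state/action counts below are made up for illustration): build the state-to-state matrix under the policy and iterate until the distribution stops changing.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

# Made-up dynamics p(s'|s,a) and fixed policy pi(a|s), both row-stochastic.
P = rng.random((n_states, n_actions, n_states)); P /= P.sum(-1, keepdims=True)
pi = rng.random((n_states, n_actions));          pi /= pi.sum(-1, keepdims=True)

# State-to-state transition matrix under pi: P_pi[s, s'] = sum_a pi(a|s) p(s'|s,a)
P_pi = np.einsum("sa,sat->st", pi, P)

mu = np.full(n_states, 1.0 / n_states)
for _ in range(1000):               # iterate mu <- mu P_pi until stationary
    mu = mu @ P_pi

assert np.allclose(mu, mu @ P_pi)   # mu_pi(s) = sum_{s'} sum_a p(s|s',a) pi(a|s') mu_pi(s')
print(mu)
```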
@@ -164,7 +164,7 @@ \subsubsection{Alternatives to semi-gradients}

Semi-gradient TD is converging to the point where $PBE=0$ as we reach a fix-point there. However, this does not have to be where the minimum Bellman error is reached because imagine $\delta_{\bm{w}}$ being orthogonal to $\bm{w}$-subspace. Then, the projected bellman error is 0, but without projection, we would continue changing $\bm{w}$, until we reach $\min \overline{\text{BE}}$.

At the same time, even if we would reach $\min \overline{\text{BE}}$, it would be most likely not be a optimum (i.e. gradients greater than zero) because the gradients can point to outside the representable $\bm{w}$-space (does not need to be orthogonal as before), and hence the projected Bellman error can be unequals zero.
At the same time, even if we would reach $\min \overline{\text{BE}}$, it would most likely not be a optimum (i.e. gradients greater than zero) because the gradients can point to outside the representable $\bm{w}$-space (does not need to be orthogonal as before), and hence the projected Bellman error can be unequal to zero.

\item The last objective we consider here is the true-gradient TD error, meaning: $$\overline{\text{TDE}}(\bm{w})=\sum_{s\in\mathcal{S}}\mu(s)\E\left[\delta_t^2 |S_t=s,A_t\sim \pi\right] = \E_{b}[\rho_t \delta_t^2] \hspace{4mm}\text{(if we assume $\mu$ is under $b$)}$$
Following SGD updates, we get:
16 changes: 8 additions & 8 deletions Reinforcement_Learning/rl_model_based.tex
@@ -22,7 +22,7 @@ \subsection{Dyna-Q}
\begin{itemize}
\item Dyna makes two assumptions of the environment:
\begin{enumerate}
\item Our environment is deterministic, meaning that any transition probability are either 1 or 0.
\item Our environment is deterministic, meaning that any transition probabilities are either 1 or 0.
\item The state and action space is discrete and limited, so that we can store it in a tabular setting.
\end{enumerate}
Note that we can relax the first requirement slightly by storing e.g. how often we came from one state-action pair to another state.
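A compact tabular Dyna-Q sketch in the spirit of the loop described in this file, under the two assumptions above (deterministic transitions, small discrete state and action spaces). The classic gym-style env interface, the number of planning steps n, and the remaining hyperparameters are placeholders.

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=100, n_planning=10, alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)    # tabular action values Q[(s, a)]
    model = {}                # deterministic model: (s, a) -> (r, s')
    actions = list(range(env.action_space.n))

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False          # classic gym API assumed (reset returns obs)
        while not done:
            a = eps_greedy(s)
            s2, r, done, _ = env.step(a)      # real experience (4-tuple step assumed)
            target = r + (0.0 if done else gamma * max(Q[(s2, b)] for b in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])          # direct RL update
            model[(s, a)] = (r, s2)                            # update the learned model
            for _ in range(n_planning):                        # indirect RL: n simulated updates
                (ps, pa), (pr, ps2) = random.choice(list(model.items()))
                ptarget = pr + gamma * max(Q[(ps2, b)] for b in actions)  # terminal handling omitted
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q
```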
@@ -32,7 +32,7 @@ \subsection{Dyna-Q}
\caption{Real experience is generated by the interaction of the agent (according to the policy) and the environment. This real data is used to update our policy (direct RL), but at the same time update our model, from which we can generate new samples to learn from (indirect RL).}
\label{fig:rl_model_based_dyna_Q}
\end{figure}
\item The general overview of the idea is shown in Figure~\ref{fig:rl_model_based_dyna_Q}. We have two source from which we train our policy and/or value function: direct and indirect. The samples from the real environment are used to perform "direct Reinforcement Learning" as we use the actual samples to learn. At the same time, we use the real samples to update our model, and can generate from there as many samples as we want.
\item The general overview of the idea is shown in Figure~\ref{fig:rl_model_based_dyna_Q}. We have two sources from which we train our policy and/or value function: direct and indirect. The samples from the real environment are used to perform "direct Reinforcement Learning" as we use the actual samples to learn. At the same time, we use the real samples to update our model, and can generate from there as many samples as we want.
\item Written down as an algorithm, we arrive at Figure~\ref{fig:rl_model_based_dyna_Q_algorithm}. The parameter $n$ which specifies how often we train from the simulated/learned model compared to the real environment, is a hyperparameter and depends on the access of the environment (how expensive is it, etc.). However, it is usually $n\gg 1$.
\begin{figure}[ht!]
\centering
@@ -76,7 +76,7 @@ \subsubsection{What to update}
\begin{itemize}
\item The priority in the queue is given by the TD error we would get at the time we add the state in the queue. This supports that states with high errors, i.e. wrong estimates, are updated first.
\item To limit the queue, we can define a threshold $\theta$ over which the TD error has to be to add a state in the queue
\item In the simulation step (indirect RL), we perform updates based on the queue until either the queue is empty, or we reached a maximum of $n$ steps. If the queue is non-empty, it is kept for next iteration as well.
\item In the simulation step (indirect RL), we perform updates based on the queue until either the queue is empty, or we reached a maximum of $n$ steps. If the queue is non-empty, it is kept for the next iteration as well.
\item For this model, we require at least a sample/generative model because we need to be able to start at any state-action pair
\begin{figure}[ht!]
\centering
@@ -87,7 +87,7 @@ \subsubsection{What to update}
\item An alternative is performing \textbf{trajectory sampling} where we start from the start state (or sample one if multiple exist), and follow our current policy.
\begin{itemize}
\item While updating the more frequently visited states, we have the disadvantage of limited exploration because we highly focus on states of our distribution
\item Hence, if we have a (close to) deterministic environment, trajectory sampling might work well, but in a stochastic environment where we continuously have to explore, it might perform worse than uniformly sample any state-action pair
\item Hence, if we have a (close to) deterministic environment, trajectory sampling might work well, but in a stochastic environment where we continuously have to explore, it might perform worse than uniformly sampling any state-action pair
\item For this method, we only require a trajectory model, making it less complex
\end{itemize}
\end{itemize}
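A minimal sketch of the priority-queue bookkeeping described under "What to update": only state-action pairs whose absolute TD error exceeds a threshold are queued, the largest errors are processed first, and leftovers carry over to the next iteration. The threshold value and the backup routine are illustrative stand-ins.

```python
import heapq

THETA = 0.05    # threshold on |TD error| for entering the queue
pqueue = []     # max-heap via negated priorities: entries are (-|td_error|, (s, a))

def maybe_enqueue(s, a, td_error):
    # Only sufficiently "surprising" pairs (large error, i.e. wrong estimates) get queued.
    if abs(td_error) > THETA:
        heapq.heappush(pqueue, (-abs(td_error), (s, a)))

def planning_sweep(n, backup):
    # Pop at most n entries, highest error first; anything left in the queue
    # is kept for the next iteration.
    steps = 0
    while pqueue and steps < n:
        _, (s, a) = heapq.heappop(pqueue)
        backup(s, a)    # e.g. a Q-learning update drawn from the learned sample model
        steps += 1
```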
@@ -100,13 +100,13 @@ \subsubsection{How to update}
\subsubsection{When to plan}
\begin{itemize}
\item Currently, we only use the environment to generate new samples for training
\item Knowing the system dynamics can however be valuable in more than this situation. For example, we can easily plan ahead by trying out different actions, and observing the reward in simulation. Afterwards, we take the actions in real-world which gave the best result in the simulation
\item Knowing the system dynamics can however be valuable in more than this situation. For example, we can easily plan ahead by trying out different actions, and observing the reward in simulation. Afterwards, we take the actions in the real world which gave the best result in the simulation
\item This idea is used in Monte Carlo Tree Search algorithms, which we will discuss in more detail in Section~\ref{sec:MCTS_Alpha_Go}.
\end{itemize}
\subsection{Model-based policy search}
\begin{itemize}
\item In the previous discussion, we mainly focused on value-based updates. However, we could of course use policy-based methods as well.
\item Again, the decision of whether to use policy-based or value-based methods is based on multiple decisions. For example, if we need to learn a stochastic policy, or we have continuous actions, then we might want to use policy-based methods. In the case that we have discrete actions and aim for learning a deterministic, greedy policy, value-based methods are more suited because for policy gradient we require the policy to be smoothly changable/differentiable.
\item Again, the decision of whether to use policy-based or value-based methods is based on multiple decisions. For example, if we need to learn a stochastic policy, or we have continuous actions, then we might want to use policy-based methods. In the case that we have discrete actions and aim for learning a deterministic, greedy policy, value-based methods are more suited because for policy gradient we require the policy to be smoothly changeable/differentiable.
\item Let's assume that the reward is known for now (e.g. we have defined the reward for a problem by our own). Then, we could reformulate the transition function as:
$$s_{t+1} = f(s_t, a_t) + w$$
where $f(s_t, a_t)$ is a deterministic function that maps a state-action pair to a new state, and $w$ is additive noise (e.g. Gaussian for continuous states). To ensure this formulation to work well, we would require a mostly deterministic environment, as otherwise $f(s_t, a_t)$ cannot model the different outcomes.
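Not the notes' method, only an illustrative sketch of this formulation with a known reward: fit a (here linear) deterministic model for f from real transitions, then plan by rolling random action sequences through the learned model and executing the best first action, in the spirit of the "When to plan" idea above. The model class, horizon, action range, and reward_fn are arbitrary assumptions.

```python
import numpy as np

def fit_model(states, actions, next_states):
    # Least-squares fit of s_{t+1} = f(s_t, a_t) = W [s_t; a_t; 1]; the additive
    # noise w is simply averaged out by the regression.
    X = np.hstack([states, actions, np.ones((len(states), 1))])
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    return lambda s, a: np.concatenate([s, a, [1.0]]) @ W

def plan(f, reward_fn, s0, horizon=10, n_candidates=256, action_dim=1, rng=None):
    # Random shooting: evaluate candidate action sequences in the learned model
    # under the known reward, return the first action of the best sequence.
    rng = rng or np.random.default_rng()
    best_first, best_ret = None, -np.inf
    for _ in range(n_candidates):
        a_seq = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, ret = s0, 0.0
        for a in a_seq:
            s = f(s, a)
            ret += reward_fn(s, a)
        if ret > best_ret:
            best_first, best_ret = a_seq[0], ret
    return best_first
```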
@@ -193,5 +193,5 @@ \subsubsection{AlphaGo Zero}
\caption{Self-play RL in AlphaGo zero.}
\label{fig:rl_model_based_alphago_zero_selfplay}
\end{figure}
\item Nevertheless, the training might not be 100\% stable. In a small amount of times, it can happen that the network diverges. To prevent this, we evaluate the network every $n$ steps by playing against itself/an older version of itself. If the policy did not improve (i.e. loosing more games than winning against older version), we throw away the new model and start again from the old weights.
\end{itemize}
\item Nevertheless, the training might not be 100\% stable. In a small amount of times, it can happen that the network diverges. To prevent this, we evaluate the network every $n$ steps by playing against itself/an older version of itself. If the policy did not improve (i.e. losing more games than winning against older version), we throw away the new model and start again from the old weights.
\end{itemize}
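The evaluation gate from the last bullet as a tiny sketch: every n training steps the candidate network plays a batch of games against the current best, and the new weights are kept only if they win the majority. The play_game routine and the number of evaluation games are hypothetical stand-ins.

```python
import copy

def maybe_accept(candidate_net, best_net, play_game, n_games=100):
    # play_game(a, b) is assumed to return 1 if network a wins and 0 otherwise.
    wins = sum(play_game(candidate_net, best_net) for _ in range(n_games))
    # Keep the candidate only if it wins the majority; otherwise fall back to the old weights.
    return candidate_net if wins > n_games / 2 else copy.deepcopy(best_net)
```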
10 changes: 5 additions & 5 deletions Reinforcement_Learning/rl_partially_observable.tex
@@ -32,18 +32,18 @@ \subsection{Markov functions and histories}
\subsubsection{Sample Markov functions}
\begin{itemize}
\item The simplest function is the identity, meaning $s_t=H_t$. However, this is neither compact, nor can we use it in any tabular policy setting efficiently (all possible sequences must be stored).
\item We can define an probability distribution over the latent space $X$, and try to do Bayesian inference (i.e. finding the posterior). This is done by calculating a \textbf{belief state}:
\item We can define a probability distribution over the latent space $X$, and try to do Bayesian inference (i.e. finding the posterior). This is done by calculating a \textbf{belief state}:
\begin{equation*}
\begin{split}
s'(x')=p(x'|o',a,s)& =\frac{p(o'|x',a,s)p(x'|a,s)}{p(o'|a,s)}\\
& = \frac{p(o'|x',a)p(x'|a,s)}{p(o'|a,s)}\hspace{5mm}\text{(removing $s$ as $x'$ is given as true state)}\\
& = \frac{p(o'|a,x')\overbrace{\sum_x p(x'|x,a)s(x)}^{=p(x'|a,s)}}{\sum_{x'}p(o'|a,x')\sum_x p(x'|x,a)s(x)}
\end{split}
\end{equation*}
where $s(x)$ is the old belief (i.e. belief over $x$ from last step). We can further define $p(o'|a,x')$ as the \underline{observation model} (i.e. what do I see from the latent space), and $p(x'|x,a)$ as the \underline{transition model} (i.e. how likely is it to move from one latent state the another). If we know these model dynamics by a full model description (or can estimate them), we have as state the probability distribution over latent state $x$.
where $s(x)$ is the old belief (i.e. belief over $x$ from last step). We can further define $p(o'|a,x')$ as the \underline{observation model} (i.e. what do I see from the latent space), and $p(x'|x,a)$ as the \underline{transition model} (i.e. how likely is it to move from one latent state to another). If we know these model dynamics by a full model description (or can estimate them), we have as state the probability distribution over latent state $x$.

This method is the classical approach for POMDPs, as it is compact, can be updated recursively and is easily interpretable by a human. However, the disadvantages are that we need the underlying model (not always given), and that it is only feasible for a discrete latent state (otherwise sums become integrals etc.).
\item As last example, we can consider the obvious approach of determining all the observation probabilities:
\item As a last example, we can consider the obvious approach of determining all the observation probabilities:
$$f(h)=\begin{bmatrix}
f_{o_1a_1}(h)\\ f_{o_1a_2}(h)\\\vdots\\f_{o_2a_1}(h)\\\vdots\\
\end{bmatrix}\hspace{5mm}\text{where}\hspace{5mm}f_{oa}(h)=\Prob{O_{t+1}=o|H=h,A_t=a}$$
@@ -56,7 +56,7 @@ \subsubsection{Approximations with non-Markov functions}
\item Alternatively, we can also consider non-Markov functions which cannot guarantee to find the optimal policy, but at least an approximate one
\item The simplest method here is just using the last state, $S_t=O_t$. However, this might not contain all the information we need (e.g. in Atari games, movement cannot be captured), and is often not compact (still have the whole screen)

A slight improvement is stacking a few observations, as in Atari games. This allows us to observe movement, but we still loose long-term dependencies.
A slight improvement is stacking a few observations, as in Atari games. This allows us to observe movement, but we still lose long-term dependencies.
\item We can also apply RNNs which take $O_t$ and $A_t$ as input including the last state $S_{t-1}$, and generate a new state $S_t$. This feature extractor can be learned end-to-end, and applied to a wide range of environments. However, the training might be a bit tricky in terms of hyperparameter tuning.
\end{itemize}

@@ -84,4 +84,4 @@ \subsubsection{Bayesian Adaptive MDP and Meta-reinforcement learning}
\item We can show that to find the optimal strategy for finding the best policy in a unknown MDP can be learned by sampling from the prior over MDPs, and use simple gradient estimates
\item Hence, with a prior over MDPs, optimal exploration can be phrased as greedy behaviour in an augmented MDP, where the hyperstates include the unknown transition and reward probabilities
\item Such techniques are investigated under the term \textbf{Meta-reinforcement learning}. Here the agent is not told which exact MDP it gets, but has to learn patterns across MDPs, and find the optimal way of exploring.
\end{itemize}
\end{itemize}
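The discrete belief-state update above written out as a short sketch; the latent/observation sizes, action count, and the random observation and transition models are placeholders for a full POMDP description.

```python
import numpy as np

def belief_update(belief, action, obs, trans_model, obs_model):
    # One step of s'(x') = p(o'|a,x') * sum_x p(x'|x,a) s(x), normalised by p(o'|a,s).
    #   belief:      (n_latent,)                      old belief s(x)
    #   trans_model: (n_actions, n_latent, n_latent)  trans_model[a, x, x'] = p(x'|x,a)
    #   obs_model:   (n_actions, n_latent, n_obs)     obs_model[a, x', o']  = p(o'|a,x')
    predicted = belief @ trans_model[action]        # p(x'|a,s) = sum_x p(x'|x,a) s(x)
    unnorm = obs_model[action][:, obs] * predicted  # numerator p(o'|a,x') p(x'|a,s)
    return unnorm / unnorm.sum()                    # denominator p(o'|a,s)

# Tiny usage example with made-up models (2 actions, 4 latent states, 3 observations).
rng = np.random.default_rng(0)
T = rng.random((2, 4, 4)); T /= T.sum(-1, keepdims=True)
O = rng.random((2, 4, 3)); O /= O.sum(-1, keepdims=True)
b = np.full(4, 0.25)
b = belief_update(b, action=0, obs=1, trans_model=T, obs_model=O)   # new belief, sums to 1
```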