ctutest-MDP-Theory.tex


\chapter{Fully Observable Markov Decision Processes}

The Markov decision processes (MDPs) are a part of the optimization problem solvers. The term MDP was probably first used in \cite{cite:1} and came from the name of Russian mathematician Andrey Markov who researched the stochastic processes on which MDPs are based. The MDP problems are described with the environment. The environment consists of fully observable states and actions (with a corresponding stochastic transition model) for which the agent either pays a cost or receives a reward. The agent is an entity that lives inside of an MDP environment, and its task is to maximize the reward obtained for its actions. With the agent entity, we can intuitively demonstrate the MDPs solution. The environment does not change over time. The problem can be finite or infinite according to the problem's description.

This chapter will first draw a motivation on why we use MDPs in modeling environments in planning problems. Then define the background of MDPs, possible approaches to solving MDPs, and in the end, we are going to discuss the advantages of using time-dependent MDPs.

The definitions, algorithms, and the construction of the following sections are based on \cite{Kolobov2012}. 


% \section{Environment}
% The MDP environment we briefly introduced above contains states, actions, transition probabilities, and costs for each action. However, we did not assign any rules to it! 
% Without these rules, one can not create any connection between states and actions or actions and transition probabilities.

% The environmental rules could range anywhere from the ones applied in the crossword puzzle game (fully observable, deterministic, sequential, static, discrete with a single agent) up to the ones applied in the taxi driving planning (partially observable, stochastic, sequential, dynamic, continuous with multiple agents). We will not describe the meaning of all specific rules in this work, as it is out of its scope. In case the reader is interested, we refer to read  \cite{russel2010}.


% For the sake of simplicity, let us assume the following properties of the given environment:
% \begin{description}


%   \item[$\bullet$ Fully observable] The agent has all the information needed about the environment.
%   \item[$\bullet$ Stochastic] The movement of the agent is not certain. The result of movement can be different from the desired one as a result of noise,
%   \item[$\bullet$ Sequential] The current action is going to have an impact on future actions,
%   \item[$\bullet$ Static] The environment can not change over time,
%   \item[$\bullet$ Discrete] The environment has a finite number of states and actions.
% \end{description}


\section{MDP and its features}
% With the environmental properties sorted out, let us move on to MDPs themselves. 
% \\ 
% \\

The Fully Observable Markov Decision Processes (MDPs) are tuples \emph{(S, A, T, R)} containing a set of environment states $S$, actions $A$ that the agent can execute in corresponding states, stochastic transition function $T$ modeling the relationships between origin states $s$, actions $a$ from them and the distributions of the destination states $s'$. Finally, the MDPs also contain the reward function $R$ that assigns reward (or penalty) to tuples \emph{(s, a, s')} - for transitions from $s$ to $s'$ when executing the action $a$ More formally:


\begin{definition}[\textbf{Infinite Horizon Fully Observable Markov Decision Process (MDP)}]\label{def:MDP}
A fully observable MDP is a tuple $(S, A, T, R)$, where:
\begin{description}
  \item[$\bullet$ ] $S$ is the finite set of all possible states of the environment, also called the state space;
  \item[$\bullet$ ] $A$ is the finite set of all actions an agent can take;
%   \item[$\bullet$ ] $D$ is an infinite sequence of the natural numbers of the form $(1, 2, 3, \ldots)$, denoting the decision epochs, also called time steps, at which agent needs to act;
  \item[$\bullet$ ] $T : S \times A \times S \rightarrow [ \,0, 1] \,$ is a transition function mapping the probability $T(s, a, s')$ of reaching the state $s'$ if action $a$ is executed when the agent is in state $s$;
  \item[$\bullet$ ] $R : S \times A \times S \rightarrow R$ is a reward function that gives a finite numeric reward value $R(s, a, s')$ obtained when the system goes from state $s$ to state $s'$ as a result of executing action $a$.
\end{description}
\end{definition}

However, the MDP agent does not have free will. It simply moves as it is ordered. The agent's orders come from a policy. The policy is a function mapping every state to some action. Based on the agent's state, the agent executes an action corresponding to a policy's mapping of the agent's current state. The resulting quality of every policy is then a reward, which the agent receives for following the policy's mappings. In our context, the best quality is the maximal one.

Naturally, we want the agent to move in such a way that it maximizes its reward. Such a policy that is better than any other policy is called optimal policy. To select the optimal policy, we have to evaluate all possible policies. The number of possible policies $|A|^{|S|}$ exponentially grows with the increasing MDP state space. Furthermore, we do not consider separate time steps and different possible histories that could lead to a given state \textit{s} in a time \textit{t}. The history is a sequence ($s_0, a_0, s_1, a_1, \ldots, s_n$) of consequent states and actions that ends up in the current state $s_n$ by following a policy $\pi$. Such a formulation would inevitably rise to an uncountable number of policies.

Policies that do not distinguish between different histories that end up in the same state simultaneously are called Markovian policies. This relaxation in terminology vastly reduces the number of policies possible to $|A|^{|S|}$. In the following text, we are using the Markovian Policies.


% To obtain the maximum possible reward in the MDP environment, the agent must execute the best action for every state the agents get to. To determine which action is optimal in each state, we have to evaluate the execution of an action in each state. Say by using simulations. After an infinite number of simulation executions ending in terminal states, we get a vast number of state and action sequences with the value associated with them. The yet unknown mechanism creates a mapping that strives to create a sequence with the best reward from these sequences. Such a mapping of states to actions is called policy. Every agent then operates in its MDP environment by simply following the optimal policy. 

% However, the number of suboptimal policies obtained while selecting the optimal one is overwhelming, reaching the size of higher-order polynomials. Thus, we are going to define them in a more relaxed way.

% The Markovian Policy does not distinguish between two different sequences ending in the same state simultaneously, assigning the same best action to both, vastly reducing the number of policies.


\begin{definition}[\textbf{Markovian Policy}]
A probabilistic (deterministic) history-dependent policy $\pi: H \times A \rightarrow [ \,0, 1] \,(\pi: H \rightarrow A)$ is \textit{Markovian} if for any two histories $h_{s,t}$ and $h'_{s,t}$, both of which end in the same state $s$ at the same time step $t$, and for any action $a$, 
$\pi(h_{s,t}, a) = \pi(h'_{s,t}, a)$ $(\pi(h_{s,t}) = a$ if and only if $\pi(h_{s,t}) = a)$.
\end{definition}

Other than a whole policy, we can also evaluate a given state $s$. The value of a given state $s$ is calculated using a value function. The value function $V^{\pi}(s)$ returns the expected reward the agent obtains when executing the actions given by a policy $\pi$ from a state $s$ that end up in a terminal state. The terminal state stands for a sink state that the agent can not leave. The value function corresponds to $V: S \rightarrow [ \,-\infty, \infty] \,$. In the following text, we will refer to the value function as $V(s, t)$ or $V(s)$.


% Every state in every time step is evaluated according to \textbf{value function}. A Markovian value function corresponds to $V: S \rightarrow [ \,-\infty, \infty] \,$. In the following text, we will refer to the value function as $V(s, t)$ or $V(s)$.

% The value function of a policy is a value corresponding to executing actions according to a policy, obtaining rewards along the way. \\ \\

\begin{definition}[\textbf{The Value Function of a Policy}]
Let $h_{s,t}$ be a sequence that terminates at state $s$ and time $t$. Let $R^{\pi_{h_{s,t}}}_{t'}$ be random variables for the amount of reward obtained in an MDP as a result of executing policy $\pi$ starting in state $s$ for all time steps $t'$ s.t. $t \leqslant t' \leqslant |D|$ if the MDP ended up in state $s$ at time $t$ via sequence $h_{s,t}$.The value function $V^{\pi}: H \rightarrow [ \,−\infty,\infty] \,$ \textit{of a sequence-dependent policy} $\pi$ is a utility function $u$ of the reward sequence $R^{\pi_{h_{s,t}}}_t, R^{\pi_{h_{s,t}}}_{t+1}, \ldots$ that one can accumulate by executing $\pi$ at time steps $t, t + 1,\ldots$  after sequence $h_{s,t}$. Mathematically, $V^{\pi} (h_{s,t}) = u(R^{\pi_{h_{s,t}}}_t , R^{\pi_{h_{s,t}}}_{t+1}, \ldots)$.
\end{definition}


Among the policies evaluated with the value function $V^{\pi}$ from an initial state, we will be able to find an \textbf{optimal MDP solution} (or optimal policy) denoted as $\pi^*$ with value $V^*$ called the optimal value function. Such optimal solution then satisfies $V^{*} (h) \leqslant V^{\pi} (h)$ for all sequences $h \in H$.

% Among the policies evaluated according to the previous rule, we will be able to find an \textbf{optimal MDP solution} (or optimal policy) denoted as $\pi^*$ with value $V^*$ called the optimal value function. Such optimal solution then satisfies $V^{*} (h) \leqslant V^{\pi} (h)$ for all sequences $h \in H$.

One possible approach to evaluating the value function is using the expected sum of rewards from executing actions given by the policy, called the Expected Linear Additive Utility.
\\
\begin{definition}[\textbf{Expected Linear Additive Utility}]
An \textit{expected linear additive utility} function is a function $u(R_t, R_{t+1}, \ldots) = E[ \,\sum_{t'=t}^{|D|} \gamma ^{t'−t} R_{t'}] \, = E[ \,\sum_{t'=0}^{|D|-t} \gamma^{t'} R_{t'+t}] \,$ that computes the utility of a
reward sequence as the expected sum of (possibly discounted) rewards in this sequence, where $\gamma \geqslant 0$
is the discount factor.
\end{definition}

This approach eliminates the possibility of multiple different solution values resulting from multiple stochastic runs of the same policy. It employs the expected value and does not need to perform vector equality comparison as its results are scalars.

It turns out that this approach is even better as it guarantees a fundamental property of MDPs called \textbf{The Optimality Principle}, according to which \textit{among the policies that we evaluated by the expected linear additive utility, there exists a policy that is optimal at every time step.}

\section{MDPs techniques}

This part will briefly introduce a few fundamental algorithms for MDPs solving: Brute-Force algorithm, Policy Iteration, Value Iteration, and in the end, we present a possible solution for its disadvantages: Prioritizations. \\
After reading this section, the reader should know the options for solving MDPs and be ready for the following section to discuss the finite horizon's advantages.

\subsection{Brute-Force Algorithm}
As its name suggests, the Brute-Force algorithm is the most naive algorithm, similar to all problem-solving classes. During the execution of the solver, the method evaluates all possible output combinations and chooses the best one.
Moreover, computer scientists are not using it in the vast majority of cases because the algorithm needs to evaluate everything. In MDPs, the number of policies evaluated is $|A|^{|S|}$.

However, as the number of states or actions rises, such evaluation becomes enormous. Another problem is that action outcomes in MDPs are not deterministic, and MDP structure is often cyclic. Both of these problems result in a steep complexity rise in terms of computational resources.

That is why another approach comes in.

\subsection{An iterative approach to policy evaluation}

The evaluation of cyclic environments requires a suitable equation that captures all transitions and rewards. That means that every state's value function should correspond to the sum of rewards for moving towards successor states. Let us state the formula as follows:

% Moreover, the solver should output such a policy, whose value function for a given state maximizes its reward from future states. Thus, the successor state's value function multiplied by its probability should also appear in the equation. 

\begin{equation}
\begin{aligned}
V^{\pi} (s) = \sum_{s' \in S} T(s, \pi (s), s') [ \,R(s, \pi(s), s') + V^{\pi} (s')]
\end{aligned}
\end{equation}


% \begin{equation}
% \begin{aligned}
% V^{\pi} (s) & = 0 && \text{(if $s \in G$)} \\
% & = sum_{s' \in S} T(s, \pi (s), s') [ \,R(s, \pi(s), s') + V^{\pi} (s')] \, && \text{(otherwise)}
% \end{aligned}
% \end{equation}

For all states, the formula above forms a system of linear equations. This system of linear equations reduces the complexity of the previous approach. It \textit{can} be solved using Gaussian Elimination in $O(|S|^3)$. It is still not efficient enough, but we use the idea to find a solution with \textbf{an iterative approach}.

The iterative approach starts with a rough estimation and continuously works its way up to the asymptotically same solution as the linear system. The algorithm stops when the maximum gap between two consequent value functions of all states is lesser than a given value, called residual. This solution is optimal \cite{Kolobov2012}. As it stores the previous iteration, it is a part of dynamic programming. Every iteration of this algorithm runs in $O(|S|^2)$.

\LinesNumbered
\begin{algorithm}[ht]
\SetAlgoLined
//Assumption: $\pi$ is proper (ends up in a goal state) \\
initialize $V^{\pi}_0$ arbitrarily for each state \\
$n \xleftarrow{} 0$ \\
\Repeat{$ max_{s\in S} residual_n (s) \leq \epsilon$}{
$n \xleftarrow{} n + 1$ \\
\ForEach{$ s \in S $}{
compute $V^{\pi}_n (s) \xleftarrow{} \sum_{s' \in S} T (s, \pi (s), s') [ \,R(s, \pi (s), s') + V^{\pi}_{n−1} (s') ] \,$ \\
compute $residual_n (s) \xleftarrow{} |V^{\pi}_n (s) − V^{\pi}_{n−1} (s)| $ \\
}
}
return $V^{\pi}_n$
\caption{Iterative Policy Evaluation}
\end{algorithm}

We know how to evaluate the environment's state given some policy, but we have not yet described how to choose the best policy for a given evaluation. 

\subsection{Policy iteration}
Given that the iterative policy evaluation is optimal after an undefined number of iterations for a specific policy, we can further improve it by iterating policy evaluation and policy improvement.
For every iteration, we have \textit{almost} optimal policy evaluation for the previous policy. We can improve the policy by iterating over possible actions from all states and choosing that action if it has a greater reward than the previous one. The algorithm stops when the policy remains without a change after the following iteration. This algorithm is called Policy Iteration \ref{alg:PI}.

\LinesNumbered
\begin{algorithm}[ht]
\SetAlgoLined
initialize $\pi_0$ to be an arbitrary policy ending in the goal state\\
$n \xleftarrow{} 0$ \\
\Repeat{$\pi_n == \pi_{n−1}$ }{
    $n \xleftarrow{} n + 1$ \\
    Policy Evaluation: compute $V^{\pi_{n−1}}$ \\
    Policy Improvement: \\
    \ForEach{state $s \in S$}{
        $ \pi_n (s) \xleftarrow{} \pi_{n−1} (s) $ \\
        $ \forall a \in A$ compute $Q^{(V^{\pi_{n−1}})} (s, a) $ \\
        $ V_n (s) \xleftarrow{} max_{a \in A} Q^{(V^{\pi_{n−1}})} (s, a) $ \\
        \uIf{$ Q^{(V^{\pi_{n−1}})} (s, \pi_{n−1} (s)) > V_n (s)$}{
            $ \pi_n (s) \xleftarrow{} argmax_{a \in A} Q^{(V^{\pi_{n−1}})} (s, a)$
        }
    \textbf{end}
    }
}
return $\pi_n$
\caption{Policy Iteration}
\label{alg:PI}
\end{algorithm}

Until now, whenever we talked about policy iteration, we mentioned that the initial policy has to end in the goal state. If the policy ends up in the goal states, it converges significantly fast. On the other hand, if we do not meet the initial condition of policy ending in the terminal state, the policy evaluation step will diverge. Nevertheless, another algorithm can solve this drawback.


\subsection{Value iteration} \label{VI}
Value iteration focuses on improving the value function of all states instead of evaluating the policy. The algorithm creates a policy at its very end to maximize the reward in a given state. Thanks to this property, the initial policies are not used in the algorithm and can not lead to divergence.

The algorithm draws on from \textbf{Bellman equations}, which mathematically express the optimal solution of an MDP.

\begin{equation}
\begin{aligned}
V^* (s) & = max_{a \in A} Q^* (s, a) \\ 
Q^* (s, a) & = \sum_{s' \in S} T(s, a, s')[ \,R(s, a, s') + V^*(s')]  \\
\end{aligned}
\end{equation}

% \begin{equation}
% \begin{aligned}
% V^* (s) & = 0 && \text{(if $s \in G$)} \\
% & = max_{a \in A} Q^* (s, a) &&  (s \notin G) \\ 
% Q^* (s, a) & = \sum_{s' \in S} T(s, a, s')[ \,R(s, a, s') + V^*(s')] \, \\
% \end{aligned}
% \end{equation}

The equation's interpretation is maximizing the optimal value function of a given state. It can have a value of 0 if the state is among the goal states or a value-maximizing the result of performing a given action and then following the optimal policy if the state is not.

To approximate $V^* (s)$, the algorithm uses the so-called \textbf{Bellman backup} to improve the current $V_n (s)$ with $V_{n-1} (s')$.


\begin{equation}
\begin{aligned}
V_n (s) \xleftarrow{} max_{a \in A} \sum_{s' \in S} T(s, a, s') [ \,R(s, a, s') + V_{n - 1} (s')] \,
\end{aligned}
\end{equation}

% \begin{equation}
% \begin{aligned}
% V_n (s) \xleftarrow{} min_{a \in A} \sum_{s' \in S} T(s, a, s') [ \,R(s, a, s') + V_{n - 1} (s')] \,
% \end{aligned}
% \end{equation}

Value iteration is iteratively improving its value function estimation and is guaranteed to converge to the optimal solution \cite{Kolobov2012}. Its stopping criterion is either the maximum residual, the maximal difference between consecutive state value function estimations, or the number of iterations.

\LinesNumbered
\begin{algorithm}
\SetAlgoLined
initialize $V_0$ arbitrarily for each state \\
$n \xleftarrow{} 0$ \\ 
\Repeat{ $max_{s \in S} \: residual_n $ (s) < $ \epsilon $ }{
    $n \xleftarrow{} n + 1$ \\
    \ForEach{$ s \in S $ }{
        compute $V_n (s)$ using Bellman backup at $s$ \\ 
        compute $residual_n (s) = |V_n (s) - V_{n-1} (s)|$
        }
    }
return greedy policy: $\pi^{V_n} (s) = argmax_{a \in A} \sum_{s' \in S} \tau(s, a, s')[ \,C(s, a, s') + V_n (s')] \,$
\caption{Value Iteration}
\end{algorithm}


\subsection{Prioritization}
One of the most significant drawbacks of Value Iteration is that it requires full sweeps of the state space. This drawback results in many unnecessary value function evaluations because its value remains the same as the value function updates are heading from the goals state at first. It can take lots of iterations to evaluate all successor state's value functions. Furthermore, due to cyclic moves, the speed of convergence among states can differ.

Both of these problems often result in flawed performing algorithm execution.

Logical improvement is to, instead of iterating the whole state space, iterate over each state separately. Iterating over each separate states has two consequences:
\begin{enumerate}
    \item We have to develop or choose an algorithm that prioritizes states and 
    \item does not starve some of them (meaning every state gets updated accordingly).
\end{enumerate}

Furthermore, iterating over the separate states means that we can no longer use the number of iterations as a termination condition because we do not know which states' value functions and how often they were updated. The only terminal condition remains maximum residual.

\cite{Kolobov2012} formalizes the intuition in Algorithm \ref{alg:prior}:

\LinesNumbered
\begin{algorithm}[!ht]
\SetAlgoLined
initialize $V$ \\
initialize priority queue $q$ \\
\Repeat{ termination }{
    select state $s' = q.pop()$ \\
    compute V(s') using a Bellman backup at s'\\
    \ForEach{predecessor s of s', i.e. $\{s|\exists a [\, T(s, a, s') > 0] \,\}$}{
        compute $priority(s)$\\
        $q.push(s, priority(s))$\\
        }
    }
return greedy policy $\pi^{V}$
\caption{Prioritized Value Iteration}
\label{alg:prior}
\end{algorithm}

We will introduce a few priority metrics whose features generally differ and may diverge under specific conditions. For details, head over to \cite{Kolobov2012}.


\subsubsection{Prioritized Sweeping} 

Prioritized sweeping estimates the expected change in the value of a state if a backup were to be performed on it now.

\nopagebreak

\begin{equation}priority_{PS} (s) \xleftarrow{} max \Big\{ priority_{PS} (s), max_{a \in A} \big\{ T(s, a, s') Res^{V}(s')\big\} \Big\} \end{equation}


\subsubsection{Improved Prioritized Sweeping}

Improved Prioritized Sweeping employs the idea that states with fast converging value functions are to be prioritized. The consequence is that the algorithm first iterates the states near the goal and moves to other states only after the change between state iterations becomes small.

\nopagebreak

\begin{equation}priority_{IPS} (s) \xleftarrow{} \frac{Res^{V} (s)}{V (s)} \end{equation}

\subsubsection{Focused Dynamic Programming}

Focused Dynamic Programming is a particular case of prioritization techniques used when the start state $s_0$ is known. In such cases, the start case's knowledge can be employed and added as a penalty factor.


\begin{equation}priority_{FDP} \xleftarrow{} h(s) + V(s)\end{equation}


where:
\begin{description}[1cm]
  \item[h(s)] \textit{is lower bound on the expected reward for reaching s from $s_0$}, and
  \item[V(s)] is regular value function for given state s.
\end{description}

In the previous sections, we have very briefly introduced the notion of MDPs and their solving methods. First, we showed the Brute-Force algorithm, which naively evaluated all possible policies. Then we discussed the value and policy iterations that improved the performance but still did lots of useless evaluations. And finally, we presented a solution that intelligently iterates through prioritized states. 

\section{Finite-Horizon MDP} \label{FHMDP}

With all the necessary background layed out, Let us present the Finite-Horizon MDP.\\ \\

\begin{definition}[\textbf{Finite-Horizon MDP}]
A finite-horizon MDP is an MDP as described in ~\ref{def:MDP} with a finite number of time steps. %, i.e., with $|D| = T_{max} < \infty$.
\end{definition}


The Infinite-Horizon MDPs discussed so far are a slightly different class of problems from Finite-Horizon MDPs. However, the solution of Finite-Horizon is somewhat similar to the prioritized value iteration of Infinite-Horizon MDPs. 
Moreover, each Infinite-Horizon MDP can be transformed into Finite-Horizon one.

This is critical because the Finite-Horizon MDP has an acyclic state space, and \textbf{the acyclic MDPs can be solved optimally using only one backup if used optimal backup order.} \cite{Kolobov2012}

 On the other hand, the transformation to Finite-Horizon MDP %polynomially increases the state space by copying a whole former MDP into a single stage of Finite Horizon MDP. This way, the memory consumption linearly increases with the number of stages. 

Furthermore, the transformation to Finite Horizon MDP does not necessarily mean that the result will optimal or even the same. With a finite horizon, the finite horizon agents focus on getting rewards obtainable in a given time span. The Infinite Horizon MDPs do not have such a constraint and can focus on getting a bigger reward later on.

As the finite horizon number is known, we can evaluate all states' value functions from maximum horizon time and work our way down to starting time. This way, we evaluate each state's successor's value functions, employing optimal backup order and finding the optimal policy on the first pass.

In conclusion, in cases where the Infinite-Horizon methods are not sufficient, the MDPs can be transformed into Finite-Horizon ones. However, while solving the MDP in one pass, this transformation also blows up the state space.