\chapter{Finite Horizon POMDPs}
In contrast to Finite Horizon MDPs, Finite Horizon POMDPs do not offer a one-pass optimal solution. Furthermore, they impose additional memory and performance requirements.
These additional requirements do not mean that the concept is flawed. In the following chapter, we explain why Finite Horizon POMDPs are actually beneficial, why we need to use appropriate methods, or develop new ones, designed for finite horizon problems instead of reusing infinite horizon methods, and we introduce the state-of-the-art algorithm for solving Finite Horizon POMDPs. Besides, we introduce possible heuristics that the algorithm can employ.
In the following chapter, we draw mainly on \cite{Walraven19}.
\section{Finite Horizon Problems}
Real-world problems range from agents that operate non-stop to agents whose operation is time-limited: an elevator focusing on short-term rewards from transporting personnel illustrates the former, while an electric vehicle charging provider planning against day-to-day forecasts illustrates the latter.
Finite Horizon POMDPs are defined in line with Definition \ref{def:POMDP}, but their time span is limited by the horizon $h$ (also denoted $t_{max}$). We call each specific time moment a \textit{stage}. Distinguishing between stages opens up the possibility of different state spaces in different stages. Note that the transition function is defined as a mapping from state $s$ in stage $t$ and action $a$ to state $s'$ in stage $t+1$. The observation function $\Omega$, observation set $O$, and reward function $R$ are typically extended with the stage information as well.
In this chapter, we consider the discount factor $\gamma = 1$, as we focus on problems that do not include discounting. In such cases, we want to obtain the policy that maximizes the expected sum of rewards:
\begin{equation}
E \left[ \sum_{t=1}^{h}r_t \right],
\end{equation}
where $t=1$ is the first time step and $h$ is the horizon, meaning there are $h$ steps in total.
In finite horizon problems, we store a separate value function $V^{\pi}(t, b)$ and policy $\pi(t, b)$ for each time step, mapping both time and belief to the optimal action and its value. The value function gives the expected sum of rewards the agent obtains when following the policy from stage $t$ and belief $b$ up to stage $h$. We define it recursively as follows:
% TODO: \right] refuses to render
\begin{equation}
V^{\pi^*}(t, b) = \begin{cases} \begin{aligned}
\operatorname{max}_{a \in A} \Bigg[ \sum_{s \in S} R(s, a)b(s) + \sum_{o \in O} & P(o|b, a) V^{\pi ^ *}(t+1, b^a_o) \Bigg] && t \leq h \\
& 0 && t > h \\
\end{aligned} \end{cases}
,
\end{equation}
where $b^a_o$ is the belief updated with observation $o$ obtained after executing action $a$, as defined in Eq. \ref{eq:bao}.
The optimal policy ${\pi}^*$ corresponding to the optimal value function is defined as:
\begin{equation}
{\pi}^*(t, b) = \operatorname{argmax}_{a \in A} \left[ {\sum_{s \in S} R(s, a)b(s) + \sum_{o \in O} P(o|b, a) V^{\pi ^ *}(t+1, b^a_o) }\right].
\end{equation}
For time steps $1$ to $h$, the policy returns the optimal action corresponding to the optimal value function at time $t$ and belief $b$.
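To make the recursion concrete, the following Python sketch evaluates the optimal value by direct enumeration of action-observation sequences. It is exponential in $h$ and serves only to illustrate the equations above; the array layout (dense transition, observation, and reward arrays indexed by action) is an illustrative assumption, not notation taken from \cite{Walraven19}.
\begin{verbatim}
import numpy as np

def belief_update(b, a, o, P_trans, P_obs):
    """Return (b^a_o, P(o|b,a)).

    b            : belief over states, shape (S,)
    P_trans[a]   : array with P_trans[a][s, s'] = P(s'|s,a)
    P_obs[a]     : array with P_obs[a][s', o]   = P(o|a,s')
    """
    unnorm = P_obs[a][:, o] * (b @ P_trans[a])   # over next states s'
    p_o = float(unnorm.sum())
    return (unnorm / p_o if p_o > 0 else unnorm), p_o

def value(t, b, h, R, P_trans, P_obs):
    """Optimal finite-horizon value V*(t, b) with gamma = 1 (naive enumeration)."""
    if t > h:
        return 0.0
    best = -np.inf
    for a in range(len(R)):                      # R[a][s] = R(s, a)
        v = float(R[a] @ b)                      # expected immediate reward
        for o in range(P_obs[a].shape[1]):
            b_ao, p_o = belief_update(b, a, o, P_trans, P_obs)
            if p_o > 0:
                v += p_o * value(t + 1, b_ao, h, R, P_trans, P_obs)
        best = max(best, v)
    return best
\end{verbatim}
The corresponding policy ${\pi}^*(t, b)$ is obtained by returning the maximizing action instead of the maximal value.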
\section{Limitations of POMDP Algorithms in Finite Horizon Settings}
Existing POMDP solvers work with discounted rewards, but it turns out that they cannot easily be generalized to the finite horizon setting without discounting. This section discusses the approaches that could be leveraged for solving Finite Horizon POMDPs and why the existing solvers cannot operate with the discount factor $\gamma = 1$.
\subsection{Solution Strategies for Finite Horizon Problems}
The first approach solves the POMDP as if it were a fully observable MDP. This reformulation is possible because a Finite Horizon POMDP has a finite number of reachable beliefs, so we can treat each belief as an MDP state and apply MDP solvers. This approach, however, creates up to $(|A||O|)^h$ beliefs and often becomes intractable. For instance, with $|A| = 3$, $|O| = 3$, and $h = 10$, up to $9^{10} \approx 3.5 \times 10^9$ beliefs may be reachable.
The second approach solves the Finite Horizon POMDP as if it were an infinite horizon one. In this approach, we can safely assume discounting. However, because the infinite horizon is assumed incorrectly, the policy has no information that its execution terminates after a finite number of steps, so it may plan for reward that would only be obtainable after the horizon has passed. The second disadvantage is that, due to discounting, the policy favors early rewards even if a larger reward is available later on; the larger reward is discounted so heavily that its value drops below that of the small early reward, so the agent never collects it.
The third approach makes the Finite Horizon POMDP states reachable in all stages and adds a terminal state after the horizon stage, resulting in $|S| \times h + 1$ states. This approach is inefficient, as it increases the number of beliefs and the size of other solution representations such as $\alpha$-vectors, resulting in an undesired increase in both runtime and memory requirements.
% TODO: further approaches could be added here
\subsection{Discarding the Discount Factor in Infinite Horizon Algorithms}
In the text above, we discussed possible approaches to solving Finite Horizon POMDPs without discounting and why they are often intractable or have other undesirable effects. In the following lines, we describe why the solvers introduced in the section on Infinite Horizon POMDPs cannot be used in the finite horizon setting with discount factor $\gamma = 1$. We also discuss their optimality and whether they compute bounds on their results. Optimality here refers to the policy starting from a known initial belief; the results may be suboptimal when starting from a belief other than the initial one. We start with the basic solvers and work our way up to the most advanced heuristics presented in this work.
Exact Value Iteration can be used with the discount factor $\gamma = 1$. It computes the optimal solution for any initial belief. However, it is intractable for problems with large state spaces.
Point-Based Value Iteration does not suffer from computing the value function over the whole belief space. It incrementally expands its belief set, and its backups can be evaluated without the discount factor. Unfortunately, the worst-case error bound assumes a discount $\gamma < 1$, which conflicts with our requirements. Moreover, the number of beliefs in PBVI can still become large, and its successors have shown better results than the original algorithm.
The Perseus algorithm also requires its lower bound to be initialized with $\gamma < 1$. It does not keep track of an upper bound. Moreover, it is randomized and, in particular, does not provide any guarantee on performance or optimality.
Both SARSOP and HSVI incrementally expand their set of belief points, and their backup and upper bound updates are well-defined for $\gamma = 1$. They produce an upper bound on the optimal value function and thus guarantee optimality in the limit. However, their lower bound initialization, as in the preceding algorithm, requires a discount $\gamma < 1$. Thus, to employ either algorithm in the finite horizon setting, we need to adapt them.
All algorithms presented in the previous chapter as infinite horizon solvers lack at least one of the three required properties. The algorithm presented in the next section meets all of them: it draws on the well-performing components of the point-based solvers discussed above, converges to optimality, computes both bounds, and does not require the problem to contain discounting.
\section{FiVI: Finite Horizon Point-Based Value Iteration}
% TODO: write the introduction
TBC
\subsection{Time-Dependent Value Functions and Backups}
The standard way of computing value functions in point-based value iteration algorithms is to run a value function update similar to the one defined by Eqs. \ref{eq:5}, \ref{eq:6} and \ref{eq:7}. In FiVI, we have to extend the value update to the non-stationary, staged setting. The algorithm utilizes staged value functions $V_t$ ranging from stage 1 to the horizon stage. Each value function $V_t$ is represented by a set of $\alpha$-vectors $\Gamma_t$.
The vectors constituting a value function are defined by a time-dependent backup function:
\begin{equation}
\Gamma_t = \bigcup_{(b, \bar{v}) \in B_t} backup(b, t)
\end{equation}
FiVI needs to keep track of the beliefs' upper bounds $\bar{v}$ to converge. The $backup(b, t)$ operator exploits the knowledge of the next-stage value function $\Gamma_{t + 1}$ to compute the $\alpha$-vectors of $\Gamma_t$. FiVI defines the time-dependent $backup(b, t)$ operator as follows:
\begin{equation} backup(b, t) = \operatorname*{argmax}_{\{z_{b, a, t}\}_{a\in A}} b \cdot z_{b, a, t}, \end{equation}
where
\begin{equation} z_{b, a, t} = \begin{cases} \begin{aligned}
r_a + \sum_{o \in \Omega} \operatorname*{argmax}_{\{z^{k, t+1}_{a, o}\}_k}&\ b \cdot z^{k, t+1}_{a, o} && t < h \\
&r_a && t = h \\
\end{aligned}\end{cases},\end{equation}
and $z^{k,t}_{a,o}$ denotes the back-projection of vector $\alpha^{k,t} \in \Gamma_t$:
\begin{equation} z^{k,t}_{a,o}(s) = \sum_{s' \in S} P(o|a, s') P(s'| s, a) \alpha^{k, t} (s')\ \forall s.\end{equation}
The vector $r_a$ represents the immediate reward vector for action $a$. We also assume that the backup has access to all vector sets and the reward vectors $r_a$.
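A minimal Python sketch of this time-dependent backup, under the same illustrative array conventions as before (dense transition, observation, and reward arrays indexed by action); \verb|Gamma_next| stands for $\Gamma_{t+1}$:
\begin{verbatim}
import numpy as np

def backup(b, t, h, Gamma_next, r, P_trans, P_obs):
    """Point backup at belief b and stage t; returns the new alpha-vector."""
    best_vec, best_val = None, -np.inf
    for a in range(len(r)):
        z = r[a].astype(float).copy()            # immediate reward vector r_a
        if t < h:
            for o in range(P_obs[a].shape[1]):
                # back-projections z^{k,t+1}_{a,o}(s) =
                #   sum_{s'} P(o|a,s') P(s'|s,a) alpha_k(s')
                proj = [P_trans[a] @ (P_obs[a][:, o] * alpha)
                        for alpha in Gamma_next]
                z += max(proj, key=lambda vec: float(b @ vec))
        if float(b @ z) > best_val:
            best_vec, best_val = z, float(b @ z)
    return best_vec
\end{verbatim}
The set $\Gamma_t$ is then the union of these vectors over all pairs $(b, \bar{v}) \in B_t$, as in the equation above.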
\subsection{Time-Dependent Value Upper Bounds and Bound Updates}
Computing the upper bound, typical of point-based value iteration algorithms, enables assessing the solution quality. The FiVI algorithm stores the upper bound value $\bar{v}$ in the pairs $(b, \bar{v})$ kept in the belief sets \textit{B}$_t$. In general, the upper bound values $\bar{v}$ are computed by interpolation based on linear programming. However, this approach is computationally expensive, which is why the sawtooth approximation was introduced in \cite{Hauskrecht_2000}.
\LinesNumbered
\begin{algorithm}[H]
\SetKwInOut{Input}{Input}
\SetKwInOut{Output}{Output}
\Input{belief $b'$, set $B$ containing belief-bound pairs}
\Output{upper bound corresponding to belief $b'$}
\SetAlgoLined
\For{$(b, \bar{v}) \in B \setminus \{(e_s, \cdot)\ |\ s \in S\}$}{
$f(b) \xleftarrow{} \bar{v} - \sum_{s \in S} b(s) B(e_s) $ \\
$c(b) \xleftarrow{} \operatorname{min}_{s \in S} b'(s)\ /\ b(s) $ \\
}
$ b^* \xleftarrow{} \operatorname{argmin}_{\{b\ |\ (b, \bar{v}) \in B \setminus \{(e_s, \cdot)\ |\ s \in S\}\}} c(b) f(b) $ \\
\textbf{return} $c(b^*) f(b^*) + \sum_{s \in S} b'(s) B(e_s) $ \\
\caption{Sawtooth approximation (UB)}
\end{algorithm}
The sawtooth algorithm simplifies the upper bound interpolation by expressing the queried belief as a combination of a single belief point $b$, weighted by $c(b)$, and the corner beliefs, instead of solving a linear program over all belief points. The corner belief $e_s$ denotes the belief with a deterministic distribution concentrated on state $s$. The algorithm takes a belief $b'$ and a belief set $B$ and returns the upper bound interpolation for $b'$. The notation $B(e_s)$ denotes the upper bound of $e_s$ in the belief set $B$.
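As a reference, a hedged Python sketch of this interpolation is given below; \verb|corner_ub| holds the bounds $B(e_s)$ of the corner beliefs, and the minimum in $c(b)$ is taken only over states with $b(s) > 0$, which the pseudocode leaves implicit:
\begin{verbatim}
import numpy as np

def sawtooth_ub(b_new, B_pairs, corner_ub):
    """Sawtooth upper-bound interpolation UB(b', B).

    b_new     : query belief b', shape (S,)
    B_pairs   : list of (belief, upper bound) pairs, corner beliefs excluded
    corner_ub : corner_ub[s] = stored upper bound B(e_s) of corner belief e_s
    """
    base = float(b_new @ corner_ub)              # sum_s b'(s) B(e_s)
    best = np.inf
    for b, v_bar in B_pairs:
        f = v_bar - float(b @ corner_ub)         # f(b)
        mask = b > 0
        c = float(np.min(b_new[mask] / b[mask])) # c(b)
        best = min(best, c * f)                  # track the minimum of c(b) f(b)
    return base if not np.isfinite(best) else base + best
\end{verbatim}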
In the finite horizon setting, we update the upper bound value of a belief $b$ at time $t < h$ as follows:
\begin{equation}
\operatorname{max}_{a \in A} \left[ \sum_{s \in S} R(s, a)b(s) + \sum_{o \in O} P(o|b, a)\, UB(b^a_o, B_{t+1}) \right],
\end{equation}
where we interpolate based on the belief set $B_{t+1}$ from the next stage. In the horizon stage, we compute only $\operatorname{max}_{a \in A} \sum_{s \in S} R(s, a)b(s)$, as the upper bound value beyond the horizon is zero.
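Under the same illustrative conventions, and reusing the \verb|sawtooth_ub| sketch above, the staged upper bound update can be sketched as follows (it mirrors the inner loop of the full algorithm below):
\begin{verbatim}
def update_upper_bound(b, t, h, B_next, corner_ub_next, r, P_trans, P_obs):
    """Upper bound of belief b at stage t, interpolating in B_{t+1}."""
    best = -np.inf
    for a in range(len(r)):
        v = float(r[a] @ b)                      # sum_s R(s,a) b(s)
        if t < h:                                # beyond the horizon the value is zero
            for o in range(P_obs[a].shape[1]):
                unnorm = P_obs[a][:, o] * (b @ P_trans[a])
                p_o = float(unnorm.sum())        # P(o|b,a)
                if p_o > 0:
                    v += p_o * sawtooth_ub(unnorm / p_o, B_next, corner_ub_next)
        best = max(best, v)
    return best
\end{verbatim}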
\subsection{Algorithm Description of FiVI}
The FiVI algorithm is an iterative algorithm that takes a POMDP, a required precision, and a time limit as inputs and outputs the vector of $\alpha$-vector sets $\Gamma_t$ together with the upper bound of the initial belief.
At the very beginning, the algorithm initializes its model representations: the belief sets, the $\alpha$-vector sets, the reward vector of each action, and the variables storing the number of iterations and the time elapsed (lines 1 - 5).
Lines 6 - 34 execute the body of the algorithm. The do-while loop consists of three main parts: belief expansion, belief backups, and belief upper bound updates.
The belief expansion on line 8 finds new beliefs for the sets $B_t$ in order to shrink the gap between the upper and the lower bound.
Lines 9 - 29 iterate backwards over the stages, computing the belief backups $\Gamma_t \xleftarrow{} \bigcup_{(b, \bar{v}) \in B_t} backup(b, t)$ on lines 11 - 14 and the belief upper bound values on lines 15 - 27 of stage $t$ using the model representations of stage $t + 1$.
An iteration ends with a check of the termination conditions, based on the maximum allowed gap $g_a$ and the time limit $\tau$. The difference between the upper bound $v_u$ of the initial belief $b_1$ and the lower bound $v_l$ is the current gap. The algorithm stops if the gap is at most one unit at the $\rho$-th significant digit or if the time limit $\tau$ has been exceeded.
The FiVI algorithm returns the vector of $\alpha$-vector sets $\{\Gamma_1, \dots, \Gamma_h\}$, representing the lower bound, and the upper bound $v_u$ of the initial belief $b_1$. The gap implicitly defines a guarantee on the quality of the computed solution \cite{Walraven19}.
% \newpage
\LinesNumbered
\begin{algorithm}[H]
\SetKwInOut{Input}{Input}
\SetKwInOut{Output}{Output}
\Input{POMDP M, precision $\rho$, time limit $\tau$}
\Output{sets $\Gamma_t$ for each time step $t$, upper bound $v_u$}
\SetAlgoLined
$\Gamma_t \xleftarrow{} \emptyset\ \forall t $ \\
$\textit{B}_t \xleftarrow{} \emptyset\ \forall t $ \\
$ r_a \xleftarrow{} (R(s_1, a), R(s_2, a), \dots, R(s_{|S|}, a))\ \forall a $ \\
add corner beliefs to \textit{B}$_t$ with upper bound $\infty\ (\forall t) $ \\
$ \tau' \xleftarrow{} 0, \delta \xleftarrow{} 0 $ \\
\SetKwRepeat{Do}{do}{while}
\Do{$\tau' < \tau \wedge v_u - v_l > g_a$}{
$\delta \xleftarrow{} \delta + 1$ \\
expand(M, $\{\Gamma_1, \dots, \Gamma_h\}, \{B_1, \dots, B_h \}$, r) \\
\For{t = h, h - 1, \dots, 1}{
$\Gamma_t \xleftarrow{} \emptyset$ \\
\For{$(b, \bar{v}) \in B_t$}{
$\alpha \xleftarrow{} backup(b, t) $ \\
$ \Gamma_t \xleftarrow{} \Gamma_t \cup \{\alpha\} $ \\
}
\For{$(b, \bar{v}) \in B_t$}{
% $\bar{v} \xleftarrow{} Update Upper Bound$\\
$\bar{v} \xleftarrow{} -\infty$ \\
\For{$a \in A$}{
$v \xleftarrow{} r_a \cdot b$ \\
\If{t < h}{
\For{$o \in O$}{
\If{$P(o|b, a) > 0$}{
$ v \xleftarrow{} v + P(o|b, a) \cdot $ UB$(b^a_o, B_{t+1}) $
}
}
}
$ \bar{v} \xleftarrow{} \operatorname{max}(\bar{v}, v) $ \\
}
}
}
$ v_l \xleftarrow{} \operatorname{max}_{\alpha \in \Gamma_1} \alpha \cdot b_1$ \\
$ v_u \xleftarrow{} $ upper bound $\bar{v}$ associated with $(b_1, \bar{v}) \in B_1 $ \\
$ g_a \xleftarrow{} 10^{\lceil \operatorname{log}_{10}(\operatorname{max}(|v_l|,|v_u|))\rceil - \rho}$ \\
$ \tau' \xleftarrow{} $ elapsed time after the start of the algorithm \\
}
\textbf{return} ($\{\Gamma_1, \dots, \Gamma_h\}, v_u$) \\
\caption{Finite Horizon Point-Based Value Iteration (FiVI)}
\end{algorithm}
\subsection{Belief Points and Convergence of the Algorithm}
The optimality and convergence speed of FiVI, and of point-based algorithms in general, depend heavily on the quality of the expanded beliefs on which the backups are executed. The algorithm can go in two directions: either select new beliefs randomly, or steer the new beliefs towards the regions of the belief space that the optimal policy would visit as well. However, we do not know the optimal policy beforehand.
The FiVI algorithm therefore uses a heuristic procedure similar to HSVI, SARSOP, or GapMIN to explore the promising areas.
The algorithm chooses the action and observation that reduce the gap as much as possible, effectively reducing the overall gap. We motivate this by the term $U(t, b, a) - V(t, b, a)$, which expresses the difference between the potential upper bound $U(t, b, a)$ of belief $b$ in stage $t$ after executing action $a$ and its expected value $V(t, b, a)$. Over the iterations, the value function is maximized and the upper bound minimized, which reduces the resulting gap and improves the quality of the result.
The expected value function for belief $b$ in stage $t$ after executing the action $a$ is:
\begin{equation}
V(t, b, a) = \sum_{s \in S} R(s, a) b(s) + \sum_{o \in O} P(o|b, a) V(t + 1, b^a_o),
\end{equation}
and the potential upper bound:
\begin{equation}
U(t, b, a) = \sum_{s \in S} R(s, a) b(s) + \sum_{o \in O} P(o|b, a) UB(b^a_o, B_{t+1}).
\end{equation}
After choosing the best action $a$, the algorithm chooses the observation leading to the belief with the maximum gap in stage $t + 1$:
\begin{equation}
\operatorname{argmax}_{\{o \in O\ |\ P(o|b, a) > 0\}} \left\{ UB(b^a_o, B_{t + 1}) - \operatorname{max}_{\alpha \in \Gamma_{t + 1}} \alpha \cdot b^a_o \right\}.
\end{equation}
All expanded beliefs with stage $t > 1$ contribute to the final bounds and the gap associated with the initial belief $b_1$.
The expand procedure expands at most one belief for each stage $t$; the resulting belief is added to the next-stage belief set $B_{t + 1}$, as it is reached from the stage-$t$ belief $b$. The procedure follows the rules outlined above, starting from the initial belief. We do not have to consider the stage $t = h$, as we are not interested in beliefs beyond the horizon.
\LinesNumbered
\begin{algorithm}[H]
\SetKwInOut{Input}{Input}
\Input{M, $\{\Gamma_1, \dots, \Gamma_h\}, \{B_1, \dots, B_h \}$, r}
\SetAlgoLined
$ b \xleftarrow{} b_1$ \\
\For{t = 1, 2, \dots, h - 1}{
$ a \xleftarrow{} \operatorname{argmax}_{a \in A} \{r_a \cdot b + \sum_{\{o \in O |\ P(o|b, a) > 0\}} P(o| b, a) \cdot$ UB$(b^a_o, B_{t+1})\}$ \\
$ o \xleftarrow{} \operatorname{argmax}_{\{o \in O\ |\ P(o|b, a) > 0\}} \{$ UB$(b^a_o, B_{t+1}) - \operatorname{max}_{\alpha \in \Gamma_{t+1}} \alpha \cdot b^a_o \}$ \\
$ B_{t+1} \xleftarrow{} B_{t+1} \cup \{(b^a_o, \infty)\} $ \\
$ b \xleftarrow{} b^a_o $\\
}
\caption{Belief expansion (expand)}
\end{algorithm}
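For completeness, a Python sketch of the expansion heuristic under the same illustrative conventions, reusing the \verb|sawtooth_ub| helper; \verb|Gamma|, \verb|B| and \verb|corner_ub| are indexed by stage, and a new belief is stored with upper bound $\infty$ exactly as in the pseudocode:
\begin{verbatim}
import numpy as np

def expand(b1, h, Gamma, B, corner_ub, r, P_trans, P_obs):
    """Forward belief expansion guided by the gap heuristic."""
    b = b1
    for t in range(1, h):                        # stage h is not expanded
        # successor beliefs b^a_o with their probabilities P(o|b,a)
        succ = {}
        for a in range(len(r)):
            for o in range(P_obs[a].shape[1]):
                unnorm = P_obs[a][:, o] * (b @ P_trans[a])
                p_o = float(unnorm.sum())
                if p_o > 0:
                    succ[(a, o)] = (unnorm / p_o, p_o)

        def potential_ub(a):                     # U(t, b, a)
            return float(r[a] @ b) + sum(
                p * sawtooth_ub(ba, B[t + 1], corner_ub[t + 1])
                for (aa, o), (ba, p) in succ.items() if aa == a)

        def gap(o):                              # bound gap of b^a_o in stage t+1
            ba, _ = succ[(a, o)]
            lb = max((float(alpha @ ba) for alpha in Gamma[t + 1]),
                     default=-np.inf)            # Gamma[t+1] may still be empty
            return sawtooth_ub(ba, B[t + 1], corner_ub[t + 1]) - lb

        a = max(range(len(r)), key=potential_ub)
        o = max((o for (aa, o) in succ if aa == a), key=gap)
        b = succ[(a, o)][0]
        B[t + 1].append((b, np.inf))             # new belief with upper bound infinity
\end{verbatim}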
The number of iterations performed by FiVI is bounded by the number of reachable beliefs, which is at most $(|A||O|)^h$. Reaching this bound is, however, unlikely, as the algorithm steers its exploration towards the promising areas of the belief space.
\subsection{Heuristics}
TBC
% \LinesNumbered
% \begin{algorithm}[H]
% \SetAlgoLined
% $\bar{v} \xleftarrow{} -\infty$ \\
% \For{$a \in A$}{
% $v \xleftarrow{} r_a \cdot b$ \\
% \If{t < h}{
% \For{$o \in O$}{
% \If{$P(o|b, a) > 0$}{
% $ v \xleftarrow{} v + P(o|b, a) \cdot $ UB$(b^a_o, B_{t+1}) $
% }
% }
% }
% $ \bar{v} \xleftarrow{} \operatorname{max}(\bar{v}, v) $ \\
% }
% \textbf{return} $\bar{v}$ \\
% \caption{Update Upper Bound}
% \end{algorithm}