---
title: "AMP for SLOPE"
author: "Bartłomiej Polaczyk"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  pdf_document:
    latex_engine: xelatex
  # html_document:
  #   css: AMP-SLOPE.css
  #   mathjax: local
  #   self_contained: false
abstract: |
  The aim of this project is to investigate the approximate message passing (AMP) algorithm for the SLOPE regularization problem based on [@bu2019algorithmic] and to compare it with classical convex optimization methods.
  Some numerical experiments covering cases that do not fit into the theoretical framework of [@bu2019algorithmic] are also performed and analyzed.
bibliography: AMP-SLOPE.bib
# header-includes:
# - \usepackage{mathtools}
---
%LaTeX commands
\newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\prox}{\operatorname{prox}}
\newcommand{\sign}{\operatorname{sign}}
\newcommand{\abs}[1]{\left\vert #1 \right\vert}
\newcommand{\norm}[1]{\left\Vert #1 \right\Vert}
%numbering equations in html format
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
TeX: { equationNumbers: { autoNumber: "AMS" } }
});
</script>
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
load("myData.RData")
# Libraries
packages <- c("foreign", "lubridate", "ggplot2", "reshape2")
lapply(packages, library, character.only = TRUE)
```
# Theoretical background
## Introduction
We are interested in solving the standard linear inverse problem
\begin{equation}\label{eq:LM-problem}
y = Ax + w,
\end{equation}
where $y\in \R^n$ and $A\in\R^{n\times p}$ are known parameters of the model, $w\in\R^n$ is a random noise vector and $x\in\R^p$ is an unknown vector of parameters we wish to estimate.
We assume $p\gg n$, i.e. the number of features is much greater than the sample size, and hence there may be many potential solutions to problem$~\eqref{eq:LM-problem}$.
To resolve this issue and prevent overfitting, we introduce a penalty function $\phi$ which favors sparse solutions of$~\eqref{eq:LM-problem}$, i.e. we now look for minimizers of the following form:
\begin{equation}\label{eq:SLOPE}
\widehat{x} = \operatorname*{argmin}_x \{\, \frac{1}{2}\Vert Ax - y \Vert_2^2 + \phi(x) \,\}.
\end{equation}
The usual choices of $\phi$ are the scaled $l^2$ penalty (Tikhonov regularization) and the $l^1$ penalty (LASSO).
This note concerns sorted $l^1$ penalized estimation (abbrev. SLOPE), introduced for the first time in [@MR3418717], which assumes $\phi$ to be the sorted $l^1$ penalty, i.e.
$$
\phi(x)= \phi_{\lambda}(x) = \sum_{i=1}^p \lambda_i x_i^{\downarrow},
$$
where $x_1^\downarrow \ge x_2^\downarrow \ge \ldots \ge x_p^\downarrow$ is the nonincreasing rearrangement of the vector $\abs{x}=(\abs{x_1},\abs{x_2},\ldots,\abs{x_p})$ and $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_p \ge 0$ are hyperparameters of the model.
To lighten notation, we denote $\Delta_p = \{\, x\in\R^p\colon x_1\ge x_2 \ge \ldots \ge x_p \ge 0 \,\}$, so that the above requirements read: $x^\downarrow,\lambda\in\Delta_p$, where $x^\downarrow = P\abs{x}$ for some permutation $P$ of the set $\{1,2,\ldots,p\}$.
Such a choice of regularizer is a generalization of the $l^1$ regularization, as can be seen by taking $\lambda_i\equiv\operatorname{const}$.
The fact that $\phi_\lambda$ is non-separable makes the analysis of its theoretical properties much more onerous than in the case of classical (separable) models, cf. [@MR3418717; @bu2019algorithmic].
Nonetheless, it turns out that SLOPE has two advantages over other regularization methods such as LASSO and knockoffs, namely:
1. it achieves certain minimax estimation properties under particular random designs *without* requiring any knowledge of the sparsity degree of $\widehat{x}$, cf. [@MR3852663];
2. it controls the false discovery rate in the case of independent predictors, cf. [@MR3418717].
We are interested in the algorithmic solutions to the problem $\eqref{eq:SLOPE}$.
Since the objective function in $\eqref{eq:SLOPE}$ is convex but not smooth, one cannot directly apply classical gradient descent and has to turn to other methods.
A natural alternative is the family of proximal algorithms, e.g. ISTA (and its accelerated version FISTA, cf. [@MR2486527]) or the more classical alternating direction method of multipliers (ADMM).
These methods have been thoroughly studied in the literature; cf. [@MR3719240] for a detailed treatment of the subject.
In this note we focus on another approach, via approximate message passing (AMP), considered for the first time in the context of the LASSO problem in [@donoho2009message], subsequently developed e.g. in [@MR2810285], and adapted to SLOPE regularization in [@bu2019algorithmic]; see e.g. [@zdeborova2016statistical] for an accessible derivation of the method.
In the following sections we briefly describe some of these approaches.
## Proximal methods
Denoting $g(x)=\frac{1}{2}\Vert Ax - y \Vert_2^2$, the iterative shrinkage thresholding algorithm (ISTA) iteration for SLOPE with given $\lambda$ can be written as:
******
**ISTA-SLOPE:**<br>
**Input:** $y\in\R^n$, $\lambda\in\Delta_p$, $A\in\R^{n\times p}$<br>
Initialize $g(x)=\frac{1}{2}\norm{Ax-y}^2$, $x\in\R^p$, $t>0$ <br>
**while** (*stopping condition*) **{** <br>
  $x \leftarrow \prox_{t\phi_\lambda}\big(x - t\nabla g(x)\big)$;<br>
**} return** $x$
******
where $t$ can be thought of as the learning rate and $\operatorname{prox}$ denotes the proximal operator given by
$$
\prox_{h}(y) := \operatorname*{argmin}_x \{\, h(x) + \frac{1}{2}\Vert x-y \Vert_2^2 \,\}.
$$
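For concreteness, a minimal R sketch of the ISTA-SLOPE loop could look as follows; here `prox_sorted_l1` stands for any routine computing $\prox_{\phi_\lambda}$ (e.g. the FastProxSL1 procedure described below) and is an assumed helper, not one of the functions used in the experiments later on.

```{r, eval=FALSE}
# A minimal ISTA-SLOPE sketch (assumption: prox_sorted_l1() computes prox_{phi_lambda})
ista_slope <- function(y, A, lambda, step, max_iter = 100, tol = 1e-8) {
  x <- rep(0, ncol(A))
  for (k in seq_len(max_iter)) {
    grad  <- drop(crossprod(A, A %*% x - y))             # gradient of g(x) = 0.5*||Ax - y||^2
    x_new <- prox_sorted_l1(x - step * grad, step * lambda)
    if (sqrt(sum((x_new - x)^2)) < tol) { x <- x_new; break }
    x <- x_new
  }
  x
}
```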
@MR2486527 introduced a faster version of ISTA, known as FISTA, which is based on the idea of Nesterov momentum.
The general form of the algorithm is the following:
******
**FISTA-SLOPE:**<br>
**Input:** $y\in\R^n$, $\lambda\in\Delta_p$, $A\in\R^{n\times p}$<br>
Initialize $g(x)=\frac{1}{2}\norm{Ax-y}^2$, $x=x_{old}\in\R^p$, $r = 1$, $t>0$<br>
**while** *(stopping condition)* **{**<br>
  $u \leftarrow x_{old} + r(x-x_{old})$;<br>
  $x_{old}\leftarrow x$;<br>
  $x \leftarrow \prox_{t\phi_\lambda}\big(u - t\nabla g(u)\big)$;<br>
  *update*($r$);<br>
**} return** $x$
******
Here $r$ can be thought of as an acceleration term which, if updated through a suitable update rule, can substantially increase the speed of convergence of the algorithm.
Note that keeping $r\equiv 1$ restores ISTA.
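As an illustration, the sketch below implements one standard Nesterov-type choice of the momentum schedule; this particular schedule is our assumption (several variants appear in [@MR2486527]), and `prox_sorted_l1` is the same assumed helper as above.

```{r, eval=FALSE}
# A minimal FISTA-SLOPE sketch with a Nesterov-type momentum schedule (an assumed choice)
fista_slope <- function(y, A, lambda, step, max_iter = 100) {
  x <- x_old <- rep(0, ncol(A))
  u <- x
  a <- 1                                                 # momentum sequence a_k
  for (k in seq_len(max_iter)) {
    grad  <- drop(crossprod(A, A %*% u - y))             # gradient of g at the extrapolated point
    x_old <- x
    x     <- prox_sorted_l1(u - step * grad, step * lambda)
    a_new <- (1 + sqrt(1 + 4 * a^2)) / 2
    u     <- x + ((a - 1) / a_new) * (x - x_old)         # extrapolation (momentum) step
    a     <- a_new
  }
  x
}
```

In the notation of the pseudocode above, this schedule corresponds to taking $r = 1 + (a_k-1)/a_{k+1}$.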
One of the difficulties in dealing with SLOPE is that the regularizer $\phi_\lambda$ is non-separable and thus its proximal operator cannot be applied element-wise.
[@MR3418717] have proposed an efficient algorithm that for any $u\in\R^p$ computes
$$
\widehat{x}=
\prox_{\phi_{\lambda}}(u) =
\operatorname*{argmin}_x \big\{ \frac{1}{2}\norm{u-x}^2 + \sum_i \lambda_ix_i^\downarrow \big\}
=
\operatorname*{argmin}_x \big\{\sum_i\big[ \frac{1}{2}(x^\downarrow_i)^2 + x_i^\downarrow\lambda_i - x_iu_i\big]\big\}
$$
It is based on the following simple observations that follow immediately from the above formulation:
1. $\sign(\widehat{x}_i)=\sign(u_i)$ for each $i$ such that $\widehat{x}_i\neq 0$;
2. $P\widehat{x}=\prox_{\phi_\lambda}(Pu)$ for any permutation $P$;
3. If $u\in \Delta_p$, then $\widehat{x}\in\Delta_p$ (i.e., $u=u^\downarrow \Rightarrow \widehat{x}=\widehat{x}^\downarrow$);
Therefore we can and do assume in the analysis below that $u\in\Delta_p$.
The optimization procedure now reads:
$$
\widehat{x} =
\operatorname*{argmin}_{x^\downarrow\in\Delta_p} \big\{\sum_i\big[ \frac{1}{2}(x^\downarrow_i)^2 - x_i^\downarrow(u-\lambda)_i \big]\big\},
$$
whence
4. $\widehat{x}$ depends only on the vector $(u - \lambda)$;
5. If $(u - \lambda)_+\in\Delta_p$ then $\widehat{x} = (u - \lambda)_+$, where $v_+ = (\max(v_i,0))_i$;
6. If $(u-\lambda)_i \le (u-\lambda)_{i+1}$, then $\widehat{x}_i \le \widehat{x}_{i+1}$, whence (since $\widehat{x}\in\Delta_p$) $\widehat{x}_i = \widehat{x}_{i+1}$;
7. Consequently, if $(u-\lambda)$ is nondecreasing along the indices $(i,i+1,\ldots,j)$, then $\widehat{x}_i = \widehat{x}_{i+1}= \ldots = \widehat{x}_{j}$ and
$$
\sum_{i\le r \le j} \widehat{x}_r^\downarrow(u-\lambda)_r
=
\sum_{i\le r \le j} \widehat{x}_r^\downarrow(\bar{u}-\bar{\lambda} ) \,
$$
where $\bar{u},\bar{\lambda}$ are the means of $u$ and $\lambda$ along the indices $(i,i+1,\ldots,j)$, so replacing $u_r,\lambda_r$ with $\bar{u},\bar{\lambda}$ does not change $\widehat{x}$.
The above observations justify the procedure below proposed by [@MR3418717].
******
**FastProxSL1:**<br>
**Input:** $u\in\R^p$, $\lambda\in\Delta_p$<br>
\# Define the operator $H_u(v) = P(\sign(u_i)v_i)_i$ for some permutation $P$, so that $H_u(u) = u^\downarrow\in\Delta_p$<br>
$u' \leftarrow H_u(u)$;<br>
**while** $(u'-\lambda)_+\notin\Delta_p$ **{**<br>
  identify nondecreasing and nonconstant segments $i:j$ of $(u'-\lambda)$<br>
  replace $u'_r,\lambda_r$ for $r\in\{i,i+1,\ldots,j\}$ by their averages $\bar{u'},\bar{\lambda}$ <br>
**} return** $H_u^{-1}(u'-\lambda)_+$;
******
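A possible R implementation of this procedure is sketched below (stack-based averaging of the violating segments); the function name `prox_sorted_l1` is our own choice, and the experiments in the next section use the project's routine from `R-codes/FastProxSL1.R` instead.

```{r, eval=FALSE}
# Sketch of FastProxSL1: prox of the sorted l1 penalty via stack-based averaging.
# u: input vector; lambda: nonincreasing, nonnegative weights of the same length.
prox_sorted_l1 <- function(u, lambda) {
  sgn <- sign(u)
  ord <- order(abs(u), decreasing = TRUE)        # H_u: drop signs and sort magnitudes
  v   <- abs(u)[ord] - lambda
  p   <- length(v)
  blk_start <- blk_end <- integer(p)
  blk_val   <- numeric(p)
  k <- 0
  for (i in seq_len(p)) {
    k <- k + 1
    blk_start[k] <- blk_end[k] <- i
    blk_val[k]   <- v[i]
    # merge adjacent segments as long as they violate the nonincreasing order,
    # replacing their entries by the segment average
    while (k > 1 && blk_val[k - 1] <= blk_val[k]) {
      n1 <- blk_end[k - 1] - blk_start[k - 1] + 1
      n2 <- blk_end[k] - blk_start[k] + 1
      blk_val[k - 1] <- (n1 * blk_val[k - 1] + n2 * blk_val[k]) / (n1 + n2)
      blk_end[k - 1] <- blk_end[k]
      k <- k - 1
    }
  }
  x_sorted <- numeric(p)
  for (j in seq_len(k)) x_sorted[blk_start[j]:blk_end[j]] <- max(blk_val[j], 0)  # (.)_+
  x <- numeric(p)
  x[ord] <- x_sorted                             # H_u^{-1}: undo the sorting ...
  sgn * x                                        # ... and restore the signs
}
```

For instance, in agreement with observation 5., `prox_sorted_l1(c(4, 2, 1), c(1, 1, 1))` should return `c(3, 1, 0)`.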
## Approximate message passing
The AMP approach is Bayesian in nature.
Namely, we assume that the true vector $x$ has i.i.d. coordinates drawn from some prior distribution, denoted $X=(X_1,\ldots,X_p)$, satisfying suitable integrability conditions, and that the noise vector $w$ has i.i.d. entries with zero mean and variance $\sigma_w^2<\infty$.
The prior distribution $X$ and the parameter $\sigma_w^2$ play a crucial role in the construction of the algorithm, as we will see shortly.
The basic form of the AMP algorithm for SLOPE is as follows:
******
**AMP-SLOPE:**<br>
**Input:** $y\in\R^n$, $\alpha\in\Delta_p$, $A\in\R^{n\times p}$<br>
Initialize: $g(x)=\frac{1}{2}\norm{Ax-y}^2$, $x=x_{old}\in\R^p$, $v=\nabla g(x)$, $\tau=\tau(X)\in\R_+$<br>
**while** *(stopping condition)* **{**<br>
  $x_{old}\leftarrow x$;<br>
  $x \leftarrow \prox_{\phi_{\tau\alpha}}\big(x_{old} - v\big)$;<br>
  $v \leftarrow \nabla g(x) + \frac{v}{n} [\nabla\prox_{\phi_{\tau\alpha}}(x_{old} - v)]$;<br>
  *update*($\tau$);<br>
**} return** $x$;
******
Based on observations 1.-7. above, it is easy to verify that for any vector $u\in\R^p$ (with $\nabla\prox$ understood as the divergence of the proximal map)
$$
\nabla\prox_{\phi_\lambda}(u) =
\Vert \prox_{\phi_\lambda}(u) \Vert_0^\ast,
\quad\text{where}\quad
\norm{u}_0^\ast:=
\#\{ \text{unique non-zero magnitudes in } \abs{u} \}.
$$
E.g., $\norm{(0,3,-3,3,1)}_0^\ast = 2$.
Therefore, the $v$ update in the AMP-SLOPE algorithm can be rewritten as
$$
v \leftarrow \nabla g(x) + \frac{v}{n} \norm{x}_0^\ast.
$$
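Putting the pieces together, a bare-bones R sketch of the AMP-SLOPE loop could look as follows; `prox_sorted_l1` is the assumed prox helper from the previous section, `norm0_star` implements $\norm{\cdot}_0^\ast$, and the update of $\tau$ (governed by the state evolution discussed next) is only indicated.

```{r, eval=FALSE}
# ||u||_0^*: number of unique non-zero magnitudes, e.g. norm0_star(c(0, 3, -3, 3, 1)) == 2
norm0_star <- function(u) length(unique(abs(u[u != 0])))

# Sketch of the AMP-SLOPE iteration with the Onsager-corrected residual term
amp_slope <- function(y, A, alpha, tau, max_iter = 25) {
  n <- nrow(A)
  x <- rep(0, ncol(A))
  v <- drop(crossprod(A, A %*% x - y))                   # v = grad g(x) at the start
  for (k in seq_len(max_iter)) {
    x_old <- x
    x <- prox_sorted_l1(x_old - v, tau * alpha)
    v <- drop(crossprod(A, A %*% x - y)) + (v / n) * norm0_star(x)  # grad g(x) + Onsager term
    # tau <- ...  (state evolution update, see below)
  }
  x
}
```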
Moreover, the scaling factor $\tau$ and its update rule in the AMP scheme are dictated by the so-called state evolution equation.
Namely, let
$$
F(\tau, \alpha) = \sigma_w^2 + \frac{1}{n} \E\norm{\prox_{\phi_{\tau\alpha}}(X+\tau Z) - X}^2,
\quad
\tau\in\R_+,\,\alpha\in\Delta_p,
$$
where $Z$ is a $\mathcal{N}(0,I_p)$ random vector independent of the rest of the model.
Then, setting
\begin{equation}\label{eq:SE}
\tau_0^2=\sigma_w^2 + \frac{p}{n}\mathbb{E} X_1^2,
\quad
\tau_{k+1}^2 = F(\tau_k,\alpha),
\quad
\tau_\ast = \lim_k \tau_k
\end{equation}
it can be verified that the stationary point of the AMP algorithm run with parameter $\alpha$ is also a minimizer of the SLOPE problem$~\eqref{eq:SLOPE}$ with $\lambda$ related to $\alpha$ via
\begin{equation}\label{eq:lambda-alpha}
\lambda = \alpha\tau_\ast\big( 1- \frac{1}{n}\mathbb{E}\Vert{\operatorname{prox}_{\phi_{\tau_\ast\alpha}}(X+\tau_\ast Z)}\Vert_0^\ast \big).
\end{equation}
@bu2019algorithmic have shown that, under some technical assumptions, the state evolution recursion$~\eqref{eq:SE}$ has a unique fixed point (i.e., $\tau_\ast$ is well defined) and that, for a given $\lambda$, the solution $\hat{\alpha}$ of$~\eqref{eq:lambda-alpha}$ is unique and well defined. They have also proposed a numerical scheme for computing $\hat{\alpha}$ based on the bisection method.
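Since the expectation appearing in $F$ is rarely available in closed form, it is natural to approximate it by Monte Carlo sampling from the prior. A sketch of such an approximation and of the resulting fixed-point iteration for $\tau_\ast$ is given below; the sample size and stopping rule are arbitrary choices, and the experiments in the next section rely on the project's own routines from `R-codes/alpha-lambda.R` instead.

```{r, eval=FALSE}
# Monte Carlo estimate of F(tau, alpha) = sigma_w^2 + (1/n) E ||prox_{phi_{tau*alpha}}(X + tau*Z) - X||^2
F_mc <- function(tau, alpha, n, prior, sigma2, mc_iter = 100) {
  vals <- replicate(mc_iter, {
    X <- prior()                                  # sample from the prior
    Z <- rnorm(length(X))                         # independent N(0, I_p) vector
    sum((prox_sorted_l1(X + tau * Z, tau * alpha) - X)^2)
  })
  sigma2 + mean(vals) / n
}

# Fixed-point iteration tau_{k+1}^2 = F(tau_k, alpha), started from tau_0^2 = sigma_w^2 + (p/n) E X_1^2
tau_star <- function(alpha, n, p, prior, sigma2, mc_iter = 100, max_iter = 50, tol = 1e-3) {
  tau <- sqrt(sigma2 + (p / n) * mean(prior()^2)) # crude one-sample estimate of tau_0
  for (k in seq_len(max_iter)) {
    tau_new <- sqrt(F_mc(tau, alpha, n, prior, sigma2, mc_iter))
    if (abs(tau_new - tau) < tol) return(tau_new)
    tau <- tau_new
  }
  tau
}
```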
# Numerical experiments
```{r, echo=TRUE}
source("R-codes/FastProxSL1.R")
source("R-codes/alpha-lambda.R")
source("R-codes/IterativeAlgs.R")
```
Parameters of the model:
```{r, eval=FALSE}
set.seed(351759)
p      <- 300                  # number of features
delta  <- 0.5                  # undersampling ratio n/p
eps    <- 0.1                  # sparsity level of the prior
sigma2 <- 0.2                  # noise variance
n      <- p * delta
A      <- matrix(rnorm(n * p, mean = 0, sd = 1 / sqrt(p * delta)), n, p)  # Gaussian design, entries of variance 1/n
alpha  <- 2 * c(rep(1, p * delta), rep(0, p * (1 - delta)))               # nonincreasing alpha sequence
stepSize <- 10
tol    <- 0
prior  <- function() rnorm(p) * rbinom(p, 1, eps)  # i.i.d. Gaussian-Bernoulli prior on x
x0     <- prior()
x      <- prior()
y      <- as.numeric(A %*% x + sqrt(sigma2) * rnorm(n))
tau    <- sqrt(sigma2 + eps / delta)               # initial tau_0 from the state evolution (E X_1^2 = eps)
tau_ast   <- alpha_to_tau_ast(alpha, x, delta, tau = tau, sigma2 = sigma2, verbose = FALSE, ergodic = TRUE)
lambdahat <- alpha_to_lambda(alpha, x, delta, tau_ast = tau_ast, sigma2 = sigma2)
print(paste0("True loss: ", loss <- sum((A %*% x - y)^2) / 2 + MapToDelta(x)$w %*% MapToDelta(lambdahat)$w))
```
Fixed point of the state evolution equation:
```{r, warning=FALSE, eval=FALSE}
t <- seq(from = 0, to = 4, length.out = 100)
Ft <- as.numeric(lapply(t, function(x) sqrt(F(x, alpha, delta, prior, sigma2, iter=10))))
```
Plot the SE:
```{r, warning=FALSE}
df <- data.frame(t, t, Ft)
colnames(df) <- c("t", "x=y", "sqrt(F)")
df <- melt(df, id = "t")
ggplot(data = df, aes(x = t, y = value, colour = variable)) +
  geom_line(size = .5) +
  ggtitle("AMP precision")
```
## Simulations
### Compare AMP with different MC precision
```{r, eval=FALSE}
AMP0 <- new("AMP-SLOPE",
y=y, A=A, prior=prior, tau=tau, lambda=lambdahat, tol=tol, max_iter=25,
alpha=alpha, sigma2=sigma2, F_iter=1)
AMP1 <- new("AMP-SLOPE",
y=y, A=A, prior=prior, tau=tau, lambda=lambdahat, tol=tol, max_iter=25,
alpha=alpha, sigma2=sigma2, F_iter=10)
AMP2 <- new("AMP-SLOPE",
y=y, A=A, prior=prior, tau=tau, lambda=lambdahat, tol=tol, max_iter=25,
alpha=alpha, sigma2=sigma2, F_iter=10^2)
AMP3 <- new("AMP-SLOPE",
y=y, A=A, prior=prior, tau=tau, lambda=lambdahat, tol=tol, max_iter=25,
alpha=alpha, sigma2=sigma2, F_iter=10^3)
AMP_res0 <- runAlg(AMP0)
AMP_res1 <- runAlg(AMP1)
AMP_res2 <- runAlg(AMP2)
AMP_res3 <- runAlg(AMP3)
```
Plot the results:
```{r, warning=FALSE}
t <- c(5:25)
df <- data.frame(t,
                 AMP_res0@loss[t],
                 AMP_res1@loss[t],
                 AMP_res2@loss[t],
                 AMP_res3@loss[t])
colnames(df) <- c("iteration", "AMP0", "AMP1", "AMP2", "AMP3")
df <- melt(df, id = "iteration")
ggplot(data = df, aes(x = iteration, y = value, colour = variable)) +
  geom_line(size = .5) +
  ggtitle("AMP precision")
```
### Compare different methods
```{r, eval=FALSE}
ISTA <- new("ISTA-SLOPE", y=y, A=A, x=x0, lambda=lambdahat, stepSize=stepSize,
max_iter=100, tol=tol, init_params=FALSE, verbose=FALSE)
ISTA_res <- runAlg(ISTA)
# print(paste0("ISTA loss=", paste0(tail(ISTA_res@loss,1), collapse = "; "),
# " after ",ISTA_res@iteration," iterations"))
FISTA <- new("FISTA-SLOPE", y=y, A=A, x=x0, lambda=lambdahat, stepSize=stepSize,
tol=tol, max_iter=100, verbose=FALSE)
FISTA_res <- runAlg(FISTA)
# print(paste0("FISTA loss=", paste0(tail(FISTA_res@loss,1), collapse = "; "),
# " after ",FISTA_res@iteration," iterations"))
```
Plot the results:
```{r, warning=FALSE}
t <- c(2:100)
df <- data.frame(t, ISTA_res@loss[t], FISTA_res@loss[t], AMP_res3@loss[t])
colnames(df) <- c("iteration", "ISTA", "FISTA", "AMP")
df <- melt(df, id = "iteration")
ggplot(data = df, aes(x = iteration, y = value, colour = variable)) +
  geom_line(size = .5) +
  scale_x_continuous(trans = 'log10') +
  ggtitle("Loss evolution")
```
## Conclusions
- AMP works only for noisy models.
- AMP is very resource-consuming due to the Monte Carlo simulation part; it could be greatly improved if the expectation in the state evolution could be computed analytically, which might be possible for an appropriately chosen prior.
- AMP converges faster than ISTA/FISTA to a rough approximation of the optimum, but the latter methods eventually give more precise estimates (they are more accurate due to the lack of Monte Carlo randomness).
# References