---
title: "AMP for SLOPE"
author: "Bartłomiej Polaczyk"
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  pdf_document:
    latex_engine: xelatex
  # html_document:
  #   css: AMP-SLOPE.css
  #   mathjax: local
  #   self_contained: false
abstract: |
  The aim of this project is to investigate the approximate message passing (AMP) algorithm for the SLOPE regularization problem based on [@bu2019algorithmic] and to compare it with classical convex optimization methods.
  Some numerical experiments covering cases that do not fit into the theoretical framework of [@bu2019algorithmic] are also performed and analyzed.
bibliography: AMP-SLOPE.bib
# header-includes:
# - \usepackage{mathtools}
---
%LaTeX commands
\newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\prox}{\operatorname{prox}}
\newcommand{\sign}{\operatorname{sign}}
\newcommand{\abs}[1]{\left\vert #1 \right\vert}
\newcommand{\norm}[1]{\left\Vert #1 \right\Vert}
%numbering equations in html format
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
TeX: { equationNumbers: { autoNumber: "AMS" } }
});
</script>
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
load("myData.RData")
# Libraries
packages <- c("foreign", "lubridate", "ggplot2", "reshape2")
lapply(packages, library, character.only = TRUE)
```
# Theoretical background
## Introduction
We are interested in solving the standard linear inverse problem
\begin{equation}\label{eq:LM-problem}
y = Ax + w,
\end{equation}
where $y\in \R^n$ and $A\in\R^{n\times p}$ are known parameters of the model, $w\in\R^n$ is a random noise vector and $x\in\R^p$ is an unknown vector of parameters we wish to estimate.
We assume $p\gg n$, i.e. the number of features is much greater than the sample size, and hence there may be many potential solutions to problem$~\eqref{eq:LM-problem}$.
To resolve this issue and prevent overfitting, we introduce a penalty function $\phi$ which favors sparse solutions of$~\eqref{eq:LM-problem}$, i.e. we now look for minimizers of the following form:
\begin{equation}\label{eq:SLOPE}
\widehat{x} = \operatorname*{argmin}_x \{\, \frac{1}{2}\Vert Ax - y \Vert_2^2 + \phi(x) \,\}.
\end{equation}
The usual choices of $\phi$ are the scaled $l^2$ penalty (Tikhonov regularization) and the $l^1$ penalty (LASSO).
This note concerns sorted $l^1$ penalized estimation (abbrev. SLOPE), introduced for the first time in [@MR3418717], which assumes $\phi$ to be the sorted $l^1$ penalty, i.e.
$$
\phi(x)= \phi_{\lambda}(x) = \sum_{i=1}^p \lambda_i x_i^{\downarrow},
$$
where $x_1^\downarrow \ge x_2^\downarrow \ge \ldots \ge x_p^\downarrow$ is the nonincreasing rearrangement of the vector $\abs{x}=(\abs{x_1},\abs{x_2},\ldots,\abs{x_p})$ and $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_p \ge 0$ are hyperparameters of the model.
To lighten notation, we denote $\Delta_p = \{\, x\in\R^p\colon x_1\ge x_2 \ge \ldots \ge x_p \ge 0 \,\}$, so that the above requirements read: $x^\downarrow,\lambda\in\Delta_p$, where $x^\downarrow = P\abs{x}$ for some permutation $P$ of the set $\{1,2,\ldots,p\}$.
Such a choice of regularizer is a generalization of the $l^1$ regularization, as can be seen by taking $\lambda_i\equiv\operatorname{const}$.
The fact that $\phi_\lambda$ is non-separable makes the analysis of its theoretical properties much more onerous than in the case of classical (separable) models, cf. [@MR3418717; @bu2019algorithmic].
Nonetheless, it turns out that SLOPE has two advantages over other regularization methods such as LASSO and knockoffs, namely:
1. it achieves certain minimax estimation properties under particular random designs *without* requiring any knowledge of the sparsity degree of $\widehat{x}$, cf. [@MR3852663];
2. it controls the false discovery rate in the case of independent predictors, cf. [@MR3418717].
We are interested in the algorithmic solutions to the problem $\eqref{eq:SLOPE}$.
Since the objective function in $\eqref{eq:SLOPE}$ is convex but not smooth, one cannot directly apply classical gradient descent and has to turn to other methods.
A natural alternative is the family of proximal algorithms, e.g. ISTA (and its accelerated version FISTA, cf. [@MR2486527]) or the more classical alternating direction method of multipliers (ADMM).
These methods have been thoroughly studied in the literature; cf. [@MR3719240] for a detailed treatment of the subject.
In this note we focus on another approach, via approximate message passing (AMP), considered for the first time in the context of the LASSO problem in [@donoho2009message], subsequently developed e.g. in [@MR2810285], and adapted to SLOPE regularization in [@bu2019algorithmic]; see e.g. [@zdeborova2016statistical] for an accessible derivation of the method.
In the following sections we briefly describe some of these approaches.
## Proximal methods
Denoting $g(x)=\frac{1}{2}\Vert Ax - y \Vert_2^2$, the iterative shrinkage thresholding algorithm (ISTA) iteration for SLOPE with given $\lambda$ can be written as:
******
**ISTA-SLOPE:**<br>
**Input:** $y\in\R^n$, $\lambda\in\Delta_p$, $A\in\R^{n\times p}$<br>
Initialize $g(x)=\frac{1}{2}\norm{Ax-y}^2$, $x\in\R^p$, $t>0$ <br>
**while** (*stopping condition*) **{** <br>
  $x \leftarrow \prox_{t\phi_\lambda}\big(x - t\nabla g(x)\big)$;<br>
**} return** $x$
******
where $t$ can be thought of as the learning rate and $\operatorname{prox}$ denotes the proximal operator given by
$$
\prox_{h}(y) := \operatorname*{argmin}_x \{\, h(x) + \frac{1}{2}\Vert x-y \Vert_2^2 \,\}.
$$
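For concreteness, a minimal R sketch of the ISTA-SLOPE loop could look as follows; here `prox_sorted_l1` stands for any routine computing $\prox_{\phi_\lambda}$ (e.g. the FastProxSL1 procedure described below) and is an assumed helper, not one of the functions used in the experiments later on.

```{r, eval=FALSE}
# A minimal ISTA-SLOPE sketch (assumption: prox_sorted_l1() computes prox_{phi_lambda})
ista_slope <- function(y, A, lambda, step, max_iter = 100, tol = 1e-8) {
  x <- rep(0, ncol(A))
  for (k in seq_len(max_iter)) {
    grad  <- drop(crossprod(A, A %*% x - y))             # gradient of g(x) = 0.5*||Ax - y||^2
    x_new <- prox_sorted_l1(x - step * grad, step * lambda)
    if (sqrt(sum((x_new - x)^2)) < tol) { x <- x_new; break }
    x <- x_new
  }
  x
}
```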
@MR2486527 introduced a faster version of ISTA, known as FISTA, which is based on the idea of Nesterov momentum.
The general form of the algorithm is the following:
******
**FISTA-SLOPE:**<br>
**Input:** $y\in\R^n$, $\lambda\in\Delta_p$, $A\in\R^{n\times p}$<br>
Initialize $g(x)=\frac{1}{2}\norm{Ax-y}^2$, $x=x_{old}\in\R^p$, $r = 1$, $t>0$<br>
**while** *(stopping condition)* **{**<br>
  $u \leftarrow x_{old} + r(x-x_{old})$;<br>
  $x_{old}\leftarrow x$;<br>
  $x \leftarrow \prox_{t\phi_\lambda}\big(u - t\nabla g(u)\big)$;<br>
  *update*($r$);<br>
**} return** $x$
******
Here $r$ can be thought of as an acceleration term which, if updated through a suitable update rule, can substantially increase the speed of convergence of the algorithm.
Note that keeping $r\equiv 1$ restores ISTA.
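As an illustration, the sketch below implements one standard Nesterov-type choice of the momentum schedule; this particular schedule is our assumption (several variants appear in [@MR2486527]), and `prox_sorted_l1` is the same assumed helper as above.

```{r, eval=FALSE}
# A minimal FISTA-SLOPE sketch with a Nesterov-type momentum schedule (an assumed choice)
fista_slope <- function(y, A, lambda, step, max_iter = 100) {
  x <- x_old <- rep(0, ncol(A))
  u <- x
  a <- 1                                                 # momentum sequence a_k
  for (k in seq_len(max_iter)) {
    grad  <- drop(crossprod(A, A %*% u - y))             # gradient of g at the extrapolated point
    x_old <- x
    x     <- prox_sorted_l1(u - step * grad, step * lambda)
    a_new <- (1 + sqrt(1 + 4 * a^2)) / 2
    u     <- x + ((a - 1) / a_new) * (x - x_old)         # extrapolation (momentum) step
    a     <- a_new
  }
  x
}
```

In the notation of the pseudocode above, this schedule corresponds to taking $r = 1 + (a_k-1)/a_{k+1}$.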
One of the difficulties in dealing with SLOPE is that the regularizer $\phi_\lambda$ is non-separable and thus its proximal operator cannot be applied element-wise.
[@MR3418717] have proposed an efficient algorithm that for any $u\in\R^p$ computes
$$
\widehat{x}=
\prox_{\phi_{\lambda}}(u) =
\operatorname*{argmin}_x \big\{ \frac{1}{2}\norm{u-x}^2 + \sum_i \lambda_ix_i^\downarrow \big\}
=
\operatorname*{argmin}_x \big\{\sum_i\big[ \frac{1}{2}(x^\downarrow_i)^2 + x_i^\downarrow\lambda_i - x_iu_i\big]\big\}
$$
It is based on the following simple observations that follow immediately from the above formulation:
1. $\sign(\widehat{x}_i)=\sign(u_i)$ for each $i$ such that $\widehat{x}_i\neq 0$;
2. $P\widehat{x}=\prox_{\phi_\lambda}(Pu)$ for any permutation $P$;
3. If $u\in \Delta_p$, then $\widehat{x}\in\Delta_p$ (i.e., $u=u^\downarrow \Rightarrow \widehat{x}=\widehat{x}^\downarrow$);
Therefore we can and do assume in the analysis below that $u\in\Delta_p$.
The optimization procedure now reads:
$$
\widehat{x} =
\operatorname*{argmin}_{x^\downarrow\in\Delta_p} \big\{\sum_i\big[ \frac{1}{2}(x^\downarrow_i)^2 - x_i^\downarrow(u-\lambda)_i \big]\big\},
$$
whence
4. $\widehat{x}$ depends only on the vector $(u - \lambda)$;
5. If $(u - \lambda)_+\in\Delta_p$ then $\widehat{x} = (u - \lambda)_+$, where $v_+ = (\max(v_i,0))_i$;
6. If $(u-\lambda)_i \le (u-\lambda)_{i+1}$, then $\widehat{x}_i \le \widehat{x}_{i+1}$, whence (since $\widehat{x}\in\Delta_p$) $\widehat{x}_i = \widehat{x}_{i+1}$;
7. Consequently, if $(u-\lambda)$ is nondecreasing along the indices $(i,i+1,\ldots,j)$, then $\widehat{x}_i = \widehat{x}_{i+1}= \ldots = \widehat{x}_{j}$ and
$$
\sum_{i\le r \le j} \widehat{x}_r^\downarrow(u-\lambda)_r
=
\sum_{i\le r \le j} \widehat{x}_r^\downarrow(\bar{u}-\bar{\lambda} ) \,
$$
where $\bar{u},\bar{\lambda}$ are the means of $u$ and $\lambda$ along the indices $(i,i+1,\ldots,j)$, so replacing $u_r,\lambda_r$ with $\bar{u},\bar{\lambda}$ does not change $\widehat{x}$.
The above observations justify the procedure below proposed by [@MR3418717].
******
**FastProxSL1:**<br>
**Input:** $u\in\R^p$, $\lambda\in\Delta_p$<br>
\# Define the operator $H_u(v) = P(\sign(u_i)v_i)_i$ for some permutation $P$, so that $H_u(u) = u^\downarrow\in\Delta_p$<br>
$u' \leftarrow H_u(u)$;<br>
**while** $(u'-\lambda)_+\notin\Delta_p$ **{**<br>
  identify nondecreasing and nonconstant segments $i:j$ of $(u'-\lambda)$<br>
  replace $u'_r,\lambda_r$ for $r\in\{i,i+1,\ldots,j\}$ by their averages $\bar{u'},\bar{\lambda}$ <br>
**} return** $H_u^{-1}(u'-\lambda)_+$;
******
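A possible R implementation of this procedure is sketched below (stack-based averaging of the violating segments); the function name `prox_sorted_l1` is our own choice, and the experiments in the next section use the project's routine from `R-codes/FastProxSL1.R` instead.

```{r, eval=FALSE}
# Sketch of FastProxSL1: prox of the sorted l1 penalty via stack-based averaging.
# u: input vector; lambda: nonincreasing, nonnegative weights of the same length.
prox_sorted_l1 <- function(u, lambda) {
  sgn <- sign(u)
  ord <- order(abs(u), decreasing = TRUE)        # H_u: drop signs and sort magnitudes
  v   <- abs(u)[ord] - lambda
  p   <- length(v)
  blk_start <- blk_end <- integer(p)
  blk_val   <- numeric(p)
  k <- 0
  for (i in seq_len(p)) {
    k <- k + 1
    blk_start[k] <- blk_end[k] <- i
    blk_val[k]   <- v[i]
    # merge adjacent segments as long as they violate the nonincreasing order,
    # replacing their entries by the segment average
    while (k > 1 && blk_val[k - 1] <= blk_val[k]) {
      n1 <- blk_end[k - 1] - blk_start[k - 1] + 1
      n2 <- blk_end[k] - blk_start[k] + 1
      blk_val[k - 1] <- (n1 * blk_val[k - 1] + n2 * blk_val[k]) / (n1 + n2)
      blk_end[k - 1] <- blk_end[k]
      k <- k - 1
    }
  }
  x_sorted <- numeric(p)
  for (j in seq_len(k)) x_sorted[blk_start[j]:blk_end[j]] <- max(blk_val[j], 0)  # (.)_+
  x <- numeric(p)
  x[ord] <- x_sorted                             # H_u^{-1}: undo the sorting ...
  sgn * x                                        # ... and restore the signs
}
```

For instance, in agreement with observation 5., `prox_sorted_l1(c(4, 2, 1), c(1, 1, 1))` should return `c(3, 1, 0)`.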
## Approximate message passing
The AMP approach is Bayesian in nature.
Namely, we assume that the true vector $x$ has i.i.d. coordinates drawn from some prior distribution, denoted $X=(X_1,\ldots,X_p)$, satisfying suitable integrability conditions, and that the noise vector $w$ has i.i.d. entries with zero mean and variance $\sigma_w^2<\infty$.
The prior distribution $X$ and the parameter $\sigma_w^2$ play a crucial role in the construction of the algorithm, as we will see shortly.
The basic form of the AMP algorithm for SLOPE is as follows:
******
**AMP-SLOPE:**<br>
**Input:** $y\in\R^n$, $\alpha\in\Delta_p$, $A\in\R^{n\times p}$<br>
Initialize: $g(x)=\frac{1}{2}\norm{Ax-y}^2$, $x=x_{old}\in\R^p$, $v=\nabla g(x)$, $\tau=\tau(X)\in\R_+$<br>
**while** *(stopping condition)* **{**<br>
  $x_{old}\leftarrow x$;<br>
  $x \leftarrow \prox_{\phi_{\tau\alpha}}\big(x_{old} - v\big)$;<br>
  $v \leftarrow \nabla g(x) + \frac{v}{n} [\nabla\prox_{\phi_{\tau\alpha}}(x_{old} - v)]$;<br>
  *update*($\tau$);<br>
**} return** $x$;
******
Based on observations 1.-7. above, it is easy to verify that for any vector $u\in\R^p$ (with $\nabla\prox$ understood as the divergence of the proximal map)
$$
\nabla\prox_{\phi_\lambda}(u) =
\Vert \prox_{\phi_\lambda}(u) \Vert_0^\ast,
\quad\text{where}\quad
\norm{u}_0^\ast:=
\#\{ \text{unique non-zero magnitudes in } \abs{u} \}.
$$
E.g., $\norm{(0,3,-3,3,1)}_0^\ast = 2$.
Therefore, the $v$ update in the AMP-SLOPE algorithm can be rewritten as
$$
v \leftarrow \nabla g(x) + \frac{v}{n} \norm{x}_0^\ast.
$$
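Putting the pieces together, a bare-bones R sketch of the AMP-SLOPE loop could look as follows; `prox_sorted_l1` is the assumed prox helper from the previous section, `norm0_star` implements $\norm{\cdot}_0^\ast$, and the update of $\tau$ (governed by the state evolution discussed next) is only indicated.

```{r, eval=FALSE}
# ||u||_0^*: number of unique non-zero magnitudes, e.g. norm0_star(c(0, 3, -3, 3, 1)) == 2
norm0_star <- function(u) length(unique(abs(u[u != 0])))

# Sketch of the AMP-SLOPE iteration with the Onsager-corrected residual term
amp_slope <- function(y, A, alpha, tau, max_iter = 25) {
  n <- nrow(A)
  x <- rep(0, ncol(A))
  v <- drop(crossprod(A, A %*% x - y))                   # v = grad g(x) at the start
  for (k in seq_len(max_iter)) {
    x_old <- x
    x <- prox_sorted_l1(x_old - v, tau * alpha)
    v <- drop(crossprod(A, A %*% x - y)) + (v / n) * norm0_star(x)  # grad g(x) + Onsager term
    # tau <- ...  (state evolution update, see below)
  }
  x
}
```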
Moreover, the scaling factor $\tau$ and its update rule in the AMP scheme are dictated by the so-called state evolution equation.
Namely, let
$$
F(\tau, \alpha) = \sigma_w^2 + \frac{1}{n} \E\norm{\prox_{\phi_{\tau\alpha}}(X+\tau Z) - X}^2,
\quad
\tau\in\R_+,\,\alpha\in\Delta_p,
$$
where $Z$ is a $\mathcal{N}(0,I_p)$ random vector independent of the rest of the model.
Then, setting
\begin{equation}\label{eq:SE}
\tau_0^2=\sigma_w^2 + \frac{p}{n}\mathbb{E} X_1^2,
\quad
\tau_{k+1}^2 = F(\tau_k,\alpha),
\quad
\tau_\ast = \lim_k \tau_k
\end{equation}
it can be verified that the stationary point of the AMP algorithm run with parameter $\alpha$ is also a minimizer of the SLOPE problem$~\eqref{eq:SLOPE}$ with $\lambda$ related to $\alpha$ via
\begin{equation}\label{eq:lambda-alpha}
\lambda = \alpha\tau_\ast\big( 1- \frac{1}{n}\mathbb{E}\Vert{\operatorname{prox}_{\phi_{\tau_\ast\alpha}}(X+\tau_\ast Z)}\Vert_0^\ast \big).
\end{equation}
@bu2019algorithmic have shown that, under some technical assumptions, the state evolution recursion$~\eqref{eq:SE}$ has a unique fixed point (i.e., $\tau_\ast$ is well defined) and that, for a given $\lambda$, the solution $\hat{\alpha}$ of$~\eqref{eq:lambda-alpha}$ is unique and well defined. They have also proposed a numerical scheme for computing $\hat{\alpha}$ based on the bisection method.
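Since the expectation appearing in $F$ is rarely available in closed form, it is natural to approximate it by Monte Carlo sampling from the prior. A sketch of such an approximation and of the resulting fixed-point iteration for $\tau_\ast$ is given below; the sample size and stopping rule are arbitrary choices, and the experiments in the next section rely on the project's own routines from `R-codes/alpha-lambda.R` instead.

```{r, eval=FALSE}
# Monte Carlo estimate of F(tau, alpha) = sigma_w^2 + (1/n) E ||prox_{phi_{tau*alpha}}(X + tau*Z) - X||^2
F_mc <- function(tau, alpha, n, prior, sigma2, mc_iter = 100) {
  vals <- replicate(mc_iter, {
    X <- prior()                                  # sample from the prior
    Z <- rnorm(length(X))                         # independent N(0, I_p) vector
    sum((prox_sorted_l1(X + tau * Z, tau * alpha) - X)^2)
  })
  sigma2 + mean(vals) / n
}

# Fixed-point iteration tau_{k+1}^2 = F(tau_k, alpha), started from tau_0^2 = sigma_w^2 + (p/n) E X_1^2
tau_star <- function(alpha, n, p, prior, sigma2, mc_iter = 100, max_iter = 50, tol = 1e-3) {
  tau <- sqrt(sigma2 + (p / n) * mean(prior()^2)) # crude one-sample estimate of tau_0
  for (k in seq_len(max_iter)) {
    tau_new <- sqrt(F_mc(tau, alpha, n, prior, sigma2, mc_iter))
    if (abs(tau_new - tau) < tol) return(tau_new)
    tau <- tau_new
  }
  tau
}
```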
# Numerical experiments
```{r, echo=TRUE}
source("R-codes/FastProxSL1.R")
source("R-codes/alpha-lambda.R")
source("R-codes/IterativeAlgs.R")
```
Parameters of the model:
```{r, eval=FALSE}
set.seed(351759)
p      <- 300                  # number of features
delta  <- 0.5                  # undersampling ratio n/p
eps    <- 0.1                  # sparsity level of the prior
sigma2 <- 0.2                  # noise variance
n      <- p * delta
A      <- matrix(rnorm(n * p, mean = 0, sd = 1 / sqrt(p * delta)), n, p)  # Gaussian design, entries of variance 1/n
alpha  <- 2 * c(rep(1, p * delta), rep(0, p * (1 - delta)))               # nonincreasing alpha sequence
stepSize <- 10
tol    <- 0
prior  <- function() rnorm(p) * rbinom(p, 1, eps)  # i.i.d. Gaussian-Bernoulli prior on x
x0     <- prior()
x      <- prior()
y      <- as.numeric(A %*% x + sqrt(sigma2) * rnorm(n))
tau    <- sqrt(sigma2 + eps / delta)               # initial tau_0 from the state evolution (E X_1^2 = eps)
tau_ast   <- alpha_to_tau_ast(alpha, x, delta, tau = tau, sigma2 = sigma2, verbose = FALSE, ergodic = TRUE)
lambdahat <- alpha_to_lambda(alpha, x, delta, tau_ast = tau_ast, sigma2 = sigma2)
print(paste0("True loss: ", loss <- sum((A %*% x - y)^2) / 2 + MapToDelta(x)$w %*% MapToDelta(lambdahat)$w))
```
Fixed point of the state evolution equation:
```{r, warning=FALSE, eval=FALSE}
t <- seq(from = 0, to = 4, length.out = 100)
Ft <- as.numeric(lapply(t, function(x) sqrt(F(x, alpha, delta, prior, sigma2, iter=10))))
```
Plot the SE:
```{r, warning=FALSE}
df <- data.frame(t, t, Ft)
colnames(df) <- c("t", "x=y", "sqrt(F)")
df <- melt(df, id = "t")
ggplot(data = df, aes(x = t, y = value, colour = variable)) +
  geom_line(size = .5) +
  ggtitle("AMP precision")
```
## Simulations
### Compare AMP with different MC precision
```{r, eval=FALSE}
AMP0 <- new("AMP-SLOPE",
y=y, A=A, prior=prior, tau=tau, lambda=lambdahat, tol=tol, max_iter=25,
alpha=alpha, sigma2=sigma2, F_iter=1)
AMP1 <- new("AMP-SLOPE",
y=y, A=A, prior=prior, tau=tau, lambda=lambdahat, tol=tol, max_iter=25,
alpha=alpha, sigma2=sigma2, F_iter=10)
AMP2 <- new("AMP-SLOPE",
y=y, A=A, prior=prior, tau=tau, lambda=lambdahat, tol=tol, max_iter=25,
alpha=alpha, sigma2=sigma2, F_iter=10^2)
AMP3 <- new("AMP-SLOPE",
y=y, A=A, prior=prior, tau=tau, lambda=lambdahat, tol=tol, max_iter=25,
alpha=alpha, sigma2=sigma2, F_iter=10^3)
AMP_res0 <- runAlg(AMP0)
AMP_res1 <- runAlg(AMP1)
AMP_res2 <- runAlg(AMP2)
AMP_res3 <- runAlg(AMP3)
```
Plot the results:
```{r, warning=FALSE}
t <- c(5:25)
df <- data.frame(t,
                 AMP_res0@loss[t],
                 AMP_res1@loss[t],
                 AMP_res2@loss[t],
                 AMP_res3@loss[t])
colnames(df) <- c("iteration", "AMP0", "AMP1", "AMP2", "AMP3")
df <- melt(df, id = "iteration")
ggplot(data = df, aes(x = iteration, y = value, colour = variable)) +
  geom_line(size = .5) +
  ggtitle("AMP precision")
```
### Compare different methods
```{r, eval=FALSE}
ISTA <- new("ISTA-SLOPE", y=y, A=A, x=x0, lambda=lambdahat, stepSize=stepSize,
max_iter=100, tol=tol, init_params=FALSE, verbose=FALSE)
ISTA_res <- runAlg(ISTA)
# print(paste0("ISTA loss=", paste0(tail(ISTA_res@loss,1), collapse = "; "),
# " after ",ISTA_res@iteration," iterations"))
FISTA <- new("FISTA-SLOPE", y=y, A=A, x=x0, lambda=lambdahat, stepSize=stepSize,
tol=tol, max_iter=100, verbose=FALSE)
FISTA_res <- runAlg(FISTA)
# print(paste0("FISTA loss=", paste0(tail(FISTA_res@loss,1), collapse = "; "),
# " after ",FISTA_res@iteration," iterations"))
```
Plot the results:
```{r, warning=FALSE}
t <- c(2:100)
df <- data.frame(t, ISTA_res@loss[t], FISTA_res@loss[t], AMP_res3@loss[t])
colnames(df) <- c("iteration", "ISTA", "FISTA", "AMP")
df <- melt(df, id = "iteration")
ggplot(data = df, aes(x = iteration, y = value, colour = variable)) +
  geom_line(size = .5) +
  scale_x_continuous(trans = 'log10') +
  ggtitle("Loss evolution")
```
## Conclusions
- AMP works only for noisy models.
- AMP is very resource-consuming due to the Monte Carlo simulation part; it could be greatly improved if the expectation in the state evolution could be computed analytically, which might be possible for an appropriately chosen prior.
- AMP converges faster than ISTA/FISTA to a rough approximation of the optimum, but the latter methods eventually give more precise estimates (they are more accurate due to the lack of Monte Carlo randomness).
# References