Quentin Duchemin

Connections between optimization methods

In this post, I will start by presenting some famous optimization algorithms. I will then highlight some interesting connections between them, which should give the reader a better understanding of the field.

I. Greedy methods

II. Primal methods

  1. Gradient descent

In this section, we consider a differentiable convex function f : \mathbb R^d \to \mathbb R, and we tackle the problem:

    \[ \min_{x \in \mathbb R^d}f(x).\]

Gradient descent is a simple algorithm for reaching a minimizer of f. Suppose we are
provided with an initial point x_0 \in \mathbb R^d. Then the gradient descent update is:

    \[ x_{k+1}:= x_k - \gamma_k \nabla f(x_k),\]

where \gamma_k is a step size to be tuned. The gradient descent method can be interpreted as minimizing a local quadratic approximation
of f:

    \[ f(x) \approx f(x_k)+\nabla f(x_k) ^{\top}( x-x_k) + \frac{1}{2\gamma_k} \|x-x_k\|^2_2.\]


where \gamma_k is a priori unknown. There are several ways to choose the step size \gamma_k. The first is to keep it constant; in this case, we require f to be gradient Lipschitz in order to ensure convergence. Another way is to perform an adaptive, local computation, called a line search. One example is the so-called backtracking line search, also known as Armijo's rule.
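As an illustration, here is a minimal NumPy sketch of gradient descent with an Armijo backtracking line search (the quadratic test function, A and b below are illustrative choices, not part of the post):

```python
import numpy as np

def gradient_descent(f, grad, x0, n_iter=100, gamma0=1.0, beta=0.5, c=1e-4):
    """Gradient descent with Armijo backtracking line search."""
    x = x0.astype(float)
    for _ in range(n_iter):
        g = grad(x)
        gamma = gamma0
        # Backtracking: shrink gamma until the Armijo sufficient-decrease
        # condition f(x - gamma*g) <= f(x) - c*gamma*||g||^2 holds.
        while f(x - gamma * g) > f(x) - c * gamma * (g @ g):
            gamma *= beta
        x = x - gamma * g
    return x

# Illustrative quadratic problem: f(x) = 1/2 x^T A x - b^T x,
# whose minimizer is the solution of A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
x_star = gradient_descent(f, grad, np.zeros(2))
```

Here the line search frees us from knowing the gradient Lipschitz constant in advance: the step starts at gamma0 and is halved until sufficient decrease is observed.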

    Drawbacks                    | Advantages
    Often slow                   | Each iteration has a small cost
    f needs to be differentiable | It is a first-order method

Gradient descent: pros and cons.
  • To address the first issue, methods to improve convergence are:
    quasi-Newton methods – conjugate gradient method – accelerated gradient method.
  • To address the second issue, methods for nondifferentiable or constrained problems are:
    subgradient method – proximal gradient method – smoothing methods – cutting-plane methods.

  2. Quasi-Newton

The Newton method comes from minimizing a second-order approximation of f around the current iterate x_k,

    \[f(x) \approx f(x_k) + \nabla f(x_k) ^{\top} (x-x_k) + \frac12 (x-x_k)^{\top} \nabla ^2 f(x_k) (x-x_k),\]


leading to the update rule

    \[x_{k+1} = x_k - \left(\nabla ^2f(x_k) \right)^{-1} \nabla f(x_k).\]


While the Newton method demonstrates fast convergence, it has the disadvantage of being expensive for large-scale applications, since it requires computing and inverting the Hessian. To overcome this, the Hessian can be approximated by a symmetric positive definite matrix H_k \in \mathbb R^{d\times d}, leading to the update x_{k+1} = x_k - H_k^{-1} \nabla f(x_k).
Hence, quasi-Newton methods differ in the way they compute the sequence of matrices (H_k). One of the most famous quasi-Newton methods is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method.
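The BFGS update can be sketched as follows; this is a minimal illustration (the Armijo line search, the quadratic test problem and all constants are illustrative choices), with H_k approximating the inverse Hessian so that no linear system needs to be solved:

```python
import numpy as np

def bfgs(f, grad, x0, n_iter=50, beta=0.5, c=1e-4):
    """Minimal BFGS sketch: H approximates the *inverse* Hessian and is
    updated from the pairs s = x_{k+1}-x_k, y = grad(x_{k+1})-grad(x_k)."""
    x = x0.astype(float)
    d = len(x)
    H = np.eye(d)                      # initial inverse-Hessian approximation
    g = grad(x)
    for _ in range(n_iter):
        p = -H @ g                     # quasi-Newton search direction
        gamma = 1.0                    # Armijo backtracking line search
        while f(x + gamma * p) > f(x) + c * gamma * (g @ p):
            gamma *= beta
        x_new = x + gamma * p
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        sy = s @ y
        if sy > 1e-12:                 # curvature condition keeps H pos. def.
            rho = 1.0 / sy
            V = np.eye(d) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Illustrative quadratic problem with known minimizer A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_star = bfgs(lambda x: 0.5 * x @ A @ x - b @ x,
              lambda x: A @ x - b, np.zeros(2))
```

Each iteration costs O(d^2) for the rank-two update of H, instead of the O(d^3) Hessian inversion of the Newton method.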

  3. Subgradient method

From now on, we no longer require f to be differentiable (but f remains convex). The subgradient method is arguably the simplest method for minimizing such an f. It is similar to gradient descent, but replaces gradients with subgradients:

    \[x_{k+1} = x_k - \gamma_k v_k,\]


where v_k \in \partial f(x_k) and \gamma_k is the step size.

Remark: Contrary to the negative gradient -\nabla f(x_k), a negative subgradient -v_k with v_k \in \partial f(x_k) is not a descent direction in general. This means that the subgradient method is not a descent method (f(x_{k+1})>f(x_k) can occur).
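Because of this remark, implementations typically track the best iterate seen so far. A minimal sketch with diminishing steps \gamma_k = \gamma_0/\sqrt{k+1} (the l1 test problem and constants are illustrative choices):

```python
import numpy as np

def subgradient_method(f, subgrad, x0, n_iter=20000, gamma0=1.0):
    """Subgradient method with diminishing steps gamma_k = gamma0/sqrt(k+1).
    Not a descent method, so we keep the best iterate seen so far."""
    x = x0.astype(float)
    x_best, f_best = x.copy(), f(x)
    for k in range(n_iter):
        v = subgrad(x)                      # any v in the subdifferential
        x = x - gamma0 / np.sqrt(k + 1) * v
        if f(x) < f_best:
            x_best, f_best = x.copy(), f(x)
    return x_best

# Illustrative nonsmooth problem: f(x) = ||x - b||_1, minimized at x = b.
b = np.array([1.0, -2.0])
f = lambda x: np.abs(x - b).sum()
subgrad = lambda x: np.sign(x - b)          # a subgradient of x -> ||x-b||_1
x_best = subgradient_method(f, subgrad, np.zeros(2))
```

The slow, sublinear convergence visible here (thousands of iterations for modest accuracy) is what motivates the faster proximal methods of the next section.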

  4. Proximal gradient method

Optimization problems in machine learning are often of the form:

    \[ \min_{ x \in \mathbb R^d} f(x) +  g(x),\]


where f : \mathbb R^d \to \mathbb R is a differentiable convex function and g : \mathbb R^d \to (-\infty,+\infty] is a convex function. In this section, we leverage the special structure of this problem to introduce algorithms that are fast
(compared to subgradient methods), even though the global function f+g is not differentiable.

Definition (Proximal operator). Let h: \mathbb R^d \to (-\infty,+\infty] be a proper, lower semi-continuous convex function. The proximal operator of h is defined by:

    \[ \forall x \in \mathbb R^d, \quad \mathrm{prox}_h (x) = \arg \min_{ u \in \mathbb R^d} \; h(u) + \frac12 \|x-u\|^2_2.\]


(By strong convexity, the arg min exists and is a singleton, so \mathrm{prox}_h is well defined.)

The following algorithm is workable in the setting described previously, that is when:

  1. f :\mathbb R^d \to \mathbb R is convex and differentiable;
  2. g : \mathbb R^d \to (-\infty,+\infty] is convex with an easy-to-compute proximal operator.

    \[x_{k+1} = \mathrm{prox}_{\gamma_k g}(x_k - \gamma_k \nabla f(x_k)),\]


where \gamma_k is a step size.

The interpretation of the proximal gradient method is very similar to that of gradient descent: it consists in minimizing a local quadratic approximation of f plus the original non-differentiable function g:

    \begin{align*}\forall x \in \mathbb R^d, \quad f(x)+g(x)&\approx f(x_k)+\nabla f(x_k)^{\top} (x-x_k) + \frac{1}{2\gamma_k} \|x-x_k\|^2_2+g(x)\\ &=g(x) + \frac{1}{2\gamma_k} \| x- \left(x_k - \gamma_k \nabla f(x_k)\right) \|^2_2 - \frac{\gamma_k}{2} \|\nabla f(x_k)\|^2_2 + f(x_k),\end{align*}


where \gamma_k is unknown a priori.
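The classical instance of this scheme is the ISTA algorithm for the lasso, where f(x) = \frac12\|Ax-b\|^2_2 and g = \lambda \|\cdot\|_1, whose proximal operator is componentwise soft-thresholding. A minimal sketch (the constant step 1/L with L = \|A\|^2_2 and the tiny test instance are illustrative choices):

```python
import numpy as np

def soft_threshold(x, t):
    """prox of t*||.||_1: componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def proximal_gradient(A, b, lam, n_iter=500):
    """ISTA sketch for min_x 1/2 ||Ax-b||^2 + lam*||x||_1:
    f(x) = 1/2||Ax-b||^2 has gradient A^T(Ax-b), and g = lam*||.||_1."""
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2   # constant step 1/L
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - gamma * A.T @ (A @ x - b), gamma * lam)
    return x

# Illustrative 1-D instance: min 1/2 (x-2)^2 + 0.5 |x|,
# whose solution is the soft-thresholded value 2 - 0.5 = 1.5.
x_hat = proximal_gradient(np.array([[1.0]]), np.array([2.0]), 0.5)
```

Note that each iteration only needs a gradient of the smooth part and one evaluation of the (closed-form) proximal operator of g.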

  5. Douglas-Rachford method

Here, we focus on the optimization problem:

    \[ \min_{x \in \mathbb R^d} f(x) + g(x),\]


where f :\mathbb R^d \to (-\infty,+\infty] and g :\mathbb R^d \to (-\infty,+\infty] are two convex functions. Contrary to the proximal gradient method, the Douglas-Rachford method does not require f to be differentiable. Starting from an initial point y_0 \in \mathbb R^d, the Douglas-Rachford algorithm reads:

    \begin{align*}x_k &= \mathrm{prox}_{\gamma_k f}(y_k)\\y_{k+1} &= y_k + \mu_k \left\{ \mathrm{prox}_{\gamma_k g}(2x_k-y_k)-x_k \right\},\end{align*}


where \gamma_k>0 is a step size and \mu_k \in (0,2).
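The two updates above translate directly into code. A minimal sketch with constant \gamma and \mu (the test problem, with f a quadratic and g an l1 penalty so that both proximal operators are in closed form, is an illustrative choice):

```python
import numpy as np

def soft_threshold(x, t):
    """prox of t*||.||_1: componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def douglas_rachford(prox_f, prox_g, y0, gamma=1.0, mu=1.0, n_iter=200):
    """Douglas-Rachford sketch; prox_f(y, g) must return prox_{g*f}(y)."""
    y = y0.astype(float)
    for _ in range(n_iter):
        x = prox_f(y, gamma)
        y = y + mu * (prox_g(2 * x - y, gamma) - x)
    return prox_f(y, gamma)

# Illustrative instance: f(x) = 1/2 ||x - b||^2, g = lam*||.||_1.
# The minimizer of f + g is the componentwise soft-thresholding of b.
b, lam = np.array([2.0, -0.3]), 0.5
prox_f = lambda y, g: (y + g * b) / (1 + g)     # prox of gamma*f
prox_g = lambda y, g: soft_threshold(y, g * lam)
x_star = douglas_rachford(prox_f, prox_g, np.zeros(2))
```

Only proximal operators of f and g are needed, never their gradients, which is why differentiability plays no role here.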


We can derive three special cases of the proximal gradient method:
– when g \equiv 0, the proximal gradient method is a gradient descent;
– when g \equiv \chi_C, the indicator function of a closed convex set C \subset \mathbb R^d, the proximal gradient method is a projected gradient descent;
– when f \equiv 0, we get the proximal point method.

III. Primal-Dual methods

Let us consider the optimization problem:

    \begin{equation} \min_{x \in \mathbb R^d} f(x) + g(Ax), \label{P10}\end{equation}


where f : \mathbb R^d \to (-\infty, +\infty] and g :\mathbb R^p \to (-\infty, +\infty] are proper convex and A \in \mathbb R^{p\times d}. This problem is equivalent to

    \begin{align*} \min_{x \in \mathbb R^d, \; y \in \mathbb R^p} \; &f(x)+g(y) \\& s.t. \; \;  Ax = y\end{align*}


The dual function to this last problem is:

    \begin{align*}\forall \nu \in \mathbb R^p, \quad G(\nu)&= \inf_{x \in \mathbb R^d, y \in \mathbb R^p} \left\{   f(x) + g(y) + \nu^{\top}Ax - \nu^{\top}y \right\}\\&=- \sup_{y \in\mathbb R^p} \left\{  \nu^{\top}y-g(y)\right\}- \sup_{x \in\mathbb R^d} \left\{ -\nu^{\top}Ax-f(x)\right\}\\&=-g^*(\nu)-f^*(-A^{\top}\nu).\end{align*}


Therefore, the dual problem of interest is:

    \begin{equation} \max_{\nu \in \mathbb R^p} -g^*(\nu)-f^*(-A^{\top}\nu). \label{P11}\end{equation}

Theorem
Let f :\mathbb R^d \to (-\infty,+\infty], g :\mathbb R^p \to (-\infty,+\infty] be proper convex functions and A \in \mathbb R^{p\times d}. Assume that either \mathrm{dom}(f)= \mathbb R^d or \mathrm{dom}(g)= \mathbb R^p, and that \exists x \in \mathbb R^d \; : \; Ax\in \mathrm{dom}(g).
If the optima are attained in Problem \eqref{P10} and Problem \eqref{P11}, then strong duality holds:

    \[ \min_{x \in \mathbb R^d} f(x) + g(Ax) =  \max_{\nu \in \mathbb R^p}- g^*(\nu)-f^*(-A^{\top}\nu).\]


Moreover, a primal-dual optimum is a solution to the saddle-point problem:

    \[ \min_{x \in \mathbb R^d} \max_{ \nu \in \mathbb R^p} f(x)+\nu^{\top}Ax-g^*(\nu).\]
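Strong duality can be checked numerically on a small smooth instance. Below, the choices f(x) = \frac12\|x\|^2_2 (so f^*(u) = \frac12\|u\|^2_2) and g(y) = \frac12\|y-b\|^2_2 (so g^*(\nu) = \nu^{\top}b + \frac12\|\nu\|^2_2) are illustrative, chosen so that both the primal and dual optima solve linear systems:

```python
import numpy as np

# Illustrative data: f(x) = 1/2||x||^2, g(y) = 1/2||y - b||^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))
b = rng.standard_normal(3)

# Primal optimum of min_x f(x) + g(Ax): stationarity gives
# x + A^T(Ax - b) = 0, i.e. (I + A^T A) x = A^T b.
x = np.linalg.solve(np.eye(2) + A.T @ A, A.T @ b)
primal = 0.5 * x @ x + 0.5 * np.sum((A @ x - b) ** 2)

# Dual optimum of max_v -g*(v) - f*(-A^T v): stationarity gives
# -b - v - A A^T v = 0, i.e. (I + A A^T) v = -b.
v = np.linalg.solve(np.eye(3) + A @ A.T, -b)
dual = -(v @ b + 0.5 * v @ v) - 0.5 * np.sum((A.T @ v) ** 2)
```

Since both domains are all of \mathbb R^d and \mathbb R^p, the qualification of the theorem holds; the two optimal values agree, and the saddle-point relation x = -A^{\top}\nu links the primal and dual solutions.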

We have seen previously that the proximal gradient method, used for minimizing a composite objective function, reduces to:

  1. the gradient method when g \equiv 0
  2. the proximal point method when f \equiv 0.

In this section, we apply these two simple algorithms to the dual problem \eqref{P11} to devise primal-dual methods.