Quentin Duchemin

Connections between optimization methods

In this post, I will start by presenting some famous optimization algorithms. I will then highlight some interesting connections between them, which should give the reader a better understanding of the field.

I. Greedy methods

II. Primal methods

  1. Gradient descent

In this section, we consider a differentiable convex function f : \mathbb R^d \to \mathbb R, and we tackle the problem:

    \[ \min_{x \in \mathbb R^d}f(x).\]

Gradient descent is a simple algorithm for reaching a minimizer of f. Suppose we are
provided with an initial point x_0 \in \mathbb R^d. Then the gradient descent update is:

    \[ x_{k+1}:= x_k - \gamma_k \nabla f(x_k),\]

where \gamma_k is a step size to be tuned. The gradient descent method can be interpreted as minimizing a local quadratic approximation
of f:

    \[ f(x) \approx f(x_k)+\nabla f(x_k) ^{\top}( x-x_k) + \frac{1}{2\gamma_k} \|x-x_k\|^2_2.\]


where \gamma_k is a priori unknown. There are several ways to choose the step size \gamma_k. The first is to keep it constant; in this case, we require f to be gradient Lipschitz in order to ensure convergence. Another way is to perform an adaptive, local computation, called a line search. One example is the so-called backtracking line search, also known as Armijo's rule.
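As an illustration, here is a minimal NumPy sketch of gradient descent with an Armijo backtracking line search (the quadratic test function, A and b below are illustrative choices, not part of the post):

```python
import numpy as np

def gradient_descent(f, grad, x0, n_iter=100, gamma0=1.0, beta=0.5, c=1e-4):
    """Gradient descent with Armijo backtracking line search."""
    x = x0.astype(float)
    for _ in range(n_iter):
        g = grad(x)
        gamma = gamma0
        # Backtracking: shrink gamma until the Armijo sufficient-decrease
        # condition f(x - gamma*g) <= f(x) - c*gamma*||g||^2 holds.
        while f(x - gamma * g) > f(x) - c * gamma * (g @ g):
            gamma *= beta
        x = x - gamma * g
    return x

# Illustrative quadratic problem: f(x) = 1/2 x^T A x - b^T x,
# whose minimizer is the solution of A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
x_star = gradient_descent(f, grad, np.zeros(2))
```

Here the line search frees us from knowing the gradient Lipschitz constant in advance: the step starts at gamma0 and is halved until sufficient decrease is observed.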

    Drawbacks                    | Advantages
    Often slow                   | Each iteration has a small cost
    f needs to be differentiable | It is a first-order method

Gradient descent: pros and cons.
  • To address the first issue, methods to improve convergence are:
    quasi-Newton methods – conjugate gradient method – accelerated gradient method.
  • To address the second issue, methods for nondifferentiable or constrained problems are:
    subgradient method – proximal gradient method – smoothing methods – cutting-plane methods.

  2. Quasi-Newton

The Newton method comes from minimizing a second-order approximation of f around the current iterate x_k,

    \[f(x) \approx f(x_k) + \nabla f(x_k) ^{\top} (x-x_k) + \frac12 (x-x_k)^{\top} \nabla ^2 f(x_k) (x-x_k),\]


leading to the update rule

    \[x_{k+1} = x_k - \left(\nabla ^2f(x_k) \right)^{-1} \nabla f(x_k).\]


While the Newton method demonstrates fast convergence, it has the disadvantage of being expensive for large-scale applications, since it requires computing and inverting the Hessian. To overcome this, the Hessian can be approximated by a symmetric positive definite matrix H_k \in \mathbb R^{d\times d}, leading to the update x_{k+1} = x_k - H_k^{-1} \nabla f(x_k).
Hence, quasi-Newton methods differ in the way they compute the sequence of matrices (H_k). One of the most famous quasi-Newton methods is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method.
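The BFGS update can be sketched as follows; this is a minimal illustration (the Armijo line search, the quadratic test problem and all constants are illustrative choices), with H_k approximating the inverse Hessian so that no linear system needs to be solved:

```python
import numpy as np

def bfgs(f, grad, x0, n_iter=50, beta=0.5, c=1e-4):
    """Minimal BFGS sketch: H approximates the *inverse* Hessian and is
    updated from the pairs s = x_{k+1}-x_k, y = grad(x_{k+1})-grad(x_k)."""
    x = x0.astype(float)
    d = len(x)
    H = np.eye(d)                      # initial inverse-Hessian approximation
    g = grad(x)
    for _ in range(n_iter):
        p = -H @ g                     # quasi-Newton search direction
        gamma = 1.0                    # Armijo backtracking line search
        while f(x + gamma * p) > f(x) + c * gamma * (g @ p):
            gamma *= beta
        x_new = x + gamma * p
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        sy = s @ y
        if sy > 1e-12:                 # curvature condition keeps H pos. def.
            rho = 1.0 / sy
            V = np.eye(d) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Illustrative quadratic problem with known minimizer A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_star = bfgs(lambda x: 0.5 * x @ A @ x - b @ x,
              lambda x: A @ x - b, np.zeros(2))
```

Each iteration costs O(d^2) for the rank-two update of H, instead of the O(d^3) Hessian inversion of the Newton method.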

  3. Subgradient method

From now on, we no longer require f to be differentiable (but f remains convex). The subgradient method is arguably the simplest method for minimizing such an f. It is similar to gradient descent, but replaces gradients with subgradients:

    \[x_{k+1} = x_k - \gamma_k v_k,\]


where v_k \in \partial f(x_k) and \gamma_k is the step size.

Remark: Contrary to the negative gradient -\nabla f(x_k), a negative subgradient -v_k with v_k \in \partial f(x_k) is not a descent direction in general. This means that the subgradient method is not a descent method (f(x_{k+1})>f(x_k) can occur).
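Because of this remark, implementations typically track the best iterate seen so far. A minimal sketch with diminishing steps \gamma_k = \gamma_0/\sqrt{k+1} (the l1 test problem and constants are illustrative choices):

```python
import numpy as np

def subgradient_method(f, subgrad, x0, n_iter=20000, gamma0=1.0):
    """Subgradient method with diminishing steps gamma_k = gamma0/sqrt(k+1).
    Not a descent method, so we keep the best iterate seen so far."""
    x = x0.astype(float)
    x_best, f_best = x.copy(), f(x)
    for k in range(n_iter):
        v = subgrad(x)                      # any v in the subdifferential
        x = x - gamma0 / np.sqrt(k + 1) * v
        if f(x) < f_best:
            x_best, f_best = x.copy(), f(x)
    return x_best

# Illustrative nonsmooth problem: f(x) = ||x - b||_1, minimized at x = b.
b = np.array([1.0, -2.0])
f = lambda x: np.abs(x - b).sum()
subgrad = lambda x: np.sign(x - b)          # a subgradient of x -> ||x-b||_1
x_best = subgradient_method(f, subgrad, np.zeros(2))
```

The slow, sublinear convergence visible here (thousands of iterations for modest accuracy) is what motivates the faster proximal methods of the next section.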

  4. Proximal gradient method

Optimization problems in machine learning are often of the form:

    \[ \min_{ x \in \mathbb R^d} f(x) +  g(x),\]


where f : \mathbb R^d \to \mathbb R is a differentiable convex function and g : \mathbb R^d \to (-\infty,+\infty] is a convex function. In this section, we leverage the special structure of this problem to introduce algorithms that are fast
(compared to subgradient methods), even though the global function f+g is not differentiable.

Definition (Proximal operator). Let h: \mathbb R^d \to (-\infty,+\infty] be a proper, lower semi-continuous convex function. The proximal operator of h is defined by:

    \[ \forall x \in \mathbb R^d, \quad \mathrm{prox}_h (x) = \arg \min_{ u \in \mathbb R^d} \; h(u) + \frac12 \|x-u\|^2_2.\]


(By strong convexity, the arg min exists and is a singleton, so \mathrm{prox}_h is well defined.)

The following algorithm is workable in the setting described previously, that is when:

  1. f :\mathbb R^d \to \mathbb R is convex and differentiable;
  2. g : \mathbb R^d \to (-\infty,+\infty] is convex with an easy-to-compute proximal operator.

    \[x_{k+1} = \mathrm{prox}_{\gamma_k g}(x_k - \gamma_k \nabla f(x_k)),\]


where \gamma_k is a step size.

The interpretation of the proximal gradient method is very similar to that of gradient descent: it consists in minimizing a local quadratic approximation of f plus the original non-differentiable function g:

    \begin{align*}\forall x \in \mathbb R^d, \quad f(x)+g(x)&\approx f(x_k)+\nabla f(x_k)^{\top} (x-x_k) + \frac{1}{2\gamma_k} \|x-x_k\|^2_2+g(x)\\ &=g(x) + \frac{1}{2\gamma_k} \| x- \left(x_k - \gamma_k \nabla f(x_k)\right) \|^2_2 - \frac{\gamma_k}{2} \|\nabla f(x_k)\|^2_2 + f(x_k),\end{align*}


where \gamma_k is unknown a priori.
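The classical instance of this scheme is the ISTA algorithm for the lasso, where f(x) = \frac12\|Ax-b\|^2_2 and g = \lambda \|\cdot\|_1, whose proximal operator is componentwise soft-thresholding. A minimal sketch (the constant step 1/L with L = \|A\|^2_2 and the tiny test instance are illustrative choices):

```python
import numpy as np

def soft_threshold(x, t):
    """prox of t*||.||_1: componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def proximal_gradient(A, b, lam, n_iter=500):
    """ISTA sketch for min_x 1/2 ||Ax-b||^2 + lam*||x||_1:
    f(x) = 1/2||Ax-b||^2 has gradient A^T(Ax-b), and g = lam*||.||_1."""
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2   # constant step 1/L
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - gamma * A.T @ (A @ x - b), gamma * lam)
    return x

# Illustrative 1-D instance: min 1/2 (x-2)^2 + 0.5 |x|,
# whose solution is the soft-thresholded value 2 - 0.5 = 1.5.
x_hat = proximal_gradient(np.array([[1.0]]), np.array([2.0]), 0.5)
```

Note that each iteration only needs a gradient of the smooth part and one evaluation of the (closed-form) proximal operator of g.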

  5. Douglas-Rachford method

Here, we focus on the optimization problem:

    \[ \min_{x \in \mathbb R^d} f(x) + g(x),\]


where f :\mathbb R^d \to (-\infty,+\infty] and g :\mathbb R^d \to (-\infty,+\infty] are two convex functions. Contrary to the proximal gradient method, the Douglas-Rachford method does not require f to be differentiable. Starting from an initial point y_0 \in \mathbb R^d, the Douglas-Rachford algorithm reads:

    \begin{align*}x_k &= \mathrm{prox}_{\gamma_k f}(y_k)\\y_{k+1} &= y_k + \mu_k \left\{ \mathrm{prox}_{\gamma_k g}(2x_k-y_k)-x_k \right\},\end{align*}


where \gamma_k>0 is a step size and \mu_k \in (0,2).
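The two updates above translate directly into code. A minimal sketch with constant \gamma and \mu (the test problem, with f a quadratic and g an l1 penalty so that both proximal operators are in closed form, is an illustrative choice):

```python
import numpy as np

def soft_threshold(x, t):
    """prox of t*||.||_1: componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def douglas_rachford(prox_f, prox_g, y0, gamma=1.0, mu=1.0, n_iter=200):
    """Douglas-Rachford sketch; prox_f(y, g) must return prox_{g*f}(y)."""
    y = y0.astype(float)
    for _ in range(n_iter):
        x = prox_f(y, gamma)
        y = y + mu * (prox_g(2 * x - y, gamma) - x)
    return prox_f(y, gamma)

# Illustrative instance: f(x) = 1/2 ||x - b||^2, g = lam*||.||_1.
# The minimizer of f + g is the componentwise soft-thresholding of b.
b, lam = np.array([2.0, -0.3]), 0.5
prox_f = lambda y, g: (y + g * b) / (1 + g)     # prox of gamma*f
prox_g = lambda y, g: soft_threshold(y, g * lam)
x_star = douglas_rachford(prox_f, prox_g, np.zeros(2))
```

Only proximal operators of f and g are needed, never their gradients, which is why differentiability plays no role here.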


We can derive three special cases of the proximal gradient method:
– when g \equiv 0, the proximal gradient method is a gradient descent;
– when g \equiv \chi_C, the indicator function of a closed convex set C \subset \mathbb R^d, the proximal gradient method is a projected gradient descent;
– when f \equiv 0, we get the proximal point method.

III. Primal-Dual methods

Let us consider the optimization problem:

    \begin{equation} \min_{x \in \mathbb R^d} f(x) + g(Ax), \label{P10}\end{equation}


where f : \mathbb R^d \to (-\infty, +\infty] and g :\mathbb R^p \to (-\infty, +\infty] are proper convex and A \in \mathbb R^{p\times d}. This problem is equivalent to

    \begin{align*} \min_{x \in \mathbb R^d, \; y \in \mathbb R^p} \; &f(x)+g(y) \\& s.t. \; \;  Ax = y\end{align*}


The dual function to this last problem is:

    \begin{align*}\forall \nu \in \mathbb R^p, \quad G(\nu)&= \inf_{x \in \mathbb R^d, y \in \mathbb R^p} \left\{   f(x) + g(y) + \nu^{\top}Ax - \nu^{\top}y \right\}\\&=- \sup_{y \in\mathbb R^p} \left\{  \nu^{\top}y-g(y)\right\}- \sup_{x \in\mathbb R^d} \left\{ -\nu^{\top}Ax-f(x)\right\}\\&=-g^*(\nu)-f^*(-A^{\top}\nu).\end{align*}


Therefore, the dual problem of interest is:

    \begin{equation} \max_{\nu \in \mathbb R^p} -g^*(\nu)-f^*(-A^{\top}\nu). \label{P11}\end{equation}

Theorem
Let f :\mathbb R^d \to (-\infty,+\infty], g :\mathbb R^p \to (-\infty,+\infty] be proper convex functions and A \in \mathbb R^{p\times d}. Assume that either \mathrm{dom}(f)= \mathbb R^d or \mathrm{dom}(g)= \mathbb R^p, and that \exists x \in \mathbb R^d \; : \; Ax\in \mathrm{dom}(g).
If the optima are attained in Problem \eqref{P10} and Problem \eqref{P11}, then strong duality holds:

    \[ \min_{x \in \mathbb R^d} f(x) + g(Ax) =  \max_{\nu \in \mathbb R^p}- g^*(\nu)-f^*(-A^{\top}\nu).\]


Moreover, a primal-dual optimum is a solution to the saddle-point problem:

    \[ \min_{x \in \mathbb R^d} \max_{ \nu \in \mathbb R^p} f(x)+\nu^{\top}Ax-g^*(\nu).\]
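Strong duality can be checked numerically on a small smooth instance. Below, the choices f(x) = \frac12\|x\|^2_2 (so f^*(u) = \frac12\|u\|^2_2) and g(y) = \frac12\|y-b\|^2_2 (so g^*(\nu) = \nu^{\top}b + \frac12\|\nu\|^2_2) are illustrative, chosen so that both the primal and dual optima solve linear systems:

```python
import numpy as np

# Illustrative data: f(x) = 1/2||x||^2, g(y) = 1/2||y - b||^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))
b = rng.standard_normal(3)

# Primal optimum of min_x f(x) + g(Ax): stationarity gives
# x + A^T(Ax - b) = 0, i.e. (I + A^T A) x = A^T b.
x = np.linalg.solve(np.eye(2) + A.T @ A, A.T @ b)
primal = 0.5 * x @ x + 0.5 * np.sum((A @ x - b) ** 2)

# Dual optimum of max_v -g*(v) - f*(-A^T v): stationarity gives
# -b - v - A A^T v = 0, i.e. (I + A A^T) v = -b.
v = np.linalg.solve(np.eye(3) + A @ A.T, -b)
dual = -(v @ b + 0.5 * v @ v) - 0.5 * np.sum((A.T @ v) ** 2)
```

Since both domains are all of \mathbb R^d and \mathbb R^p, the qualification of the theorem holds; the two optimal values agree, and the saddle-point relation x = -A^{\top}\nu links the primal and dual solutions.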

We have seen previously that the proximal gradient method, used for minimizing a composite objective function, reduces to:

  1. the gradient method when g \equiv 0
  2. the proximal point method when f \equiv 0.

In this section, we apply these two simple algorithms to the dual problem \eqref{P11} to devise primal-dual methods.