Optimization#

This package contains a collection of routines that solve optimization problems of the form

\[\begin{equation} \label{eq:min_prob} \tag{1} \underset{x}{\arg\min} \quad \datafid{x}{y} + \lambda \reg{x}, \end{equation}\]

where the first term \(\datafidname:\xset\times\yset \mapsto \mathbb{R}_{+}\) enforces data fidelity, the second term \(\regname:\xset\mapsto \mathbb{R}_{+}\) acts as a regularizer, and \(\lambda > 0\) is a regularization parameter. More precisely, the data-fidelity term penalizes the discrepancy between the data \(y\) and the forward operator \(A\) applied to the variable \(x\), as

\[\datafid{x}{y} = \distance{A(x)}{y},\]

where \(\distance{\cdot}{\cdot}\) is a distance function and \(A:\xset\mapsto \yset\) is the forward operator (see deepinv.physics.Physics).

Note

The regularization term often (but not always) depends on a hyperparameter \(\sigma\) that can be either fixed or estimated. For example, if the regularization is implicitly defined by a denoiser, the hyperparameter is the noise level.

A typical example of (1) is the \(\ell_1\)-regularized least squares problem, where the data-fidelity term is the squared \(\ell_2\)-norm and the regularization term is the \(\ell_1\)-norm. In this case, a possible algorithm for solving the problem is the Proximal Gradient Descent (PGD) algorithm, whose iterations read

\[\qquad x_{k+1} = \operatorname{prox}_{\gamma \lambda \regname} \left( x_k - \gamma \nabla \datafidname(x_k, y) \right),\]

where \(\operatorname{prox}_{\gamma \lambda \regname}\) is the proximity operator of \(\gamma\lambda\regname\), \(\gamma\) is the step size of the algorithm, and \(\nabla \datafidname\) is the gradient of the data-fidelity term.

The following example illustrates the implementation of the PGD algorithm with DeepInverse to solve a least squares problem with a total variation regularizer (the \(\ell_{1,2}\)-norm of the finite differences of the image, a close relative of the \(\ell_1\) regularizer above).

>>> import torch
>>> import deepinv as dinv
>>> from deepinv.optim import L2, TVPrior
>>>
>>> # Forward operator, here inpainting
>>> mask = torch.ones((1, 2, 2))
>>> mask[0, 0, 0] = 0
>>> physics = dinv.physics.Inpainting(tensor_size=mask.shape, mask=mask)
>>> # Generate data
>>> x = torch.ones((1, 1, 2, 2))
>>> y = physics(x)
>>> data_fidelity = L2()  # The data fidelity term
>>> prior = TVPrior()  # The prior term
>>> lambd = 0.1  # Regularization parameter
>>> # Compute the squared norm of the operator A
>>> norm_A2 = physics.compute_norm(y, tol=1e-4, verbose=False).item()
>>> stepsize = 1/norm_A2  # stepsize for the PGD algorithm
>>>
>>> # PGD algorithm
>>> max_iter = 20  # number of iterations
>>> x_k = torch.zeros_like(x)  # initial guess
>>>
>>> for it in range(max_iter):
...     u = x_k - stepsize*data_fidelity.grad(x_k, y, physics)  # Gradient step
...     x_k = prior.prox(u, gamma=lambd*stepsize)  # Proximal step
...     cost = data_fidelity(x_k, y, physics) + lambd*prior(x_k)  # Compute the cost
...
>>> print(cost < 1e-5)
tensor([True])
>>> print('Estimated solution: ', x_k.flatten())
Estimated solution:  tensor([1.0000, 1.0000, 1.0000, 1.0000])

Potentials#

The class deepinv.optim.Potential implements scalar potential functions \(h : \xset \to \mathbb{R}\) used to define an optimization problem. For example, both the data-fidelity term \(\datafidname\) and the regularization \(\regname\) are potentials. This class comes with methods for computing operators that are useful for optimization, such as the proximal operator \(\operatorname{prox}_{h}\), the gradient \(\nabla h\), the convex conjugate \(h^*\), etc.
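
For instance, the deepinv.optim.Tikhonov prior introduced below is a Potential, so it exposes this common interface. The following minimal sketch assumes the default constructor and the grad/prox methods inherited from deepinv.optim.Potential (check the class reference for the exact signatures):

>>> import torch
>>> import deepinv as dinv
>>>
>>> h = dinv.optim.Tikhonov()       # an explicit prior, hence also a Potential
>>> x = torch.randn(1, 1, 2, 2)
>>> val = h(x)                      # evaluate h(x)
>>> grad_h = h.grad(x)              # gradient of h at x
>>> prox_h = h.prox(x, gamma=1.0)   # proximal operator of gamma*h at x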

The following classes inherit from deepinv.optim.Potential:

| Class | \(h(x)\) | Requires |
|---|---|---|
| deepinv.optim.Bregman | \(\phi(x)\) with \(\phi\) convex | None |
| deepinv.optim.Distance | \(d(x,y)\) | \(y\) |
| deepinv.optim.DataFidelity | \(d(A(x),y)\) where \(d\) is a distance | \(y\) and operator \(A\) |
| deepinv.optim.Prior | \(g_{\sigma}(x)\) | optional denoising level \(\sigma\) |

Bregman#

Bregman potentials are convex scalar functions \(\phi(x)\) of a variable \(x\in\xset\); they are implemented via the base class deepinv.optim.Bregman.

In addition to the methods inherited from deepinv.optim.Potential (gradient \(\nabla \phi\), conjugate \(\phi^*\) and its gradient \(\nabla \phi^*\)), this class provides the Bregman divergence \(D(x,y) = \phi(x) - \phi(y) - \nabla \phi(y)^{\top} (x - y)\), which makes it well suited for Mirror Descent; see the sketch after Table 7.

Table 7 Bregman potentials#

| Class | Bregman potential \(\phi(x)\) |
|---|---|
| deepinv.optim.bregman.BregmanL2 | \(\|x\|_2^2\) |
| deepinv.optim.bregman.BurgEntropy | \(- \sum_i \log x_i\) |
| deepinv.optim.bregman.NegEntropy | \(\sum_i x_i \log x_i\) |
| deepinv.optim.bregman.Bregman_ICNN | Convolutional Input Convex NN |
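
To illustrate how a Bregman potential enters Mirror Descent, the following sketch computes one update \(x_{k+1} = \nabla \phi^*(\nabla \phi(x_k) - \gamma v_k)\) by hand for the negative entropy \(\phi(x) = \sum_i x_i \log x_i\) (cf. NegEntropy above), for which \(\nabla \phi(x) = \log x + 1\) and \(\nabla \phi^*(y) = \exp(y-1)\). It uses plain PyTorch rather than the deepinv.optim.bregman classes:

>>> import torch
>>>
>>> x_k = torch.tensor([0.2, 0.3, 0.5])    # current (positive) iterate
>>> v_k = torch.tensor([1.0, -1.0, 0.0])   # gradient of the objective at x_k
>>> gamma = 0.1                            # step size
>>> # Mirror descent step: map to the dual space, take a gradient step, map back
>>> x_next = torch.exp((torch.log(x_k) + 1) - gamma * v_k - 1)
>>> # For this potential the update reduces to a multiplicative step
>>> torch.allclose(x_next, x_k * torch.exp(-gamma * v_k))
True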

Data Fidelity#

The base class deepinv.optim.DataFidelity implements data-fidelity terms \(\distance{A(x)}{y}\), where \(A\) is the forward operator, \(x\in\xset\) is a variable, \(y\in\yset\) is the data, and \(d\) is a distance function from the class deepinv.optim.Distance, itself a child class of deepinv.optim.Potential.

This data-fidelity class thus comes with useful methods, such as \(\operatorname{prox}_{\distancename\circ A}\) and \(\nabla (\distancename \circ A)\) (among others) which are used by most optimization algorithms.
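
For example, with the \(\ell_2\) data-fidelity term and the inpainting operator, variable and data (x, y, physics) defined in the first example of this page, these operators can be evaluated as follows. This is a sketch: the grad call matches the example above, while the prox signature and its gamma keyword are assumed to mirror those of priors (see the class reference):

>>> data_fidelity = dinv.optim.L2()
>>> grad_f = data_fidelity.grad(x, y, physics)              # gradient of d(A(.), y) at x
>>> prox_f = data_fidelity.prox(x, y, physics, gamma=1.0)   # prox of gamma * d(A(.), y) at x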

Table 8 Data Fidelity Overview#

| Data Fidelity | \(d(A(x), y)\) |
|---|---|
| deepinv.optim.L1 | \(\|A(x) - y\|_1\) |
| deepinv.optim.L2 | \(\|A(x) - y\|_2^2\) |
| deepinv.optim.IndicatorL2 | Indicator function of \(\|A(x) - y\|_2 \leq \epsilon\) |
| deepinv.optim.PoissonLikelihood | \(-y^{\top} \log(A(x)+\beta)+1^{\top}A(x)\) |
| deepinv.optim.LogPoissonLikelihood | \(N_0 (1^{\top} \exp(-\mu A(x))+ \mu \exp(-\mu y)^{\top}A(x))\) |
| deepinv.optim.AmplitudeLoss | \(\sum_{i=1}^{m}(\sqrt{|b_i^{\top} x|^2}-\sqrt{y_i})^2\) |

Priors#

Prior functions are defined as \(\reg{x}\) where \(x\in\xset\) is a variable and where \(\regname\) is a function.

The base class is deepinv.optim.Prior, which is implemented as a child class of deepinv.optim.Potential and therefore comes with methods for computing operators such as \(\operatorname{prox}_{\regname}\) and \(\nabla \regname\). This base class can be used to implement user-defined differentiable priors (e.g., Tikhonov regularization) as well as implicit priors (e.g., plug-and-play methods).
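
For instance, a user-defined explicit prior can be written by subclassing deepinv.optim.Prior. The sketch below assumes that the function to override is g and that the base class then provides generic gradient and proximal operators (check the class reference for the exact hook):

>>> import torch
>>> import deepinv as dinv
>>>
>>> class MyL1Prior(dinv.optim.Prior):  # illustrative custom prior
...     def g(self, x, *args, **kwargs):
...         # l1 norm of x, one value per batch element
...         return x.view(x.shape[0], -1).abs().sum(dim=-1)
...
>>> my_prior = MyL1Prior()
>>> val = my_prior(torch.randn(1, 1, 2, 2))   # evaluates g(x)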

Table 9 Priors Overview#

| Prior | \(\reg{x}\) | Explicit \(\regname\) |
|---|---|---|
| deepinv.optim.PnP | \(\operatorname{prox}_{\gamma \regname}(x) = \operatorname{D}_{\sigma}(x)\) | No |
| deepinv.optim.RED | \(\nabla \reg{x} = x - \operatorname{D}_{\sigma}(x)\) | No |
| deepinv.optim.ScorePrior | \(\nabla \reg{x}=\left(x-\operatorname{D}_{\sigma}(x)\right)/\sigma^2\) | No |
| deepinv.optim.Tikhonov | \(\reg{x}=\|x\|_2^2\) | Yes |
| deepinv.optim.L1Prior | \(\reg{x}=\|x\|_1\) | Yes |
| deepinv.optim.WaveletPrior | \(\reg{x} = \|\Psi x\|_{p}\) where \(\Psi\) is a wavelet transform | Yes |
| deepinv.optim.TVPrior | \(\reg{x}=\|Dx\|_{1,2}\) where \(D\) is a finite difference operator | Yes |
| deepinv.optim.PatchPrior | \(\reg{x} = \sum_i h(P_i x)\) for some prior \(h(x)\) on the space of patches | Yes |
| deepinv.optim.PatchNR | Patch prior via normalizing flows | Yes |
| deepinv.optim.L12Prior | \(\reg{x} = \sum_i \| x_i \|_2\) | Yes |

Predefined Algorithms#

Optimization algorithms inherit from the base class deepinv.optim.BaseOptim, which serves as a common interface for all predefined optimization algorithms.

The function deepinv.optim.optim_builder() returns an instance of deepinv.optim.BaseOptim with the optimization algorithm of choice, either a predefined one ("PGD", "ADMM", "HQS", etc.) or a user-defined one. For example, we can create the same proximal gradient algorithm as the one at the beginning of this page in a single line of code:

>>> model = dinv.optim.optim_builder(iteration="PGD", prior=prior, data_fidelity=data_fidelity,
...                             params_algo={"stepsize": stepsize, "lambda": lambd}, max_iter=max_iter)
>>> x_hat = model(y, physics)
>>> dinv.utils.plot([x, y, x_hat], ["signal", "measurement", "estimate"], rescale_mode='clip')

Some predefined optimizers are provided:

| Algorithm | Iteration | Parameters |
|---|---|---|
| Gradient Descent (GD) | \(v_{k} = \nabla f(x_k) + \lambda \nabla \reg{x_k}\); \(x_{k+1} = x_k-\gamma v_{k}\) | "stepsize", "lambda", "g_param" |
| Proximal Gradient Descent (PGD) | \(u_{k} = x_k - \gamma \nabla f(x_k)\); \(x_{k+1} = \operatorname{prox}_{\gamma \lambda \regname}(u_k)\) | "stepsize", "lambda", "g_param" |
| Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) | \(u_{k} = z_k - \gamma \nabla f(z_k)\); \(x_{k+1} = \operatorname{prox}_{\gamma \lambda \regname}(u_k)\); \(z_{k+1} = x_{k+1} + \alpha_k (x_{k+1} - x_k)\) | "stepsize", "lambda", "g_param" |
| Half-Quadratic Splitting (HQS) | \(u_{k} = \operatorname{prox}_{\gamma f}(x_k)\); \(x_{k+1} = \operatorname{prox}_{\sigma \lambda \regname}(u_k)\) | "gamma", "lambda", "g_param" |
| Alternating Direction Method of Multipliers (ADMM) | \(u_{k+1} = \operatorname{prox}_{\gamma f}(x_k - z_k)\); \(x_{k+1} = \operatorname{prox}_{\gamma \lambda \regname}(u_{k+1} + z_k)\); \(z_{k+1} = z_k + \beta (u_{k+1} - x_{k+1})\) | "gamma", "lambda", "g_param", "beta" |
| Douglas-Rachford Splitting (DRS) | \(u_{k+1} = \operatorname{prox}_{\gamma f}(z_k)\); \(x_{k+1} = \operatorname{prox}_{\gamma \lambda \regname}(2u_{k+1}-z_k)\); \(z_{k+1} = z_k + \beta (x_{k+1} - u_{k+1})\) | "stepsize", "lambda", "g_param", "beta" |
| Chambolle-Pock (CP) | \(u_{k+1} = \operatorname{prox}_{\sigma F^*}(u_k + \sigma K z_k)\); \(x_{k+1} = \operatorname{prox}_{\tau \lambda G}(x_k-\tau K^\top u_{k+1})\); \(z_{k+1} = x_{k+1} + \beta(x_{k+1}-x_k)\) | "gamma", "lambda", "g_param", "beta", "stepsize_dual" |
| Mirror Descent (MD) | \(v_{k} = \nabla f(x_k) + \lambda \nabla \reg{x_k}\); \(x_{k+1} = \nabla h^*(\nabla h(x_k) - \gamma v_{k})\) | "stepsize", "lambda", "g_param", convex potential \(h\) |
| Spectral Methods (SM) | \(M = \conj{B} \text{diag}(T(y)) B + \lambda I\) | (phase retrieval only) |

Parameters#

The parameters of the generic optimization algorithms, such as the stepsize, the regularization parameter, or the standard deviation of the denoiser prior, are stored in a dictionary "params_algo", whose typical entries are:

| Key | Meaning | Recommended Values |
|---|---|---|
| "stepsize" | Step size of the optimization algorithm. | Should be positive. Depending on the algorithm, it needs to be small enough for convergence; e.g., for PGD with g_first=False it should be smaller than \(1/\|A\|_2^2\). |
| "lambda" | Regularization parameter \(\lambda\) multiplying the regularization term. | Should be positive. |
| "g_param" | Optional parameter \(\sigma\) on which \(\regname\) depends. For priors based on denoisers, it corresponds to the noise level. | Should be positive. |
| "beta" | Relaxation parameter used in ADMM, DRS and CP. | Should be positive. |
| "stepsize_dual" | Step size of the dual update in the primal-dual algorithm (only required by CP). | Should be positive. |

Each value of the dictionary can be either an iterable (i.e., a list with a distinct value for each iteration) or a single float (same value for each iteration).
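
For instance, a per-iteration stepsize schedule can be passed as a list. The sketch below reuses the quantities defined in the first example of this page (prior, data_fidelity, stepsize, lambd, max_iter, y, physics); other keys from the table above can be added in the same way:

>>> params_algo = {
...     "stepsize": [stepsize / (k + 1) for k in range(max_iter)],  # one stepsize per iteration
...     "lambda": lambd,                                            # same value at every iteration
... }
>>> model = dinv.optim.optim_builder(iteration="PGD", prior=prior, data_fidelity=data_fidelity,
...                                  params_algo=params_algo, max_iter=max_iter)
>>> x_hat = model(y, physics)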

Iterators#

An optim iterator is an object that implements a fixed-point iteration for minimizing the sum of two functions \(F = \datafidname + \lambda \regname\), where \(\datafidname\) is a data-fidelity term that involves an instance of deepinv.physics.Physics and \(\regname\) is a regularizer. The fixed-point iteration takes the form

\[\qquad (x_{k+1}, z_{k+1}) = \operatorname{FixedPoint}(x_k, z_k, \datafidname, \regname, A, y, ...),\]

where \(x\) is a variable converging to the solution of the minimization problem, and \(z\) is an additional variable that may be required in the computation of the fixed point operator.

The implementation of the fixed-point algorithm in deepinv.optim, following standard optimization theory, is split into two steps:

\[\begin{split}z_{k+1} = \operatorname{step}_{\datafidname}(x_k, z_k, y, A, ...)\\ x_{k+1} = \operatorname{step}_{\regname}(x_k, z_k, y, A, ...)\end{split}\]

where \(\operatorname{step}_{\datafidname}\) and \(\operatorname{step}_{\regname}\) are gradient and/or proximal steps on \(\datafidname\) and \(\regname\), which use additional inputs such as \(A\) and \(y\), as well as stepsizes, relaxation parameters, etc.
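
For PGD, for instance, \(\operatorname{step}_{\datafidname}\) is a gradient step on the data-fidelity term and \(\operatorname{step}_{\regname}\) is a proximal step on the regularizer. Schematically, with the objects from the first example of this page (this is a conceptual sketch of the two steps, not the actual iterator API):

>>> z_k = x_k - stepsize * data_fidelity.grad(x_k, y, physics)   # step on the data-fidelity term
>>> x_k = prior.prox(z_k, gamma=lambd * stepsize)                # step on the regularization term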

The deepinv.optim.optim_iterators.fStep and deepinv.optim.optim_iterators.gStep classes precisely implement these steps.

Utils#

We provide some useful routines for optimization algorithms.