Implementing DPS#

In this tutorial, we will go over the steps in the Diffusion Posterior Sampling (DPS) algorithm introduced in Chung et al.[1]. The full algorithm is implemented in deepinv.sampling.DPS.

Installing dependencies#

Let us import the relevant packages, and load a sample image of size 64 x 64. This will be used as our ground truth image.

Note

We work with an image of size 64 x 64 to reduce the computational time of this example. The DiffUNet we use in the algorithm works best with images of size 256 x 256.

import torch

import deepinv as dinv
from deepinv.utils.plotting import plot
from deepinv.optim.data_fidelity import L2
from deepinv.utils.demo import load_example
from tqdm import tqdm  # to visualize progress

device = dinv.utils.get_freer_gpu() if torch.cuda.is_available() else "cpu"

x_true = load_example("butterfly.png", img_size=64).to(device)
x = x_true.clone()

In this tutorial we consider random inpainting as the inverse problem, where the forward operator is implemented in deepinv.physics.Inpainting. In the example that we use, 90% of the pixels will be masked out randomly, and we will additionally have Additive White Gaussian Noise (AWGN) of standard deviation 12.75/255.

sigma = 12.75 / 255.0  # noise level

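physics = dinv.physics.Inpainting(
    img_size=(3, x.shape[-2], x.shape[-1]),
    mask=0.1,  # keep 10% of the pixels
    pixelwise=True,
    noise_model=dinv.physics.GaussianNoise(sigma=sigma),  # AWGN described above
    device=device,
)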

y = physics(x_true)

imgs = [y, x_true]
plot(
    imgs,
    titles=["measurement", "groundtruth"],
)
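To make the forward operator concrete, here is a minimal sketch of pixelwise random inpainting followed by Gaussian noise, written in plain PyTorch. It is an illustration under stated assumptions and not the code of deepinv.physics.Inpainting: the helper name `random_inpaint` and its arguments are hypothetical, the mask is drawn as one Bernoulli variable per pixel shared across channels, and noise is applied only to the observed pixels.

```python
import torch


def random_inpaint(x, keep_prob=0.1, sigma=0.05, generator=None):
    """Hypothetical helper: keep each pixel with probability keep_prob, add noise.

    x has shape (B, C, H, W). One Bernoulli(keep_prob) draw per pixel is
    shared by all channels, so a pixel is either fully kept or fully dropped.
    """
    b, _, h, w = x.shape
    mask = (torch.rand(b, 1, h, w, generator=generator) < keep_prob).to(x.dtype)
    noise = sigma * torch.randn(x.shape, generator=generator, dtype=x.dtype)
    # Assumption of this sketch: noise only affects observed pixels,
    # so dropped pixels are exactly zero in the measurement.
    y = mask * (x + noise)
    return y, mask


gen = torch.Generator().manual_seed(0)
x_demo = torch.rand(1, 3, 64, 64, generator=gen)
y_demo, mask_demo = random_inpaint(
    x_demo, keep_prob=0.1, sigma=12.75 / 255.0, generator=gen
)
kept_fraction = mask_demo.mean().item()
print(f"fraction of observed pixels: {kept_fraction:.3f}")
```

With `keep_prob=0.1`, roughly 10% of the pixels survive, which matches the 90% masking rate used in this tutorial.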

Diffusion model loading#

We will use a pre-trained diffusion model that was also used for the DiffPIR algorithm, namely the one trained on the FFHQ 256x256 dataset. This means that the diffusion model was trained on human face images, which are very different from the image that we consider in our example. Nevertheless, we will see later on that DPS generalizes sufficiently well even in this case.

model = dinv.models.DiffUNet(large_model=False).to(device)


Define diffusion schedule#

We will use the standard linear diffusion noise schedule. Once \(\beta_t\) is defined to follow a linear schedule that interpolates between \(\beta_{\rm min}\) and \(\beta_{\rm max}\), we define \(\alpha_t := 1 - \beta_t\) and \(\bar\alpha_t := \prod_{j=1}^t \alpha_j\). The following equations will also be useful later on (we always assume that \(\mathbf{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) hereafter):

\[ \begin{align}\begin{aligned}\mathbf{x}_t = \sqrt{1 - \beta_t}\mathbf{x}_{t-1} + \sqrt{\beta_t}\mathbf{\epsilon}\\\mathbf{x}_t = \sqrt{\bar\alpha_t}\mathbf{x}_0 + \sqrt{1 - \bar\alpha_t}\mathbf{\epsilon}\end{aligned}\end{align} \]

where the second identity follows from applying the first one recursively, using the reparametrization trick.

num_train_timesteps = 1000  # Number of timesteps used during training


betas = torch.linspace(1e-4, 2e-2, num_train_timesteps).to(device)
# note: `alphas` stores the cumulative products, i.e. \bar\alpha_t in the text
alphas = (1 - betas).cumprod(dim=0)
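As a sanity check of the two equations above, the following sketch iterates the one-step rule on the coefficient of \(\mathbf{x}_0\) and on the noise variance, and compares the result with the closed form given by \(\bar\alpha_t\). It uses only the standard library, with the same schedule endpoints as in this tutorial; the variable names are illustrative.

```python
import math

num_train_timesteps = 1000
beta_min, beta_max = 1e-4, 2e-2

# Linear schedule with the same endpoints as the torch.linspace call above.
betas = [
    beta_min + (beta_max - beta_min) * i / (num_train_timesteps - 1)
    for i in range(num_train_timesteps)
]

# Cumulative product of alpha_j = 1 - beta_j.
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

# Iterate the one-step rule x_t = sqrt(1 - beta_t) x_{t-1} + sqrt(beta_t) eps
# on the coefficient of x_0 and on the variance of the accumulated noise.
mean_coeff, variance = 1.0, 0.0
for beta in betas:
    mean_coeff *= math.sqrt(1.0 - beta)
    variance = (1.0 - beta) * variance + beta

print(f"sqrt(alpha_bar_T): {math.sqrt(alpha_bars[-1]):.6f} vs {mean_coeff:.6f}")
print(f"1 - alpha_bar_T:   {1.0 - alpha_bars[-1]:.6f} vs {variance:.6f}")
```

Both pairs of numbers agree up to floating-point error, and the final noise variance is very close to 1, so \(\mathbf{x}_T\) is almost pure Gaussian noise.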

The DPS algorithm#

Now that the inverse problem is defined, we can apply the DPS algorithm to solve it. DPS is a diffusion algorithm that alternates between a denoising step, a gradient step and a reverse diffusion sampling step. The algorithm reads as follows, for \(t\) decreasing from \(T\) to \(1\):

\[\begin{split}\begin{equation*} \begin{aligned} \widehat{\mathbf{x}}_{0} (\mathbf{x}_t) &= \operatorname{D}_{\sigma_t}(\mathbf{x}_t), \quad \sigma_t = \sqrt{1-\overline{\alpha}_t}/\sqrt{\overline{\alpha}_t} \\ \mathbf{g}_t &= \nabla_{\mathbf{x}_t} \log p( \mathbf{y} | \widehat{\mathbf{x}}_{0}(\mathbf{x}_t) ) \\ \mathbf{\varepsilon}_t &\sim \mathcal{N}(0, \mathbf{I}) \\ \mathbf{x}_{t-1} &= a_t \,\, \mathbf{x}_t + b_t \, \, \widehat{\mathbf{x}}_0 + \tilde{\sigma}_t \, \, \mathbf{\varepsilon}_t + \mathbf{g}_t, \end{aligned} \end{equation*}\end{split}\]

where \(\operatorname{D}_{\sigma}(\cdot)\) is a denoising network for noise level \(\sigma\), \(\eta\) is a hyperparameter in [0, 1], and the constants \(\tilde{\sigma}_t, a_t, b_t\) are defined as

\[\begin{split}\begin{equation*} \begin{aligned} \tilde{\sigma}_t &= \eta \sqrt{ (1 - \frac{\overline{\alpha}_t}{\overline{\alpha}_{t-1}}) \frac{1 - \overline{\alpha}_{t-1}}{1 - \overline{\alpha}_t}} \\ a_t &= \sqrt{1 - \overline{\alpha}_{t-1} - \tilde{\sigma}_t^2}/\sqrt{1-\overline{\alpha}_t} \\ b_t &= \sqrt{\overline{\alpha}_{t-1}} - \sqrt{1 - \overline{\alpha}_{t-1} - \tilde{\sigma}_t^2} \frac{\sqrt{\overline{\alpha}_{t}}}{\sqrt{1 - \overline{\alpha}_{t}}} \end{aligned} \end{equation*}\end{split}\]
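The role of these constants can be checked numerically. If \(\mathbf{x}_t = \sqrt{\overline{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\overline{\alpha}_t}\mathbf{\epsilon}\) and the denoiser returned \(\mathbf{x}_0\) exactly, then \(\mathbf{x}_{t-1}\) should have signal scale \(\sqrt{\overline{\alpha}_{t-1}}\) and noise variance \(1-\overline{\alpha}_{t-1}\) for any \(\eta\). The sketch below verifies this with the standard library; the function name `ddim_coefficients` and the numerical values are illustrative assumptions.

```python
import math


def ddim_coefficients(abar_t, abar_prev, eta):
    """Return (sigma_tilde, a_t, b_t) following the definitions above.

    abar_t and abar_prev stand for alpha-bar at steps t and t-1.
    """
    sigma_tilde = eta * math.sqrt(
        (1.0 - abar_t / abar_prev) * (1.0 - abar_prev) / (1.0 - abar_t)
    )
    c = math.sqrt(1.0 - abar_prev - sigma_tilde**2)
    a_t = c / math.sqrt(1.0 - abar_t)
    b_t = math.sqrt(abar_prev) - c * math.sqrt(abar_t) / math.sqrt(1.0 - abar_t)
    return sigma_tilde, a_t, b_t


abar_t, abar_prev = 0.50, 0.55  # arbitrary illustrative values
for eta in (0.0, 0.5, 1.0):
    sigma_tilde, a_t, b_t = ddim_coefficients(abar_t, abar_prev, eta)
    # Coefficient of x_0 in x_{t-1} when the denoiser is exact.
    signal_scale = a_t * math.sqrt(abar_t) + b_t
    # Variance of the noise in x_{t-1}: old noise rescaled plus fresh noise.
    noise_variance = a_t**2 * (1.0 - abar_t) + sigma_tilde**2
    print(eta, signal_scale, noise_variance)
```

For every value of \(\eta\) the printed signal scale equals \(\sqrt{0.55}\) and the noise variance equals \(0.45\), up to rounding: \(\eta\) only changes how much of the noise is reused from \(\mathbf{x}_t\) and how much is freshly sampled.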

Denoising step#

The first step of DPS consists of applying a denoiser function to the current image \(\mathbf{x}_t\), with standard deviation \(\sigma_t = \sqrt{1 - \overline{\alpha}_t}/\sqrt{\overline{\alpha}_t}\).

To illustrate this step, we sample \(\mathbf{x}_t \sim q(\mathbf{x}_t|\mathbf{x}_0)\) from the ground-truth image, and then compute the posterior mean with the denoiser.

t = 200  # choose some arbitrary timestep
at = alphas[t]
sigmat = (1 - at).sqrt() / at.sqrt()

x0 = x_true
xt = x0 + sigmat * torch.randn_like(x0)

# apply denoiser
x0_t = model(xt, sigmat)

# Visualize
imgs = [x0, xt, x0_t]
plot(
    imgs,
    titles=["ground-truth", "noisy", "posterior mean"],
)
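The term "posterior mean" can be made explicit on a toy case where it has a closed form. The sketch below assumes a scalar Gaussian prior \(x_0 \sim \mathcal{N}(0, s^2)\), which is an illustrative assumption and not the image prior learned by the DiffUNet. In that case \(x_t = x_0 + \sigma\epsilon\) is distributed as \(\mathcal{N}(0, s^2 + \sigma^2)\), and Tweedie's formula \(\mathbb{E}[x_0|x_t] = x_t + \sigma^2 \nabla_{x_t} \log p(x_t)\) can be compared with the standard Gaussian conditioning result.

```python
def gaussian_score(xt, prior_std, sigma):
    """Score of p(x_t) = N(0, prior_std^2 + sigma^2), i.e. d/dx_t log p(x_t)."""
    return -xt / (prior_std**2 + sigma**2)


def tweedie_posterior_mean(xt, prior_std, sigma):
    """E[x_0 | x_t] via Tweedie's formula: x_t + sigma^2 * score(x_t)."""
    return xt + sigma**2 * gaussian_score(xt, prior_std, sigma)


def closed_form_posterior_mean(xt, prior_std, sigma):
    """Gaussian conditioning result, used here as a reference."""
    return prior_std**2 / (prior_std**2 + sigma**2) * xt


xt, prior_std = 1.2, 1.0
for sigma in (0.1, 1.0, 10.0):
    print(sigma, tweedie_posterior_mean(xt, prior_std, sigma))
```

For small \(\sigma\) the posterior mean stays close to the noisy sample, while for large \(\sigma\) it shrinks toward the prior mean. A learned denoiser plays the same role for the much more complex distribution of natural images.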

DPS approximation#

In order to perform gradient-based posterior sampling with diffusion models, we have to be able to compute \(\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t|\mathbf{y})\). Applying Bayes rule, we have

\[\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t|\mathbf{y}) = \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t) + \nabla_{\mathbf{x}_t} \log p(\mathbf{y}|\mathbf{x}_t)\]

For the former term, we can simply plug in our estimated score function, as in Tweedie's formula. As the latter term is intractable, DPS proposes the following approximation (for details, see Theorem 1 of Chung et al.[1]):

\[\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t|\mathbf{y}) \approx \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t) + \nabla_{\mathbf{x}_t} \log p(\mathbf{y}|\widehat{\mathbf{x}}_{0}(\mathbf{x}_t))\]

Remarkably, we can now compute the latter term when we have Gaussian noise, as

\[\log p(\mathbf{y}|\widehat{\mathbf{x}}_0(\mathbf{x}_t)) = -\frac{\|\mathbf{y} - A\widehat{\mathbf{x}}_0(\mathbf{x}_t)\|_2^2}{2\sigma_y^2} + \mathrm{const}.\]

Moreover, taking the gradient w.r.t. \(\mathbf{x}_t\) can be performed through automatic differentiation. Let's see how this can be done in PyTorch. Note that to take the gradient w.r.t. a tensor, we first have to enable gradient tracking for it by calling tensor.requires_grad_().

Note

The DPS algorithm assumes that the images are in the range [-1, 1], whereas standard denoisers usually output images in the range [0, 1]. This is why we rescale the images before applying the steps.
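The rescaling also explains why the noise level is halved when the denoiser is called on [0, 1] images. The following sketch uses scalar values and hypothetical helper names (`to_unit_range`, `to_symmetric_range`) to show that the affine map from [-1, 1] to [0, 1] halves every deviation from the clean value.

```python
def to_unit_range(x):
    """Map values from [-1, 1] to [0, 1]."""
    return x / 2.0 + 0.5


def to_symmetric_range(x):
    """Map values from [0, 1] to [-1, 1]."""
    return x * 2.0 - 1.0


# One clean pixel in [-1, 1], a noise level, and one fixed noise draw.
clean, sigma, eps = 0.3, 0.5, -1.7
noisy = clean + sigma * eps

# After the map to [0, 1], the deviation from the clean value is (sigma / 2) * eps,
# so noise of standard deviation sigma becomes noise of standard deviation sigma / 2.
deviation_unit = to_unit_range(noisy) - to_unit_range(clean)
print(deviation_unit, (sigma / 2.0) * eps)
```

This is why a noise level of `sigma_cur` in the [-1, 1] convention is passed to the denoiser as `sigma_cur / 2`.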

x0 = x_true * 2.0 - 1.0  # [0, 1] -> [-1, 1]

data_fidelity = L2()

# xt ~ q(xt|x0)
t = 200  # choose some arbitrary timestep
at = alphas[t]
sigma_cur = (1 - at).sqrt() / at.sqrt()
xt = x0 + sigma_cur * torch.randn_like(x0)

# DPS
with torch.enable_grad():
    # Turn on gradient
    xt.requires_grad_()

    # normalize to [0, 1], denoise, and rescale to [-1, 1]
    x0_t = model(xt / 2 + 0.5, sigma_cur / 2) * 2 - 1
    # l2 norm of the residual, used as a proxy for the negative log-likelihood
    ll = data_fidelity(x0_t, y, physics).sqrt().sum()
    # Take gradient w.r.t. xt
    grad_ll = torch.autograd.grad(outputs=ll, inputs=xt)[0]

# Visualize
imgs = [x0, xt, x0_t, grad_ll]
plot(
    imgs,
    titles=["groundtruth", "noisy", "posterior mean", "gradient"],
)
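To see what automatic differentiation computes in this pattern, the sketch below replaces the DiffUNet with a toy linear "denoiser" \(D(\mathbf{x}) = c\,\mathbf{x}\) and the inpainting operator with a fixed binary mask. Both are assumptions made for illustration, chosen so that the gradient of the residual norm has a closed form against which the autograd result can be compared.

```python
import torch

torch.manual_seed(0)

shrink = 0.8                                   # toy denoiser D(x) = shrink * x (assumption)
mask = (torch.rand(1, 3, 8, 8) < 0.5).float()  # toy inpainting mask
y = mask * torch.rand(1, 3, 8, 8)              # toy measurement

xt = torch.randn(1, 3, 8, 8, requires_grad=True)

# Same pattern as above: denoise, compute the residual norm, backpropagate to xt.
x0_hat = shrink * xt
residual = y - mask * x0_hat
loss = residual.pow(2).sum().sqrt()
grad_autograd = torch.autograd.grad(outputs=loss, inputs=xt)[0]

# Closed-form gradient of ||y - mask * shrink * xt||_2 with respect to xt.
grad_manual = -shrink * mask * residual.detach() / loss.detach()
max_gap = (grad_autograd - grad_manual).abs().max().item()
print(f"largest difference between autograd and closed form: {max_gap:.2e}")
```

With a neural denoiser no closed form is available, but the call to `torch.autograd.grad` is the same: the chain rule is applied through the network.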

Figure: groundtruth, noisy, posterior mean, gradient.

DPS Algorithm#

Having covered all the key components of DPS, we are now ready to define the algorithm. For every denoising timestep, the algorithm iterates the following steps:

1. Get \(\widehat{\mathbf{x}}_0(\mathbf{x}_t)\) using the denoiser network.
2. Compute \(\nabla_{\mathbf{x}_t} \log p(\mathbf{y}|\widehat{\mathbf{x}}_0(\mathbf{x}_t))\) through backpropagation.
3. Perform reverse diffusion sampling with DDPM or DDIM, corresponding to an update with \(\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)\).
4. Take a gradient step with \(\nabla_{\mathbf{x}_t} \log p(\mathbf{y}|\widehat{\mathbf{x}}_0(\mathbf{x}_t))\).

There are two caveats here. First, the original work used DDPM ancestral sampling. Since the DDIM sampler of Song et al.[2] generalizes DDPM, in the sense that it recovers DDPM when \(\eta = 1.0\), we consider DDIM sampling here. The \(\eta\) parameter can be chosen freely, but since we will consider 1000 neural function evaluations (NFEs), it is advisable to keep \(\eta = 1.0\). Second, when taking the log-likelihood gradient step, the gradient is reweighted: the actual implementation uses a static step size \(\rho\) times the gradient of the \(\ell_2\) norm of the residual:

\[\nabla_{\mathbf{x}_t} \log p(\mathbf{y}|\widehat{\mathbf{x}}_{0}(\mathbf{x}_t)) \simeq -\rho \nabla_{\mathbf{x}_t} \|\mathbf{y} - \mathbf{A}\widehat{\mathbf{x}}_{0}(\mathbf{x}_t)\|_2\]
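The first caveat can be checked numerically. The sketch below, with illustrative values for \(\alpha_t\) and \(\overline{\alpha}_{t-1}\), computes the DDIM constants \(\tilde{\sigma}_t, a_t, b_t\) for \(\eta = 1\) and compares them with the mean coefficients and variance of the DDPM posterior \(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)\), written here in their usual textbook form.

```python
import math

alpha_t = 0.98                 # one-step alpha_t = 1 - beta_t (illustrative value)
abar_prev = 0.60               # alpha-bar at step t-1 (illustrative value)
abar_t = alpha_t * abar_prev   # alpha-bar at step t
beta_t = 1.0 - alpha_t

# DDIM constants with eta = 1, following the definitions of sigma_tilde, a_t, b_t.
eta = 1.0
sigma_tilde = eta * math.sqrt(
    (1.0 - abar_t / abar_prev) * (1.0 - abar_prev) / (1.0 - abar_t)
)
c = math.sqrt(1.0 - abar_prev - sigma_tilde**2)
a_t = c / math.sqrt(1.0 - abar_t)
b_t = math.sqrt(abar_prev) - c * math.sqrt(abar_t) / math.sqrt(1.0 - abar_t)

# DDPM ancestral sampling: coefficients of x_t and x_0 in the posterior mean,
# and the posterior variance.
ddpm_xt_coeff = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
ddpm_x0_coeff = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
ddpm_variance = beta_t * (1.0 - abar_prev) / (1.0 - abar_t)

print(a_t, ddpm_xt_coeff)
print(b_t, ddpm_x0_coeff)
print(sigma_tilde**2, ddpm_variance)
```

Each pair of printed values coincides up to rounding, which is the sense in which DDIM with \(\eta = 1\) recovers DDPM.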

With these in mind, let us solve the inverse problem with DPS!

Note

We only use 200 steps to reduce the computational time of this example. As suggested by the authors of DPS, the algorithm works best with num_steps = 1000.
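The subsampling of timesteps can be inspected in isolation. The sketch below uses only the standard library and the same schedule values as this tutorial. It lists the timesteps visited with num_steps = 200 and illustrates why, when several training steps are skipped at once, the effective \(\beta\) is computed from a ratio of cumulative products rather than read from the training schedule.

```python
num_train_timesteps = 1000
num_steps = 200
skip = num_train_timesteps // num_steps

# Timesteps visited by the sampler, from the noisiest to the cleanest.
timesteps = list(reversed(range(0, num_train_timesteps, skip)))

# Linear beta schedule and its cumulative product alpha-bar.
betas = [
    1e-4 + (2e-2 - 1e-4) * i / (num_train_timesteps - 1)
    for i in range(num_train_timesteps)
]
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

# Jumping from t to t - skip composes `skip` training steps, so the effective
# beta is 1 - alpha_bar[t] / alpha_bar[t - skip], which is larger than betas[t].
t = timesteps[0]
effective_beta = 1.0 - alpha_bars[t] / alpha_bars[t - skip]
print(len(timesteps), timesteps[0], timesteps[-1])
print(betas[t], effective_beta)
```

The effective \(\beta\) of a jump is several times larger than the single-step value, which is what the expression `bt = 1 - at / at_next` in the sampling loop accounts for.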

num_steps = 200

skip = num_train_timesteps // num_steps

batch_size = 1
eta = 1.0  # DDPM scheme; use eta < 1 for DDIM


# measurement
x0 = x_true * 2.0 - 1.0
# x0 = x_true.clone()
y = physics(x0.to(device))

# initial sample from x_T
x = torch.randn_like(x0)

xs = [x]
x0_preds = []

for t in tqdm(reversed(range(0, num_train_timesteps, skip))):
    at = alphas[t]
    at_next = alphas[t - skip] if t - skip >= 0 else torch.tensor(1.0, device=device)
    # we cannot use bt = betas[t] if skip > 1:
    bt = 1 - at / at_next

    xt = xs[-1].to(device)

    with torch.enable_grad():
        xt.requires_grad_()

        # 1. denoising step
        aux_x = xt / (2 * at.sqrt()) + 0.5  # renormalize in [0, 1]
        sigma_cur = (1 - at).sqrt() / at.sqrt()  # sigma_t

        x0_t = 2 * model(aux_x, sigma_cur / 2) - 1
        x0_t = torch.clip(x0_t, -1.0, 1.0)  # optional

        # 2. likelihood gradient approximation
        l2_loss = data_fidelity(x0_t, y, physics).sqrt().sum()

    norm_grad = torch.autograd.grad(outputs=l2_loss, inputs=xt)[0]
    norm_grad = norm_grad.detach()

    sigma_tilde = (bt * (1 - at_next) / (1 - at)).sqrt() * eta
    c2 = ((1 - at_next) - sigma_tilde**2).sqrt()

    # 3. noise step
    epsilon = torch.randn_like(xt)

    # 4. DDIM(PM) step
    xt_next = (
        (at_next.sqrt() - c2 * at.sqrt() / (1 - at).sqrt()) * x0_t
        + sigma_tilde * epsilon
        + c2 * xt / (1 - at).sqrt()
        - norm_grad
    )
    # detach before storing so that the autograd graph of each step can be freed
    x0_preds.append(x0_t.detach().to("cpu"))
    xs.append(xt_next.detach().to("cpu"))

recon = xs[-1]

# plot the results
x = recon / 2 + 0.5
imgs = [y, x, x_true]
plot(imgs, titles=["measurement", "model output", "groundtruth"])
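The structure of the loop can also be exercised without downloading any weights. The sketch below runs the same four steps on a toy problem, under assumptions that are not part of the original example: the signal is a 16-dimensional vector with prior \(\mathbf{x}_0 \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\), for which the posterior mean \(\mathbb{E}[\mathbf{x}_0|\mathbf{x}_t] = \sqrt{\overline{\alpha}_t}\,\mathbf{x}_t\) is known in closed form, the forward operator is a fixed mask that observes every other entry without noise, and the step size `rho` is set to 1.

```python
import torch

torch.manual_seed(0)

# Toy problem (assumptions): x_0 ~ N(0, I) in 16 dimensions, half the entries observed.
dim = 16
mask = torch.zeros(dim)
mask[::2] = 1.0
x_true = torch.randn(dim)
y = mask * x_true

num_train_timesteps, num_steps, eta, rho = 1000, 200, 1.0, 1.0
skip = num_train_timesteps // num_steps
betas = torch.linspace(1e-4, 2e-2, num_train_timesteps)
alpha_bars = (1 - betas).cumprod(dim=0)


def toy_denoiser(xt, abar):
    # For x_0 ~ N(0, I) and x_t = sqrt(abar) x_0 + sqrt(1 - abar) eps,
    # the posterior mean is E[x_0 | x_t] = sqrt(abar) * x_t.
    return abar.sqrt() * xt


x = torch.randn(dim)  # initial sample x_T
for t in reversed(range(0, num_train_timesteps, skip)):
    at = alpha_bars[t]
    at_next = alpha_bars[t - skip] if t - skip >= 0 else torch.tensor(1.0)
    bt = 1 - at / at_next

    xt = x.detach().requires_grad_()
    x0_t = toy_denoiser(xt, at)                    # 1. denoising step
    loss = (y - mask * x0_t).pow(2).sum().sqrt()   # 2. norm of the residual
    grad = torch.autograd.grad(loss, xt)[0]

    sigma_tilde = eta * (bt * (1 - at_next) / (1 - at)).sqrt()
    c2 = ((1 - at_next) - sigma_tilde**2).clamp(min=0).sqrt()
    x = (                                          # 3. and 4. noise and DDIM step
        (at_next.sqrt() - c2 * at.sqrt() / (1 - at).sqrt()) * x0_t.detach()
        + c2 * xt.detach() / (1 - at).sqrt()
        + sigma_tilde * torch.randn(dim)
        - rho * grad
    )

recon = x
print("residual norm on observed entries:", (mask * (recon - x_true)).norm().item())
```

Because the gradient of the residual norm has constant magnitude, the iterate is pushed toward the measurements at every step. The unobserved entries are driven only by the prior, so they end up as plausible samples rather than copies of the ground truth.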


Using DPS in your inverse problem#

You can readily use this algorithm via the deepinv.sampling.DPS class.

y = physics(x)
model = dinv.sampling.DPS(dinv.models.DiffUNet(), data_fidelity=dinv.optim.data_fidelity.L2())
xhat = model(y, physics)

References:

