A Derivation of the Fokker-Planck Equation

2024-05

We derive the Fokker-Planck equation (the forward equation) and the backward equation. The FPE is central to diffusion models: it guarantees the correctness of the image generation algorithm. Despite its extensive use in the diffusion literature, the background of the FPE is not often presented in detail.

This writeup derives the operators for the Kolmogorov forward and backward equations. The forward equation is known as the Fokker-Planck equation (FPE). It was introduced to the diffusion model literature by Song et al., 2021 to show that the distribution induced by the learned image denoiser exactly matches the distribution obtained by adding noise to data. The FPE thus guarantees that solving the reverse diffusion ODE / SDE leads us back to the image manifold. The role of the Fokker-Planck equation is discussed extensively in the diffusion literature, e.g. EDM (Karras et al., 2022), but the derivation of the FPE itself is seldom presented. This article makes an attempt, hopefully without too many gaps.

1. Feller Generator / Semigroup of Diffusion

The material in this section is mostly rehashed from the notes by Prof. Lawler; his 235 lectures (chapter 3) discuss the infinitesimal generator for discrete Markov chains.
Let $u_t(x) = \int f(y) \, p_t(y|x) \, dy$, where $f$ is some benign test function. In plain words, $u_t(x)$ is the expected value of $f$ after rolling the process out for time $t$ starting at $x$. We want to find the operator $\mathcal{L}$ for the PDE
\begin{align} \partial_t u_t(x) = \mathcal{L}_x u_t(x). \end{align}
Since $u_t$ is an integral over the dummy variable $y$, we can move both operators $\partial_t$ and $\mathcal{L}$ inside the integral and obtain a PDE on the transition kernel $p_t(y|x)$:
\begin{align} \int f(y) \, \partial_t p_t(y|x) \, dy = \int f(y) \, \mathcal{L}_x p_t(y|x) \, dy \quad \Longrightarrow \quad \partial_t p_t(y|x) = \mathcal{L}_x p_t(y|x). \end{align}
The time derivative is defined as
\begin{align} \partial_t u_t(x) = \lim_{s \rightarrow 0} \frac{u_{t+s}(x) - u_t(x)}{s}, \end{align}
and we attempt to write $u_{t+s}(x)$ in terms of $u_t(\cdot)$:
\begin{align}
u_{t+s}(x) &= \int f(z) \, p_{t+s}(z|x) \, dz \nonumber \\
&= \int_z f(z) \left( \int_y p_{t}(z|y) \, p_{s}(y|x) \, dy \right) dz \\
&= \int_y p_{s}(y|x) \left( \int_z f(z) \, p_{t}(z|y) \, dz \right) dy \nonumber \\
&= \int_y u_t(y) \, p_{s}(y|x) \, dy .
\end{align}
Note that when $t + s$ is split, $t$ goes to the second segment $y \rightarrow z$ and $s$ is assigned to the first segment $x \rightarrow y$. Otherwise we would not be able to form an expression involving $u_t(\cdot)$, because $f$ is integrated over $z$. This gives these steps a "backward-going" flavor: we have $u_t(y)$ and then extend backward in time by $s$ to form $u_{t+s}(x)$.
We write $g(x) = u_t(x)$ as a new test function.
\begin{align} \partial_t u_t(x) = \lim_{s \rightarrow 0} \frac{ \int_y g(y) \, p_{s}(y|x) \, dy - g(x)}{s} = \lim_{s \rightarrow 0} \frac{ \mathbb{E} [g(x_s) \mid x_0=x] - g(x) }{s} . \end{align}
This is the form of the Feller generator; the expression arises rather incidentally here. To evaluate $\partial_t u_t(x) = \lim_{s \rightarrow 0} \frac{ \mathbb{E} [g(x_s) - g(x) \mid x_0=x]}{s}$ for the diffusion $dx_t = \mu(x_t, t)\, dt + \sigma(x_t, t)\, dB_t$, we expand $g(x_s)$ using Ito's lemma:
\begin{align} g(x_s) - g(x_0) = \int_0^s \left[ \mu(x_h, h) \, g'(x_h) + \frac{\sigma^2(x_h, h)}{2} g''(x_h) \right] dh + \int_0^s g'(x_h) \, \sigma(x_h, h) \, dB_h. \end{align}
The second term is a martingale and its expectation is $0$. Thus
\begin{align}
\partial_t u_t(x) &= \lim_{s \rightarrow 0} \frac{1}{s} \, \mathbb{E} \left[ \int_0^s \left[ \mu(x_h, h) \, g'(x_h) + \frac{\sigma^2(x_h, h)}{2} g''(x_h) \right] dh \right] \nonumber \\
&= \mathbb{E} \left[ \lim_{s \rightarrow 0} \frac{1}{s} \int_0^s \left[ \mu(x_h, h) \, g'(x_h) + \frac{\sigma^2(x_h, h)}{2} g''(x_h) \right] dh \right] \\
&= \mathbb{E} \left[ \mu(x_0, 0) \, g'(x_0) + \frac{\sigma^2(x_0, 0)}{2} g''(x_0) \right] \nonumber \\
&= \mu(x, 0) \, \partial_x u_t(x) + \frac{\sigma^2(x, 0)}{2} \partial_{xx} u_t(x) \quad \text{using } x_0 = x.
\end{align}
Not quite right: the left-hand side sits at time $t$ while the coefficients on the right are evaluated at time $0$, so some logic is missing (the mismatch disappears for a time-homogeneous process, where $\mu$ and $\sigma$ do not depend on $t$). For an alternative proof, directly apply the time-dependent version of Ito's lemma to $u_t(x)$; since $u_t(x)$ is a conditional expectation, the process $u_{T-t}(x_t)$ is a martingale, so the $dt$ terms must vanish, which yields the PDE. This is much quicker.
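As a numerical sanity check on the generator (my own addition, not from Lawler's notes), we can simulate a toy Ornstein-Uhlenbeck process $dx = -\theta x \, dt + \sigma \, dB$ with Euler-Maruyama and compare the difference quotient $\frac{\mathbb{E}[g(x_s) \mid x_0 = x] - g(x)}{s}$ for a small $s$ against $\mu(x) \, g'(x) + \frac{\sigma^2}{2} g''(x)$:

```python
import numpy as np

# Sanity check:  (E[g(x_s) | x_0] - g(x_0)) / s  ~=  mu(x_0) g'(x_0) + (sigma^2 / 2) g''(x_0)
# for a toy Ornstein-Uhlenbeck process dx = -theta x dt + sigma dB.
rng = np.random.default_rng(0)
theta, sigma, x0 = 1.0, 0.7, 0.5
g = lambda x: np.sin(x)        # benign test function
g1 = lambda x: np.cos(x)       # g'
g2 = lambda x: -np.sin(x)      # g''

s, dt, n_paths = 1e-2, 1e-4, 200_000
x = np.full(n_paths, x0)
for _ in range(int(s / dt)):   # Euler-Maruyama rollout of duration s
    x += -theta * x * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)

lhs = (g(x).mean() - g(x0)) / s
rhs = -theta * x0 * g1(x0) + 0.5 * sigma**2 * g2(x0)
print(lhs, rhs)  # the two numbers agree up to O(s) bias and Monte-Carlo noise
```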

2. Kolmogorov Forward / Backward Operator Duality

The material in this section is based on the notes by Miranda Holmes-Cerfon, with influence from the notes by Jonathan Goodman. I remember reading Goodman's claim that operator duality is the only legitimate way to derive the FPE / forward equation, but I cannot find that webpage now.
To relate the forward and backward PDEs, we consider a test function $f$ integrated over the end-of-time marginal distribution $p_T$. We further break $p_T$ into integrals of transition kernels $p_{T-t}(\cdot \mid \cdot)$ and $p_t(\cdot \mid \cdot)$:
\begin{align}
\int_z f(z) \, p_T(z) \, dz &= \int_z f(z) \left( \int_y p_T(z|y) \, p(y) \, dy \right) dz \nonumber \\
&= \int_z f(z) \left( \int_y \left( \int_x p_{T-t}(z|x) \, p_t(x|y) \, dx \right) p(y) \, dy \right) dz \\
&= \int_y \int_x \underbrace{ \left( \int_z f(z) \, p_{T-t}(z|x) \, dz \right) }_{u_{T-t}(x)} p_t(x|y) \, p(y) \, dx \, dy \nonumber \\
&= \int_x u_{T-t}(x) \left( \int_y p_t(x|y) \, p(y) \, dy \right) dx \nonumber \\
&= \int_x u_{T-t}(x) \, p_t(x) \, dx .
\end{align}
Since the LHS is a constant, it does not depend on the intermediate time $t$, so differentiating the RHS with $\partial_t$ gives zero and provides a constraint:
\begin{align} 0 = \partial_t \int_x u_{T-t}(x) \, p_t(x) \, dx = \int_x \left[ \partial_t u_{T-t}(x) \cdot p_t(x) + u_{T-t}(x) \cdot \partial_t p_t(x) \right] dx . \end{align}
We use $\langle a, b \rangle$ to denote the inner product $\int_x a(x) \, b(x) \, dx$, and the constraint yields
\begin{align}
\boxed{ \langle \partial_t u_{T-t}, p_t \rangle + \langle u_{T-t}, \partial_t p_t \rangle = 0 } ~ \Longrightarrow ~ &\langle u_{T-t}, \partial_t p_t \rangle = -\langle \partial_t u_{T-t}, p_t \rangle \nonumber \\
&\langle u_{T-t}, \partial_t p_t \rangle = -\langle -\mathcal{L}_x u_{T-t}, p_t \rangle = \langle \mathcal{L}_x u_{T-t}, p_t \rangle = \langle u_{T-t}, \mathcal{L}^*_x p_t \rangle . \\
&\boxed{ \partial_t p_t(x) = \mathcal{L}^*_x p_t(x) } \label{FPE}
\end{align}
This is the Kolmogorov forward equation, also known as the Fokker-Planck equation. In the second line we used the backward equation from Section 1, $\partial_t u_{T-t} = -\mathcal{L}_x u_{T-t}$ (the minus sign comes from differentiating through $T-t$), and $\mathcal{L}^*_x$ denotes the adjoint of $\mathcal{L}_x$ with respect to $\langle \cdot, \cdot \rangle$. The boxed equation $\partial_t p_t = \mathcal{L}^*_x p_t$ holds for both continuous and discrete Markov processes.
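To make the duality concrete in the discrete case, here is a small toy illustration (my own, assuming `scipy` is available for the matrix exponential): for a continuous-time Markov chain with rate matrix $Q$, the generator acts on functions as $\mathcal{L} g = Q g$ and its adjoint acts on distributions as $\mathcal{L}^* p = Q^\top p$; the pairing $\langle u_{T-t}, p_t \rangle$ stays constant as the intermediate time $t$ is moved, exactly as in the boxed constraint above.

```python
import numpy as np
from scipy.linalg import expm

# Toy continuous-time Markov chain: generator L g = Q g, adjoint L* p = Q^T p.
# The pairing <u_{T-t}, p_t> should be the same for every intermediate time t.
rng = np.random.default_rng(0)
n, T = 5, 2.0
Q = rng.random((n, n))
np.fill_diagonal(Q, 0.0)
Q -= np.diag(Q.sum(axis=1))         # rows sum to zero -> valid rate matrix

g = rng.random(n)                    # test function f on the n states
p0 = rng.random(n); p0 /= p0.sum()   # initial distribution

for t in np.linspace(0.0, T, 5):
    p_t = expm(t * Q).T @ p0         # forward:  dp/dt = Q^T p
    u_Tmt = expm((T - t) * Q) @ g    # backward: u_{T-t} = exp((T-t) Q) g
    print(t, u_Tmt @ p_t)            # constant in t, equal to E[g(X_T)]
```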
Adjoints of operators. To write down an explicit expression for the adjoint, we use the following integration-by-parts identities (assuming the boundary terms vanish, e.g. for densities and test functions that decay at infinity):
\begin{align}
\langle h(x) \, \partial_x a, b \rangle &= \int_x h(x) \, \partial_x a(x) \cdot b(x) \, dx = \int_x \partial_x a(x) \cdot [h(x) \, b(x)] \, dx \nonumber \\
&= -\int_x a(x) \cdot \partial_x [h(x) \, b(x)] \, dx = \langle a, -\partial_x[h(x) \, b] \rangle \\
\langle l(x) \, \partial_{xx} a, b \rangle &= \int_x l(x) \, \partial_{xx} a(x) \cdot b(x) \, dx = \int_x \partial_{xx} a(x) \cdot [l(x) \, b(x)] \, dx \nonumber \\
&= \int_x a(x) \cdot \partial_{xx} [l(x) \, b(x)] \, dx = \langle a, \partial_{xx}[l(x) \, b] \rangle \quad \text{(integration by parts twice)}.
\end{align}
In the case of diffusion,
\begin{align}
\langle u_{T-t}, \partial_t p_t \rangle = \langle \mathcal{L}_x u_{T-t}, p_t \rangle &= \left\langle \mu(x, t) \, \partial_x u_{T-t} + \frac{\sigma^2(x, t)}{2} \partial_{xx} u_{T-t}, \; p_t \right\rangle \nonumber \\
&= \left\langle u_{T-t}, \; -\partial_x [\mu(x, t) \, p_t] + \partial_{xx} \Big[\frac{\sigma^2(x, t)}{2} \, p_t\Big] \right\rangle \nonumber \\
&= \langle u_{T-t}, \mathcal{L}^*_x p_t \rangle .
\end{align}
The Fokker-Planck equation is thus
\begin{align} \partial_t p_t(x) = -\partial_x [\mu(x, t) \, p_t(x)] + \partial_{xx} \left[\frac{\sigma^2(x, t)}{2} \, p_t(x)\right]. \end{align}
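As a quick numerical check (my own addition, not from the sources above), the Ornstein-Uhlenbeck process $dx = -\theta x \, dt + \sigma \, dB$ has the closed-form transition density $p_t(x \mid x_0) = \mathcal{N}\!\left(x_0 e^{-\theta t}, \; \frac{\sigma^2}{2\theta}\left(1 - e^{-2\theta t}\right)\right)$, so we can verify the Fokker-Planck equation directly with finite differences:

```python
import numpy as np

# Verify  d/dt p_t  =  -d/dx[mu p_t] + d^2/dx^2[(sigma^2 / 2) p_t]
# for the OU process dx = -theta x dt + sigma dB, using its closed-form
# Gaussian transition density and finite differences on a grid.
theta, sigma, x0 = 1.0, 0.8, 0.3

def p(x, t):
    m = x0 * np.exp(-theta * t)
    v = sigma**2 * (1.0 - np.exp(-2.0 * theta * t)) / (2.0 * theta)
    return np.exp(-(x - m) ** 2 / (2.0 * v)) / np.sqrt(2.0 * np.pi * v)

x = np.linspace(-3.0, 3.0, 2001)
dx = x[1] - x[0]
t, dt = 0.5, 1e-5

lhs = (p(x, t + dt) - p(x, t - dt)) / (2.0 * dt)   # d/dt p_t
drift_flux = -theta * x * p(x, t)                  # mu(x) p_t
rhs = -np.gradient(drift_flux, dx) + np.gradient(np.gradient(0.5 * sigma**2 * p(x, t), dx), dx)
print(np.max(np.abs(lhs - rhs)))  # small, limited only by finite-difference error
```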
Assuming the diffusion coefficient depends only on time, $\sigma = \sigma(t)$, as is standard in diffusion models, we can pull $\sigma^2/2$ out of one spatial derivative and rewrite the diffusion term using the score function $\partial_x \log p_t(x)$, which gives the deterministic (probability flow) ODE form:
\begin{align}
\partial_t p_t(x) &= -\partial_x [\mu(x, t) \, p_t(x)] + \partial_{xx} \left[\frac{\sigma^2(t)}{2} \, p_t(x)\right] \nonumber \\
&= -\partial_x \left[\mu(x, t) \, p_t(x) - \frac{\sigma^2(t)}{2} \partial_{x} p_t(x)\right] \nonumber \\
&= -\partial_x \left[\mu(x, t) \, p_t(x) - \frac{\sigma^2(t)}{2} \partial_{x} \log p_t(x) \cdot p_t(x)\right] \nonumber \\
&= -\partial_x \left[ \left( \mu(x, t) - \frac{\sigma^2(t)}{2} \partial_{x} \log p_t(x) \right) p_t(x) \right].
\end{align}
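The last line is a continuity equation with velocity field $\mu(x, t) - \frac{\sigma^2(t)}{2} \partial_x \log p_t(x)$, so the deterministic ODE $\frac{dx}{dt} = \mu(x, t) - \frac{\sigma^2(t)}{2} \partial_x \log p_t(x)$ (the probability flow ODE of Song et al., 2021) shares the same marginals $p_t$ as the SDE. Below is a toy check of this for the OU process, where the score is available in closed form (my own sketch, not from the paper):

```python
import numpy as np

# OU process dx = -theta x dt + sigma dB with x_0 ~ N(0, v0): the marginal stays
# Gaussian, p_t = N(0, v_t), v_t = v0 exp(-2 theta t) + sigma^2 (1 - exp(-2 theta t)) / (2 theta),
# so the score is d/dx log p_t(x) = -x / v_t.  Transporting samples with the
# deterministic ODE dx/dt = -theta x + sigma^2 x / (2 v_t) should give the same
# marginal variance as the SDE (Euler integration for both).
rng = np.random.default_rng(0)
theta, sigma, v0, T, dt, n = 1.0, 0.8, 0.2, 1.0, 1e-3, 100_000

def var_t(t):
    return v0 * np.exp(-2 * theta * t) + sigma**2 * (1 - np.exp(-2 * theta * t)) / (2 * theta)

x_sde = np.sqrt(v0) * rng.standard_normal(n)
x_ode = x_sde.copy()
for k in range(int(T / dt)):
    t = k * dt
    x_sde += -theta * x_sde * dt + sigma * np.sqrt(dt) * rng.standard_normal(n)
    x_ode += (-theta * x_ode + sigma**2 * x_ode / (2 * var_t(t))) * dt

print(x_sde.var(), x_ode.var(), var_t(T))  # all three should be close
```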

3. Density estimation

The probability density $p(\bm{x}, t)$ induced by a deterministic ODE $\frac{d\bm{x}}{dt} = \bm{v}(\bm{x}, t)$ is described by the (noiseless) Fokker-Planck equation, i.e. the continuity equation. Expanding it with the product rule for the divergence:
\begin{align}
\partial_t p(\bm{x}, t) &= - \nabla \cdot [ \bm{v}(\bm{x}, t) \, p(\bm{x}, t) ] \nonumber \\
&= - [ p(\bm{x}, t) \, \nabla \cdot \bm{v}(\bm{x}, t) + \bm{v}(\bm{x}, t) \cdot \nabla_{\bm{x}} p(\bm{x}, t) ], \nonumber \\
\partial_t \log p(\bm{x}, t) &= - \nabla \cdot \bm{v}(\bm{x}, t) - \bm{v}(\bm{x}, t) \cdot \nabla_{\bm{x}} \log p(\bm{x}, t) .
\end{align}
To integrate the change in probability density over the ODE trajectory, we need
\begin{align}
\frac{d}{dt} \log p(\bm{x}_t, t) &= \partial_t \log p(\bm{x}, t) + \nabla_{\bm{x}} \log p(\bm{x}, t) \cdot \frac{d\bm{x}_t}{dt} \nonumber \\
&= - \nabla \cdot \bm{v}(\bm{x}, t),
\end{align}
where the derivative is taken along the trajectory $\bm{x}_t$, so the $\bm{v} \cdot \nabla_{\bm{x}} \log p$ terms cancel.
For a neural ODE, e.g. a diffusion model at inference time, computing the density at a data point $\bm{x}$ amounts to integrating the divergence of the velocity along the trajectory. Evaluating the divergence, i.e. the trace of the Jacobian of a neural-net forward pass, is costly because it is expensive to materialize the full Jacobian. FFJORD (Grathwohl et al., 2019) proposes the Hutchinson trace estimator $\mathrm{tr}\!\left( \frac{\partial \bm{v}_{\theta}(\bm{x}, t)}{\partial \bm{x}} \right) = \mathbb{E}_{ \bm{\epsilon} \sim \mathcal{N}(\bm{0}, \bm{I})} \left[ \bm{\epsilon}^T \frac{\partial \bm{v}_{\theta}(\bm{x}, t)}{\partial \bm{x}} \bm{\epsilon} \right]$, since vector-Jacobian products are cheap to compute.
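Here is a minimal PyTorch sketch of this recipe (my own illustration, not FFJORD's actual code); `velocity` is a hypothetical stand-in for a learned $\bm{v}_\theta(\bm{x}, t)$, chosen so that the exact divergence is known for comparison:

```python
import torch

def hutchinson_divergence(v_fn, x, t, n_probes=4):
    """Estimate div v(x, t) = tr(dv/dx) via E[eps^T (dv/dx) eps], eps ~ N(0, I)."""
    x = x.detach().requires_grad_(True)
    div = torch.zeros(x.shape[0])
    for _ in range(n_probes):
        eps = torch.randn_like(x)
        out = v_fn(x, t)
        # one vector-Jacobian product per probe: eps^T (dv/dx)
        vjp = torch.autograd.grad(out, x, grad_outputs=eps)[0]
        div = div + (vjp * eps).sum(dim=1)
    return (div / n_probes).detach()

def velocity(x, t):
    # Hypothetical stand-in for a learned v_theta(x, t); a simple analytic field
    # whose exact divergence is -0.5 * (1 + t) * dim, for easy comparison.
    return -0.5 * (1.0 + t) * x

# Accumulate log p(x_T, T) - log p(x_0, 0) = minus the time integral of div v,
# along an Euler-integrated trajectory.
x = torch.randn(16, 2)
delta_log_p = torch.zeros(16)
dt, n_steps = 1e-2, 100
for k in range(n_steps):
    t = k * dt
    delta_log_p -= hutchinson_divergence(velocity, x, t) * dt
    x = x + velocity(x, t) * dt

print(delta_log_p[:4])  # ~= 1.5 for every sample; see note below
```

For this toy field $\nabla \cdot \bm{v} = -(1+t)$ in two dimensions, so the exact accumulated change is $\int_0^1 (1+t)\,dt = 1.5$ per sample; the printout should hover around this value up to Monte-Carlo and Euler discretization error. In a diffusion model one would add this accumulated change to the log-density under the terminal Gaussian to read off $\log p_0(\bm{x}_0)$.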
This recipe was first popularized in the ML community by Neural ODE (Chen et al., 2018); their derivation in appendix A.2 is essentially what I wrote here. I learned from someone working in fluid dynamics that this total derivative along the flow is what they call the "material derivative", so it is not exactly a new result.

References

  1. Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. Advances in neural information processing systems, 2018.
  2. Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, and David Duvenaud. FFJORD: free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations. 2019. URL: https://openreview.net/forum?id=rJxgknCcK7.
  3. Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Adv. Neural Inform. Process. Syst. 2022.
  4. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In Int. Conf. Learn. Represent. 2021.

Last updated on 2024-12-31.