Kolmogorov forward-backward PDEs of a Stochastic Differential Equation

2024-05

I provide here some notes for the purpose of.

1. Feller Generator / Semigroup of Diffusion

Our goal is to derive the operator for Forward the Backward Equation. This is used by Song et al., 2021, Song and Ermon, 2019 to show that the distribution induced would match with the ideal. It's also used in a series of other papers [1][3][4][7][2]. I am first led to this body of work by Karras et al., 2022.
Let ut(x)=f(y)pt(yx)dyu_t(x) = \int f(y) \, p_t(y|x) \, dy , where ff is some benign test function. In plain words, uu is the expected value of ff after time tt rollout starting at xx. We want to find operator L\ol for the PDE
tut(x)=Lxut(x).\begin{align} \partial_t u_t(x) = \ol_x u_t(x).\end{align}
Since uu is an integral over dummy yy, we can move both of the the operators t\partial_t and L\ol inside the integral, and obtain a PDE on the transition kernel pt(yx)p_t(y|x)
f(y)tpt(yx)dy=f(y)Lxpt(yx)dytpt(yx)=Lxpt(yx).\begin{align} \int f(y) \, \partial_t p_t(y|x) \, dy = \int f(y) \, \ol_x p_t(y|x) \, dy \quad \Longrightarrow \quad \partial_t p_t(y|x) = \ol_x p_t(y|x).\end{align}
The time derivative is defined as
tut(x)=lims0ut+s(x)ut(x)s,\begin{align} \partial_t u_t(x) = \lim_{s \rightarrow 0} \frac{u_{t+s}(x) - u_t(x)}{s},\end{align}
and we attempt to write u(x,t+s)u(x, t+s) as a function of ut(x)u_t(x)
ut+s(x)=f(z)pt+s(zx)dz=zf(z)(ypt(zy)ps(yx)dy)dz=yps(yx)(zf(z)pt(zy)dz)dy=yut(y)ps(yx)dy.\begin{align} u_{t+s}(x) &= \int f(z) \, p_{t + s}(z|x) \, dz \nonumber \\ &= \int_z f(z) \left( \int_y p_{\ctext{t}}(z|y) \, p_{\ctext{s}}(y|x) dy \right) dz \\ &= \int_y p_{s}(y|x) \left( \int_z f(z) \, p_{t}(z|y) dz \right) dy \nonumber \\ &= \int_y u_t(y) \cdot p_{s}(y|x) \, dy .\end{align}
Note that when t+st + s is split, tt goes to the 2nd segment yzy \rightarrow z and ss is assigned to 1st segment xyx \rightarrow y. Otherwise we won't be able to form an expression for u(,t)u(\cdot, t) cuz ff is integrated over zz. This gives these steps a "backward-going" flavor i.e. we have u(y,x)u(y, x) and then extend backward by ss to form ut+s(x)u_{t+s}(x).
We write g(x)=ut(x)g(x) = u_t(x) as a new test function.
tut(x)=lims0yg(y)ps(yx)dyg(x)s=lims0E[g(xs)x0=x]g(x)s.\begin{align} \partial_t u_t(x) = \lim_{s \rightarrow 0} \frac{ \int_y g(y) \cdot p_{s}(y|x) dy - g(x)}{s} = \lim_{s \rightarrow 0} \frac{ \E [g(x_s)|x_0=x] - g(x) }{s} .\end{align}
This is the form of the Feller generator. The expression is rather incidental. To solve tut(x)=lims0E[g(xs)g(x)x0=x]s\partial_t u_t(x) = \lim_{s \rightarrow 0} \frac{ \E [g(x_s) - g(x)|x_0=x]}{s}, we expand using Ito's
g(xs)g(x0)=0s[μ(xh,h)g(xh)+σ2(xh,h)2g(xh)]dh+0sg(xh)σ(xh,h)dBh.\begin{align} g(x_s) - g(x_0) = \int_0^s \left[ \mu(x_h, h) \, g'(x_h) + \frac{\sigma^2(x_h, h)}{2}g''(x_h) \right] dh + \int_0^s g'(x_h) \, \sigma(x_h, h) dB_h.\end{align}
The 2nd term is a martingale and its expectation is 00. Thus
tut(x)=lims0E[0s[μ(xh,h)g(xh)+σ2(xh,h)2g(xh)]dh]s=E[lims00s[μ(xh,h)g(xh)+σ2(xh,h)2g(xh)]dh0s]=E[μ(x0,0)g(x0)+σ2(x0,0)2g(x0)]=μ(x,0)xut(x)+σ2(x,0)2xxut(x)using x0=x\begin{align} \partial_t u_t(x) &= \lim_{s \rightarrow 0} \frac{ \E \left[ \int_0^s [ \mu(x_h, h) \, g'(x_h) + \frac{\sigma^2(x_h, h)}{2}g''(x_h) ] dh \right] }{s} \nonumber \\ &= \E \left[ \lim_{s \rightarrow 0} \frac{\int_0^s [ \mu(x_h, h) \, g'(x_h) + \frac{\sigma^2(x_h, h)}{2}g''(x_h) ] dh - 0 }{s} \right] \\ &= \E \left[ \mu(x_0, 0) \, g'(x_0) + \frac{\sigma^2(x_0, 0)}{2}g''(x_0) \right] \nonumber \\ &= \mu(x, 0) \, \partial_x u_t(x) + \frac{\sigma^2(x, 0)}{2} \partial_{xx} u_t(x) \quad \text{using } x_0 = x\end{align}
Failed. Left tt right 00. Some logic is missing.
For an alternative proof, directly apply Ito v2 on ut(x)u_t(x), and since ut(x)u_t(x) is a conditional expectation, the terms on dtdt is 0. This way you get the PDE. This is much quicker.

2. Kolmogorov Forward Backward Operator Duality

To relate forward and backward PDEs, we consider a test function ff integrated over the end-of-time marginal distribution pTp_T. We further break pTp_T into integrals of transition kernels pTt()p_{T-t}(\cdot | \cdot) and pt()p_t({\cdot | \cdot}),
zf(z)pT(z)dz=zf(z)(ypT(zy)p(y)dy)dz=zf(z)(y(xpTt(zx)pt(xy)dx)p(y)dy)dz=yx(zf(z)pTt(zx)dz)uTt(x)pt(xy)p(y)dxdy=xuTt(x)(ypt(xy)p(y)dy)dx=xuTt(x)pt(x)dx.\begin{align} \int_z f(z) \, p_T(z) \, dz &= \int_z f(z) \left( \int_y \, p_T(z|y) \, p(y) \, dy \right) dz \nonumber \\ &= \int_z f(z) \left( \int_y \left( \int_x p_{T-t}(z|x) \cdot p_t(x|y) \, dx \right) p(y) \, dy \right) dz \\ &= \int_y \int_x \underbrace{ \left( \int_z f(z) \, p_{T-t}(z|x) dz \right) }_{u_{T-t}(x)} p_t(x|y) \, p(y) \, dx \, dy \nonumber \\ &= \int_x u_{T-t}(x) \left( \int_y p_t(x|y) \, p(y) dy \right) dx \nonumber \\ &= \int_x u_{T-t}(x) \cdot p_t(x) \, dx .\end{align}
Since the LHS is a constant, it is invariant to shifting the middle point tt, and t=0\partial_t = 0 provides a constraint
0=txuTt(x)pt(x)dx=x[tuTt(x)pt(x)+uTt(x)tpt(x)]dx.\begin{align} 0 = \partial_t \int_x u_{T-t}(x) \cdot p_t(x) \, dx = \int_x [ \partial_t u_{ T-t}(x) \cdot p_t(x) + u_{T-t}(x) \cdot \partial_t p_t(x) ]\, dx .\end{align}
We use a,b\langle a, b \rangle to denote the inner product xa(x)b(x)dx\int_x a(x) \, b(x) \, dx, and the constraint yields
tuTt,pt+uTt,tpt=0  uTt,tpt=tuTt,ptuTt,tpt=LxuTt,pt=LxuTt,pt=uTt,Lxpt.tpt(x)=Lxpt(x)eqnlabel-FPE\begin{align} \boxed{ \langle \partial_t u_{T-t}, p_t \rangle + \langle u_{T-t}, \partial_t p_t \rangle = 0 } ~ \Longrightarrow ~ &\langle u_{T-t}, \partial_t p_t \rangle = -\langle \partial_{\ctext{t}} u_{\ctext{T-t}}, p_t \rangle \nonumber \\ &\langle u_{T-t}, \partial_t p_t \rangle = -\langle \ctext{-} \ol_x u_{T-t}, p_t \rangle = \langle \ol_x u_{T-t}, p_t \rangle = \langle u_{T-t}, \ol^*_x p_t \rangle . \\ &\boxed{ \partial_t p_t(x) = \ol^*_x p_t(x) } \label{FPE}\end{align}
This is the Kolmogorov Forward / Fokker-Planck equation. Eqn. 14 is true for both continuous and discrete Markov processes.
[Adjoints of operators To write down the expression for the adjoint, we use the following integration by parts identities
h(x)xa,b=xh(x)xa(x)b(x)dx=xxa(x)[h(x)b(x)]dx=xa(x)x[h(x)b(x)]dx=a,x[h(x)b]l(x)xxa,b=xl(x)xxa(x)b(x)dx=xxxa(x)[l(x)b(x)]dx=xa(x)xx[l(x)b(x)]dx=a,xx[l(x)b]int by parts twice.\begin{align} \langle h(x) \, \partial_x a, b \rangle &= \int_x h(x) \, \partial_x a(x) \cdot b(x) \, dx = \int_x \partial_x a(x) \cdot [h(x) b(x)] \, dx \nonumber \\ &= \int_x a(x) \cdot -\partial_x [h(x) b(x)] \, dx = \langle a, -\partial_x[h(x) \cdot b] \rangle \\ \langle l(x) \, \partial_{xx} a, b \rangle &= \int_x l(x) \, \partial_{xx} a(x) \cdot b(x) \, dx = \int_x \partial_{xx} a(x) \cdot [l(x) b(x)] \, dx \nonumber \\ &= \int_x a(x) \cdot \partial_{xx} [l(x) b(x)] \, dx = \langle a, \partial_{xx}[l(x) \cdot b] \rangle \quad \text{int by parts twice}.\end{align}
In the case of diffusion,
tuTt,pt=LxtuTt,pt=μ(x,t)xuTt+σ2(x,t)2xxuTt,pt=uTt,x[μ(x,t)pt]+xx[σ2(x,t)2pt]=uTt,Lxpt=uTt,tpt.\begin{align} \langle \partial_t u_{T-t}, p_t \rangle = \langle \ol_x^t u_{T-t}, p_t \rangle &= \langle \mu(x, t) \, \partial_x u_{T-t} + \frac{\sigma^2(x, t)}{2} \partial_{xx} u_{T-t}, p_t \rangle \nonumber \\ &= \langle u_{T-t}, -\partial_x [\mu(x, t) \cdot p_t] + \partial_{xx} [\frac{\sigma^2(x, t)}{2} \cdot p_t] \rangle \nonumber \\ &= \langle u_{T-t}, \ol_x^* p_t \rangle = \langle u_{T-t}, \partial_t p_t \rangle .\end{align}
The Fokker-Planck equation is thus
tpt(x)=x[μ(x,t)pt(x)]+xx[σ2(x,t)2pt(x)].\begin{align} \partial_t p_t(x) = -\partial_x [\mu(x, t) \cdot p_t(x)] + \partial_{xx} [\frac{\sigma^2(x, t)}{2} \cdot p_t(x)].\end{align}
So the forward PDE describes the
The score function based deterministic ODE is:
tpt(x)=x[μ(x,t)pt(x)]+xx[σ22pt(x)]=x[μ(x,t)pt(x)σ22xpt(x)]=x[μ(x,t)pt(x)σ22xlogpt(x)pt(x)]=x[(μ(x,t)σ22xlogpt(x))pt(x)]\begin{align} \partial_t p_t(x) &= -\partial_x [\mu(x, t) \cdot p_t(x)] + \partial_{xx} [\frac{\sigma^2}{2} \cdot p_t(x)] \nonumber \\ &= -\partial_x [\mu(x, t) \cdot p_t(x) - \frac{\sigma^2}{2} \partial_{x} p_t(x)] \nonumber \\ &= -\partial_x [\mu(x, t) \cdot p_t(x) - \frac{\sigma^2}{2} \partial_{x} \log p_t(x) \cdot p_t(x)] \nonumber \\ &= -\partial_x \left[ \left( \mu(x, t) - \frac{\sigma^2}{2} \partial_{x} \log p_t(x) \right) \cdot p_t(x) \right]\end{align}

3. Neural ODE density estimation

The probability density p(x,t)p(\vx, t) induced by a deterministic ODE dxdt=v(x,t)\dv{\vx}{t} = \vv(\vx, t) is described by the (noiseless) Fokker Planck equation. Expand it using product rule of divergence:
tp(x,t)=[v(x,t)p(x,t)]=[p(x,t)v(x,t)+v(x,t)xp(x,t)],tlogp(x,t)=v(x,t)v(x,t)xlogp(x,t).\begin{align} \spdv{t} p(\vx, t) &= - \divr [ \vv(\bm{x}, t) \, p(\vx, t) ] \\ &= - [ p(\vx, t) \, \divr \vv(\vx, t) + \vv(\vx, t) \cdot \partial_{\vx} p(\vx, t) ], \\ \spdv{t} \log p(\vx, t) &= - \divr \vv(\vx, t) - \vv(\vx, t) \cdot \spdv{\vx} \log p(\vx, t) .\end{align}
To integrate the change in probability density over the ODE trajectory, we need
ddtlogp(x,t)=tlogp(x,t)+xlog(x,t)xt=v(x,t).\begin{align} \sdv{t} \log p(\vx, t) &= \spdv{t} \log p(\vx, t) + \spdv{\vx} \log(\vx, t) \cdot \pdv{\vx}{t} \\ &= - \divr \vv(\vx, t).\end{align}
For a neural ODE e.g. a diffusion at inference time, to compute the density at a data point x\vx, we integrate the divergence of velocity term over time. Evaluating divergence i.e. trace of jacobian of a neural net feedforward pass is costly cuz it's expensive to materialize the full jacobian. FFjord (https://arxiv.org/abs/1810.01367) proposes to use hutchinson trace estimator tr(vθ(x,t)x)=EϵN(0,I)ϵTvθ(x,t)xϵ\text{tr}( \pdv{\vv_{\theta}(\vx, t)}{\vx} ) = \E_{ \bm{\epsilon} \sim \gauss{0}{\bm{I}}} \epsilon^T \pdv{\vv_{\theta}(\vx, t)}{\vx} \bm{\epsilon} cuz vector-jacobian product is very manageable.
This recipe is first popularized in ML community by Neural ODE (https://arxiv.org/abs/1806.07366). Their derivation in appendix A.2 is essentially what I wrote here.
I have learned from a friend of friend working in fluid dynamics that they call it the "material derivative'' (https://en.wikipedia.org/wiki/Material_derivative). So it's not exactly a new result.

4. Discrete Diffusion Training Objective Interpretation

Recall the training objective of continuous diffusion
Ep(x0)Ep(xix0)D(xi)x02=Ep(xi)Ep(x0xi)D(xi)x02,\begin{align} & \E_{p(x_0)} \E_{p(x_i|x_0)} \| D(x_i) - x_0 \|^2 \nonumber \\ =& \E_{p(x_i)} \E_{p(x_0|x_i)} \| D(x_i) - x_0 \|^2,\end{align}
and the optimal prediction by denoiser D(xi)D(x_i) is the conditional mean E[x0xi]\E[x_0|x_i] . The minimum loss is thus Var(x0xi)\text{Var}(x_0 | x_i).
We now want to do the same for discrete diffusion, and establish the counterpart of "regressing to the mean''. The steps below are actually fully general (if we replace the \sum with \int). The spirit is to look for minima by completing the KL, analogous to completing the square.
Consider a single term in the ELBO interpretation of diffusion training objective. We are optimizing q(xi1xi)q(x_{i-1}|x_i) to minimize
Ep(x0)Ep(xix0)KL{p(xi1x0,xi)    q(xi1xi)}=Ep(xi)Ep(x0xi)KL{p(xi1x0,xi)    q(xi1xi)}=Ep(xi)x0p(x0xi)xi1p(xi1x0,xi)logp(xi1x0,xi)q(xi1xi)=Ep(xi)xi1,x0p(xi1,x0xi)logp(xi1x0,xi)q(xi1xi)=Ep(xi)xi1,x0p(xi1,x0xi)logp(xi1xi)q(xi1xi)p(xi1x0,xi)p(xi1xi)=Ep(xi)[xi1,x0p(xi1,x0xi)logp(xi1xi)q(xi1xi)+xi1,x0p(xi1,x0xi)logp(xi1x0,xi)p(xi1xi)]=Ep(xi)[xi1p(xi1xi)logp(xi1xi)q(xi1xi)+xi1,x0p(xi1,x0xi)logp(x0xi)p(xi1x0,xi)p(x0xi)p(xi1xi)111]=Ep(xi)[KL{p(xi1xi)    q(xi1xi)}+MI(X0;Xi1Xi)].\begin{align} & \E_{p(x_0)} \E_{p(x_i|x_0)} \KL{p(x_{i-1} | x_0, x_i)}{q(x_{i-1}|x_i)} \nonumber \\ =& \E_{p(x_i)} \E_{p(x_0|x_i)} \KL{p(x_{i-1} | x_0, x_i)}{q(x_{i-1}|x_i)} \\ =& \E_{p(x_i)} \sum_{x_0} p(x_0|x_i) \sum_{x_{i-1}} p(x_{i-1} | x_0, x_i) \log \frac{p(x_{i-1} | x_0, x_i)}{q(x_{i-1}|x_i)} \nonumber \\ =& \E_{p(x_i)} \sum_{x_{i-1},x_0} p(x_{i-1}, x_0 | x_i) \log \frac{p(x_{i-1} | x_0, x_i)}{q(x_{i-1}|x_i)} \\ =& \E_{p(x_i)} \sum_{x_{i-1},x_0} p(x_{i-1}, x_0 | x_i) \log \frac{ \ctext{p(x_{i-1}|x_i)} }{q(x_{i-1}|x_i)} \cdot \frac{p(x_{i-1} | x_0, x_i)}{ \ctext{p(x_{i-1}|x_i)} } \nonumber \\ =& \E_{p(x_i)} \left[ \sum_{x_{i-1},x_0} p(x_{i-1}, x_0 | x_i) \log \frac{ p(x_{i-1}|x_i) }{q(x_{i-1}|x_i)} + \sum_{x_{i-1},x_0} p(x_{i-1}, x_0 | x_i) \log \frac{p(x_{i-1} | x_0, x_i)}{ p(x_{i-1}|x_i) } \right] \\ =& \E_{p(x_i)} \left[ \sum_{x_{i-1}} p(x_{i-1} | x_i) \log \frac{ p(x_{i-1}|x_i) }{q(x_{i-1}|x_i)} + \sum_{x_{i-1},x_0} p(x_{i-1}, x_0 | x_i) \log \frac{ \ctext{p(x_0|x_i)} \, p(x_{i-1} | x_0, x_i) }{ \ctext{p(x_0|x_i)} \, p(x_{i-1}|x_i) \phantom{111} } \right] \\ =& \E_{p(x_i)} \bigg[ \KL{p(x_{i-1} | x_i)}{q(x_{i-1}|x_i)} + \text{MI}( X_0; X_{i-1}|X_i) \bigg].\end{align}
So the loss is minimized when q(xi1xi)q(x_{i-1}|x_i) becomes p(xi1xi)p(x_{i-1} | x_i), and the minimum loss is the mutual information MI(X0;Xi1Xi)\text{MI}( X_0; X_{i-1}|X_i) conditioned at XiX_i.
Note that I am using a non-standard notation for mutual information under conditioning. The definition of Conditional Mutual Information I(X;YZ):=EZ[KL{P(X,YZ)    P(XZ)P(YZ)}]I(X;Y|Z) := \E_Z[ \KL{P(X,Y|Z)}{P(X|Z) P(Y|Z)} ] includes the outer expectation.
My confusion: under continuous diffusion, q(xi1xi)q(x_{i-1}|x_i) is parameterized as a Gaussian. The real p(xi1xi)p(x_{i-1} | x_i) like you wrote is a mixture of gaussian. Therefore KL{p(xi1xi)    q(xi1xi)}\KL{p(x_{i-1} | x_i)}{q(x_{i-1}|x_i)} can't reach zero. So somehow in continuous diffusion this objective can't be perfectly optimized?

References

  1. Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Adv. Neural Inform. Process. Syst. 2022.
  2. Diederik P Kingma and Jimmy Ba. Adam: a method for stochastic optimization. In Int. Conf. Learn. Represent. 2015.
  3. Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in neural information processing systems, 34:21696–21707, 2021.
  4. Tzu-Mao Li, Michal Lukáč, Gharbi Michaël, and Jonathan Ragan-Kelley. Differentiable vector graphics rasterization for editing and learning. ACM Trans. Graph. (Proc. SIGGRAPH Asia), 39(6):193:1–193:15, 2020.
  5. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, and others. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. 2021.
  6. Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Adv. Neural Inform. Process. Syst. 2019.
  7. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In Int. Conf. Learn. Represent. 2021.
  8. Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh, and Greg Shakhnarovich. Score jacobian chaining: lifting pretrained 2d diffusion models for 3d generation. Computer Vision and Pattern Recognition, 2023.

Last updated on 2024-04-03. Design inspired by distill.