Comparing the variance of gradients: REINFORCE vs. re-parameterization

2025-02

Gradient estimates from REINFORCE and from sampling re-parameterization are equal in expectation. We discuss the main factor that causes the former to generally have higher variance.

When the parameter $\theta$ appears in the sampling distribution of an expectation, i.e. $\mathbb{E}_{p_{\theta}(x)}[f(x)]$, we cannot directly differentiate through the sampling w.r.t. $\theta$. The REINFORCE identity uses the log-derivative trick,
\begin{align}
\nabla_{\theta} \mathbb{E}_{p_{\theta}(x)} [f(x)]
&= \int \mathrm{d}x\, \nabla_{\theta} p_{\theta}(x) \cdot f(x)
 = \int \mathrm{d}x\, p_{\theta}(x) \cdot \frac{\nabla_{\theta} p_{\theta}(x)}{p_{\theta}(x)} \cdot f(x) \nonumber \\
&= \mathbb{E}_{p_{\theta}(x)} \left[ \nabla_{\theta} \log p_{\theta}(x) \cdot f(x) \right] .
\label{eq:reinforce}
\end{align}
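As a quick numerical sanity check (not part of the derivation above), here is a minimal NumPy sketch of the identity on a Bernoulli example, where the sample itself clearly cannot be differentiated w.r.t. the parameter; the payoff values and the sample count are arbitrary illustrative choices.

```python
# Monte Carlo check of the REINFORCE identity on a Bernoulli(p) example.
import numpy as np

rng = np.random.default_rng(0)
p = 0.3                                    # theta: the Bernoulli success probability


def f(x):                                  # arbitrary payoff f(x)
    return np.where(x == 1, 4.0, -1.0)


# True gradient: d/dp [p*f(1) + (1-p)*f(0)] = f(1) - f(0) = 5
x = rng.binomial(1, p, size=1_000_000)
score = x / p - (1 - x) / (1 - p)          # grad_p log Bernoulli(x; p)
print((f(x) * score).mean())               # ≈ 5.0
```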
The re-parameterization trick is a commonly used alternative. It moves $\theta$ out of the sampling and instead draws the randomness from a parameter-free random variable $\epsilon$,
\begin{align}
\mathbb{E}_{x \sim p_{\theta}(x)} [f(x)] &= \mathbb{E}_{\epsilon \sim p(\epsilon)} \left[ f(x(\theta, \epsilon)) \right] \\
\nabla_{\theta} \mathbb{E}_{x \sim p_{\theta}(x)} [f(x)] &= \mathbb{E}_{\epsilon \sim p(\epsilon)} \left[ \frac{\partial f}{\partial x} \frac{\partial x}{\partial \theta} \right] .
\end{align}
In the continuous setting, $\epsilon$ is often Gaussian. For discrete sampling, $\epsilon$ is Gumbel noise, as in the Gumbel-softmax trick.
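For reference, a minimal sketch of Gumbel-softmax style sampling, where the randomness again enters only through parameter-free noise; the logits and the temperature `tau` below are illustrative placeholders.

```python
# Gumbel-softmax re-parameterization: randomness comes from Gumbel(0, 1) noise.
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([1.0, 0.5, -1.0])   # unnormalized log-probs (would come from a net)
tau = 0.5                             # temperature; smaller -> closer to one-hot

u = rng.uniform(size=logits.shape)
g = -np.log(-np.log(u))               # Gumbel(0, 1) noise, independent of the parameters
y = np.exp((logits + g) / tau)
y /= y.sum()                          # soft, differentiable "sample" on the simplex
print(y)
```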
Take the continuous case as an example. A neural net predicts a distribution $\mathcal{N}(\mu_{\theta}, \sigma_{\theta})$. A single-sample REINFORCE gradient estimate is
\begin{align}
\hat{\nabla}_{\text{reinforce}}
&= f(x) \cdot \nabla_{\theta} \log \left( \frac{1}{\sigma_{\theta} \sqrt{2\pi}} \exp \left( -\frac{1}{2} \left( \frac{x - \mu_{\theta}}{\sigma_{\theta}} \right)^2 \right) \right) \nonumber \\
&= f(x) \cdot \left( \frac{x - \mu_{\theta}}{\sigma_{\theta}} \left( \frac{\mu_{\theta}'}{\sigma_{\theta}} + \frac{x - \mu_{\theta}}{\sigma_{\theta}} \cdot \frac{\sigma_{\theta}'}{\sigma_{\theta}} \right) - \frac{\sigma_{\theta}'}{\sigma_{\theta}} \right) \nonumber \\
&= \textcolor{brown}{ f(x) } \cdot \textcolor{teal}{ \left( \frac{\partial \mu_{\theta}}{\partial \theta} \cdot \frac{\epsilon}{\sigma_{\theta}} + \frac{\partial \sigma_{\theta}}{\partial \theta} \cdot \frac{\epsilon^2 - 1}{\sigma_{\theta}} \right) }
~\text{where}~~ \epsilon = \frac{x - \mu_{\theta}}{\sigma_{\theta}} \sim \mathcal{N}(0, 1) .
\end{align}
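As a sanity check on the last line, the sketch below averages this single-sample estimator over many draws, taking $\theta = (\mu_{\theta}, \sigma_{\theta})$ directly (so the Jacobians are trivial) and choosing $f(x) = x^2$, for which $\mathbb{E}[f(x)] = \mu^2 + \sigma^2$ and the true gradient is $(2\mu, 2\sigma)$; both choices are purely illustrative.

```python
# Unbiasedness check of the single-sample REINFORCE estimator for a Gaussian.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 2.0
n = 1_000_000

eps = rng.standard_normal(n)
x = mu + sigma * eps
f = x ** 2                               # E[f(x)] = mu^2 + sigma^2

grad_mu = f * eps / sigma                # mu-component of the estimator
grad_sigma = f * (eps ** 2 - 1) / sigma  # sigma-component of the estimator
print(grad_mu.mean(), grad_sigma.mean()) # ≈ 2*mu = 1.0 and 2*sigma = 4.0
```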
For the re-parameterization trick, the sample is produced by $x(\theta, \epsilon) = \mu_{\theta} + \sigma_{\theta} \cdot \epsilon$, with $\epsilon \sim \mathcal{N}(0, 1)$. The re-parameterized gradient estimate is
\begin{align}
\hat{\nabla}_{\text{re-param}} = \textcolor{brown}{ \frac{\partial f}{\partial x} } \cdot \textcolor{teal}{ \left( \frac{\partial \mu_{\theta}}{\partial \theta} + \frac{\partial \sigma_{\theta}}{\partial \theta} \cdot \epsilon \right) } .
\label{eq:grad_reparam}
\end{align}
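The same toy setup, now with the re-param estimator; again $\theta = (\mu_{\theta}, \sigma_{\theta})$ and $f(x) = x^2$ (so $\frac{\partial f}{\partial x} = 2x$) are illustrative choices.

```python
# Unbiasedness check of the single-sample re-param estimator for a Gaussian.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 2.0

eps = rng.standard_normal(1_000_000)
x = mu + sigma * eps                     # x(theta, eps)
dfdx = 2 * x                             # df/dx for f(x) = x^2

grad_mu = dfdx                           # df/dx * dmu/dtheta, with dmu/dtheta = 1
grad_sigma = dfdx * eps                  # df/dx * dsigma/dtheta * eps, with dsigma/dtheta = 1
print(grad_mu.mean(), grad_sigma.mean()) # ≈ 2*mu = 1.0 and 2*sigma = 4.0
```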
The two estimators $\hat{\nabla}_{\text{reinforce}}$ and $\hat{\nabla}_{\text{re-param}}$ share the same mean. In terms of structure, the second bracketed parts of the two estimators are similar: both involve some interaction of $\frac{\partial \mu_{\theta}}{\partial \theta}$, $\frac{\partial \sigma_{\theta}}{\partial \theta}$, and $\epsilon$.
What contributes most to the difference in variance is the first parts, namely $f(x)$ vs. $\frac{\partial f}{\partial x}$. The derivative $\frac{\partial f}{\partial x}$ is invariant to a global offset/shift of $f$, whereas the raw value $f(x)$ is not; this is why variance reduction by baseline subtraction helps REINFORCE. Also, in most applications $f$ is Lipschitz-continuous, so the magnitude of $\frac{\partial f}{\partial x}$ is bounded and typically small. There can be exceptions.
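To make this concrete, the sketch below adds a large constant offset to $f$ and compares the empirical variance of the $\mu$-component of the two single-sample estimators, with and without a baseline; the offset $c$ and $f(x) = x^2$ are arbitrary choices.

```python
# Variance of the mu-components under a global offset of f, with/without a baseline.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, c = 0.5, 2.0, 100.0           # c: a large global offset added to f
eps = rng.standard_normal(1_000_000)
x = mu + sigma * eps

f = x ** 2 + c                           # shifting f changes REINFORCE, not re-param
reinforce = f * eps / sigma              # mu-component of the REINFORCE estimator
reparam = 2 * x                          # mu-component of the re-param estimator
baseline = mu ** 2 + sigma ** 2 + c      # the true E[f(x)], used as a constant baseline
reinforce_b = (f - baseline) * eps / sigma

print(reinforce.var(), reinforce_b.var(), reparam.var())
# ≈ 2.9e3 vs 4.2e1 vs 1.6e1: the baseline removes most of the excess variance
```

With these numbers, the variance of the raw REINFORCE estimate is dominated by the $c^2 / \sigma^2$ term contributed by the offset; subtracting the baseline removes that term entirely.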
The re-parameterization trick is generally not applicable to RL. Besides differentiating the action sampling w.r.t. the policy parameter $\theta$, we would also need both the state transition and the reward to be differentiable, and usually the state transition is not. However, re-parameterization is used in Soft Actor-Critic (SAC) to train the actor, whose sampled actions are pushed to maximize the learned Q function.
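For reference, a rough sketch of what a SAC-style actor loss looks like in common implementations; `policy_net`, `q_net`, the two-headed policy output, and the temperature `alpha` are placeholder assumptions, not a definitive recipe.

```python
# Sketch of a SAC-style actor update: the action is re-parameterized, so the
# Q-value is differentiable w.r.t. the policy parameters; the environment
# dynamics are never differentiated.
import torch


def actor_loss(policy_net, q_net, states, alpha=0.2):
    # `policy_net` is assumed to return the mean and log-std of a diagonal Gaussian.
    mu, log_std = policy_net(states)
    dist = torch.distributions.Normal(mu, log_std.exp())
    pre_tanh = dist.rsample()                 # re-parameterized sample: mu + std * eps
    actions = torch.tanh(pre_tanh)            # squash into the bounded action range
    # log-prob with the tanh change-of-variables correction, summed over action dims
    log_prob = (dist.log_prob(pre_tanh) - torch.log(1 - actions.pow(2) + 1e-6)).sum(-1)
    q = q_net(states, actions)                # assumed to return a [batch]-shaped tensor
    return (alpha * log_prob - q).mean()      # minimizing this maximizes Q - alpha*log_pi
```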

Last updated on 2025-05-07. Design inspired by distill.