Comparing the variance of gradients: REINFORCE vs. re-parameterization

2025-02

Gradient estimates from REINFORCE and from sampling re-parameterization are equal in expectation. We discuss the main factor that causes the former to generally have higher variance.

When the parameter $\theta$ appears in the sampling distribution of an expectation, i.e. $\mathbb{E}_{p_{\theta}(x)}[f(x)]$, we cannot directly differentiate through the sampling w.r.t. $\theta$. The REINFORCE identity uses the log-derivative trick,
\begin{align}
\nabla_{\theta} \mathbb{E}_{p_{\theta}(x)} [f(x)]
&= \int \mathrm{d}x\, \nabla_{\theta} p_{\theta}(x) \cdot f(x)
 = \int \mathrm{d}x\, p_{\theta}(x) \cdot \frac{\nabla_{\theta} p_{\theta}(x)}{p_{\theta}(x)} \cdot f(x) \nonumber \\
&= \mathbb{E}_{p_{\theta}(x)} \left[ \nabla_{\theta} \log p_{\theta}(x) \cdot f(x) \right] .
\label{eq:reinforce}
\end{align}
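As a quick numerical sanity check (not part of the derivation above), here is a minimal NumPy sketch of the identity on a Bernoulli example, where the sample itself clearly cannot be differentiated w.r.t. the parameter; the payoff values and the sample count are arbitrary illustrative choices.

```python
# Monte Carlo check of the REINFORCE identity on a Bernoulli(p) example.
import numpy as np

rng = np.random.default_rng(0)
p = 0.3                                    # theta: the Bernoulli success probability


def f(x):                                  # arbitrary payoff f(x)
    return np.where(x == 1, 4.0, -1.0)


# True gradient: d/dp [p*f(1) + (1-p)*f(0)] = f(1) - f(0) = 5
x = rng.binomial(1, p, size=1_000_000)
score = x / p - (1 - x) / (1 - p)          # grad_p log Bernoulli(x; p)
print((f(x) * score).mean())               # ≈ 5.0
```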
The re-parameterization trick is a commonly used alternative. It moves $\theta$ out of the sampling and instead draws the randomness from a parameter-free random variable $\epsilon$,
\begin{align}
\mathbb{E}_{x \sim p_{\theta}(x)} [f(x)] &= \mathbb{E}_{\epsilon \sim p(\epsilon)} \left[ f(x(\theta, \epsilon)) \right] \\
\nabla_{\theta} \mathbb{E}_{x \sim p_{\theta}(x)} [f(x)] &= \mathbb{E}_{\epsilon \sim p(\epsilon)} \left[ \frac{\partial f}{\partial x} \frac{\partial x}{\partial \theta} \right] .
\end{align}
In the continuous setting, $\epsilon$ is often Gaussian. For discrete sampling, $\epsilon$ is Gumbel noise, as in the Gumbel-softmax trick.
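For reference, a minimal sketch of Gumbel-softmax style sampling, where the randomness again enters only through parameter-free noise; the logits and the temperature `tau` below are illustrative placeholders.

```python
# Gumbel-softmax re-parameterization: randomness comes from Gumbel(0, 1) noise.
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([1.0, 0.5, -1.0])   # unnormalized log-probs (would come from a net)
tau = 0.5                             # temperature; smaller -> closer to one-hot

u = rng.uniform(size=logits.shape)
g = -np.log(-np.log(u))               # Gumbel(0, 1) noise, independent of the parameters
y = np.exp((logits + g) / tau)
y /= y.sum()                          # soft, differentiable "sample" on the simplex
print(y)
```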
Take the continuous case as an example. A neural net predicts a distribution $\mathcal{N}(\mu_{\theta}, \sigma_{\theta})$. A single-sample REINFORCE gradient estimate is
\begin{align}
\hat{\nabla}_{\text{reinforce}}
&= f(x) \cdot \nabla_{\theta} \log \left( \frac{1}{\sigma_{\theta} \sqrt{2\pi}} \exp \left( -\frac{1}{2} \left( \frac{x - \mu_{\theta}}{\sigma_{\theta}} \right)^2 \right) \right) \nonumber \\
&= f(x) \cdot \left( \frac{x - \mu_{\theta}}{\sigma_{\theta}} \left( \frac{\mu_{\theta}'}{\sigma_{\theta}} + \frac{x - \mu_{\theta}}{\sigma_{\theta}} \cdot \frac{\sigma_{\theta}'}{\sigma_{\theta}} \right) - \frac{\sigma_{\theta}'}{\sigma_{\theta}} \right) \nonumber \\
&= \textcolor{brown}{ f(x) } \cdot \textcolor{teal}{ \left( \frac{\partial \mu_{\theta}}{\partial \theta} \cdot \frac{\epsilon}{\sigma_{\theta}} + \frac{\partial \sigma_{\theta}}{\partial \theta} \cdot \frac{\epsilon^2 - 1}{\sigma_{\theta}} \right) }
~\text{where}~~ \epsilon = \frac{x - \mu_{\theta}}{\sigma_{\theta}} \sim \mathcal{N}(0, 1) .
\end{align}
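As a sanity check on the last line, the sketch below averages this single-sample estimator over many draws, taking $\theta = (\mu_{\theta}, \sigma_{\theta})$ directly (so the Jacobians are trivial) and choosing $f(x) = x^2$, for which $\mathbb{E}[f(x)] = \mu^2 + \sigma^2$ and the true gradient is $(2\mu, 2\sigma)$; both choices are purely illustrative.

```python
# Unbiasedness check of the single-sample REINFORCE estimator for a Gaussian.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 2.0
n = 1_000_000

eps = rng.standard_normal(n)
x = mu + sigma * eps
f = x ** 2                               # E[f(x)] = mu^2 + sigma^2

grad_mu = f * eps / sigma                # mu-component of the estimator
grad_sigma = f * (eps ** 2 - 1) / sigma  # sigma-component of the estimator
print(grad_mu.mean(), grad_sigma.mean()) # ≈ 2*mu = 1.0 and 2*sigma = 4.0
```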
For the re-parameterization trick, the sample is produced by $x(\theta, \epsilon) = \mu_{\theta} + \sigma_{\theta} \cdot \epsilon$, with $\epsilon \sim \mathcal{N}(0, 1)$. The re-parameterized gradient estimate is
\begin{align}
\hat{\nabla}_{\text{re-param}} = \textcolor{brown}{ \frac{\partial f}{\partial x} } \cdot \textcolor{teal}{ \left( \frac{\partial \mu_{\theta}}{\partial \theta} + \frac{\partial \sigma_{\theta}}{\partial \theta} \cdot \epsilon \right) } .
\label{eq:grad_reparam}
\end{align}
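The same toy setup, now with the re-param estimator; again $\theta = (\mu_{\theta}, \sigma_{\theta})$ and $f(x) = x^2$ (so $\frac{\partial f}{\partial x} = 2x$) are illustrative choices.

```python
# Unbiasedness check of the single-sample re-param estimator for a Gaussian.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 2.0

eps = rng.standard_normal(1_000_000)
x = mu + sigma * eps                     # x(theta, eps)
dfdx = 2 * x                             # df/dx for f(x) = x^2

grad_mu = dfdx                           # df/dx * dmu/dtheta, with dmu/dtheta = 1
grad_sigma = dfdx * eps                  # df/dx * dsigma/dtheta * eps, with dsigma/dtheta = 1
print(grad_mu.mean(), grad_sigma.mean()) # ≈ 2*mu = 1.0 and 2*sigma = 4.0
```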
The two estimators $\hat{\nabla}_{\text{reinforce}}$ and $\hat{\nabla}_{\text{re-param}}$ share the same mean. In terms of structure, the second bracketed parts of the two estimators are similar: both involve some interaction of $\frac{\partial \mu_{\theta}}{\partial \theta}$, $\frac{\partial \sigma_{\theta}}{\partial \theta}$, and $\epsilon$.
What contributes most to the difference in variance is the first parts, namely $f(x)$ vs. $\frac{\partial f}{\partial x}$. The derivative $\frac{\partial f}{\partial x}$ is invariant to a global offset/shift of $f$, whereas the raw value $f(x)$ is not; this is why variance reduction by baseline subtraction helps REINFORCE. Also, in most applications $f$ is Lipschitz-continuous, so the magnitude of $\frac{\partial f}{\partial x}$ is bounded and typically small. There can be exceptions.
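To make this concrete, the sketch below adds a large constant offset to $f$ and compares the empirical variance of the $\mu$-component of the two single-sample estimators, with and without a baseline; the offset $c$ and $f(x) = x^2$ are arbitrary choices.

```python
# Variance of the mu-components under a global offset of f, with/without a baseline.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, c = 0.5, 2.0, 100.0           # c: a large global offset added to f
eps = rng.standard_normal(1_000_000)
x = mu + sigma * eps

f = x ** 2 + c                           # shifting f changes REINFORCE, not re-param
reinforce = f * eps / sigma              # mu-component of the REINFORCE estimator
reparam = 2 * x                          # mu-component of the re-param estimator
baseline = mu ** 2 + sigma ** 2 + c      # the true E[f(x)], used as a constant baseline
reinforce_b = (f - baseline) * eps / sigma

print(reinforce.var(), reinforce_b.var(), reparam.var())
# ≈ 2.9e3 vs 4.2e1 vs 1.6e1: the baseline removes most of the excess variance
```

With these numbers, the variance of the raw REINFORCE estimate is dominated by the $c^2 / \sigma^2$ term contributed by the offset; subtracting the baseline removes that term entirely.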
The re-parameterization trick is generally not applicable to RL. Besides differentiating the action sampling w.r.t. the policy parameter $\theta$, we would also need both the state transition and the reward to be differentiable, and usually the state transition is not. However, re-parameterization is used in Soft Actor-Critic (SAC) to train the actor, whose sampled actions are pushed to maximize the learned Q function.
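For reference, a rough sketch of what a SAC-style actor loss looks like in common implementations; `policy_net`, `q_net`, the two-headed policy output, and the temperature `alpha` are placeholder assumptions, not a definitive recipe.

```python
# Sketch of a SAC-style actor update: the action is re-parameterized, so the
# Q-value is differentiable w.r.t. the policy parameters; the environment
# dynamics are never differentiated.
import torch


def actor_loss(policy_net, q_net, states, alpha=0.2):
    # `policy_net` is assumed to return the mean and log-std of a diagonal Gaussian.
    mu, log_std = policy_net(states)
    dist = torch.distributions.Normal(mu, log_std.exp())
    pre_tanh = dist.rsample()                 # re-parameterized sample: mu + std * eps
    actions = torch.tanh(pre_tanh)            # squash into the bounded action range
    # log-prob with the tanh change-of-variables correction, summed over action dims
    log_prob = (dist.log_prob(pre_tanh) - torch.log(1 - actions.pow(2) + 1e-6)).sum(-1)
    q = q_net(states, actions)                # assumed to return a [batch]-shaped tensor
    return (alpha * log_prob - q).mean()      # minimizing this maximizes Q - alpha*log_pi
```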

Last updated on 2025-05-07. Design inspired by distill.