Comparing the variance of gradient estimators: REINFORCE vs. re-parameterization
2025-02
Gradient estimates from REINFORCE and from sampling re-parameterization are equal in expectation. We discuss the main factor that causes the former to generally have higher variance.
When the parameter θ appears in the sampling distribution of an expectation, i.e. $\mathbb{E}_{p_\theta(x)}[f(x)]$, direct differentiation w.r.t. θ is not possible. The REINFORCE identity uses the log-derivative trick,
$$\nabla_\theta \, \mathbb{E}_{p_\theta(x)}[f(x)] = \mathbb{E}_{p_\theta(x)}\big[f(x)\,\nabla_\theta \log p_\theta(x)\big],$$
which gives the single-sample estimate $\hat\nabla_{\text{reinforce}} = f(x)\,\nabla_\theta \log p_\theta(x)$ with $x \sim p_\theta(x)$.
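As a quick sanity check (an illustrative toy setup, not from the post), the sketch below estimates the gradient with REINFORCE when θ is the mean of a 1-D Gaussian with unit variance and $f(x) = x^2$, so $\mathbb{E}_{p_\theta}[f(x)] = \theta^2 + 1$ and the true gradient is $2\theta$:

```python
# Toy numerical check of the REINFORCE identity (illustrative setup, not from the post):
# theta is the mean of a 1-D Gaussian with unit variance and f(x) = x**2,
# so E_{p_theta}[f(x)] = theta**2 + 1 and the true gradient is 2*theta.
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
n = 200_000

x = rng.normal(loc=theta, scale=1.0, size=n)   # x ~ p_theta(x) = N(theta, 1)
score = x - theta                              # d/dtheta log N(x; theta, 1)
reinforce_grad = np.mean(x ** 2 * score)       # Monte Carlo average of f(x) * score

print(reinforce_grad, 2 * theta)               # estimate should be close to 3.0
```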
[re-parameterization trick] This commonly used alternative moves θ out of the sampling distribution and instead supplies randomness through a parameter-free random variable ϵ,
$$\mathbb{E}_{p_\theta(x)}[f(x)] = \mathbb{E}_{\epsilon \sim p(\epsilon)}\big[f(x(\theta, \epsilon))\big].$$
For the re-param trick with a Gaussian, the sample is produced by $x(\theta, \epsilon) = \mu_\theta + \sigma_\theta \cdot \epsilon$, with $\epsilon \sim \mathcal{N}(0, 1)$. The re-param gradient estimate is
$$\hat\nabla_{\text{re-param}} = \frac{\partial f}{\partial x}\cdot\left(\frac{\partial \mu_\theta}{\partial \theta} + \frac{\partial \sigma_\theta}{\partial \theta}\cdot \epsilon\right).$$
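A toy check of this estimate (again an assumed setup, not from the post): take θ = (μ, log σ) and $f(x) = x^2$, so $\mathbb{E}[f] = \mu^2 + \sigma^2$ with $\partial/\partial\mu = 2\mu$ and $\partial/\partial\log\sigma = 2\sigma^2$:

```python
# Toy check of the re-param gradient estimate (illustrative setup, not from the post):
# theta = (mu, log_sigma) and f(x) = x**2, so E[f] = mu**2 + sigma**2,
# with d/dmu = 2*mu and d/dlog_sigma = 2*sigma**2.
import numpy as np

rng = np.random.default_rng(0)
mu, log_sigma = 1.5, 0.3
sigma = np.exp(log_sigma)
n = 200_000

eps = rng.normal(size=n)                         # parameter-free noise
x = mu + sigma * eps                             # x(theta, eps) = mu_theta + sigma_theta * eps
df_dx = 2.0 * x                                  # df/dx for f(x) = x**2

grad_mu = np.mean(df_dx * 1.0)                   # df/dx * dmu/dmu
grad_log_sigma = np.mean(df_dx * sigma * eps)    # df/dx * dsigma/dlog_sigma * eps

print(grad_mu, 2 * mu)                           # ~3.0
print(grad_log_sigma, 2 * sigma ** 2)            # ~3.64
```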
For the same Gaussian family, evaluating $\nabla_\theta \log p_\theta(x)$ at $x = \mu_\theta + \sigma_\theta\epsilon$ gives the REINFORCE estimate
$$\hat\nabla_{\text{reinforce}} = f(x)\cdot\left(\frac{\epsilon}{\sigma_\theta}\cdot\frac{\partial \mu_\theta}{\partial \theta} + \frac{\epsilon^2 - 1}{\sigma_\theta}\cdot\frac{\partial \sigma_\theta}{\partial \theta}\right).$$
The two estimators $\hat\nabla_{\text{reinforce}}$ and $\hat\nabla_{\text{re-param}}$ share the same mean.
In terms of structure, the second (bracketed) parts of the two estimators are similar: both involve an interaction of $\frac{\partial \mu_\theta}{\partial \theta}$, $\frac{\partial \sigma_\theta}{\partial \theta}$, and $\epsilon$.
What contributes to the difference in variance is mostly the first parts, namely $f(x)$ vs. $\frac{\partial f}{\partial x}$. The derivative $\frac{\partial f}{\partial x}$ is invariant to a global offset of $f$, whereas the raw value $f(x)$ is not; this is why variance reduction by baseline subtraction is helpful for REINFORCE. Moreover, in most applications $f$ is Lipschitz-continuous, so the magnitude of $\frac{\partial f}{\partial x}$ is bounded and typically small, though there can be exceptions.
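To make the variance gap concrete, here is a small empirical comparison on the same toy setup ($f(x) = x^2$, gradient w.r.t. μ only, for brevity); the baseline used for REINFORCE is the exact mean of $f$, purely for illustration:

```python
# Empirical variance comparison on the toy setup (illustrative, not from the post):
# f(x) = x**2, x ~ N(mu, sigma**2), gradient w.r.t. mu only.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 1.0
n = 200_000

eps = rng.normal(size=n)
x = mu + sigma * eps
f = x ** 2

reinforce = f * eps / sigma                 # per-sample REINFORCE estimate of d/dmu
baseline = mu ** 2 + sigma ** 2             # constant baseline; here the true E[f], for illustration
reinforce_b = (f - baseline) * eps / sigma  # REINFORCE with baseline subtraction
reparam = 2.0 * x                           # per-sample re-param estimate: df/dx * dx/dmu

for name, g in [("reinforce", reinforce),
                ("reinforce + baseline", reinforce_b),
                ("re-param", reparam)]:
    print(f"{name:>22s}  mean = {g.mean():6.3f}  var = {g.var():7.3f}")
# All three means are ~2*mu = 3.0; the variances order as
# reinforce > reinforce + baseline > re-param for this f.
```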
The re-param trick is generally not applicable to RL. Besides differentiating the action sampling w.r.t. the policy parameters θ, we would also need both the state transition and the reward to be differentiable, and the state transition usually is not.
However, re-parameterization is used in Soft Actor-Critic (SAC) to train the actor, which produces actions that maximize the learned Q functions.
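A minimal PyTorch-style sketch of this actor update is below; `policy_net` and `q_net` are hypothetical stand-ins, and the real SAC objective additionally squashes actions with tanh (with a log-prob correction) and uses twin Q networks. The key point is that `rsample()` re-parameterizes the action so the Q gradient flows back into the policy parameters:

```python
# Sketch of a SAC-style actor update via re-parameterization (illustrative only;
# policy_net and q_net are hypothetical stand-ins, not from the post).
import torch
import torch.nn as nn

state_dim, action_dim, alpha = 8, 2, 0.2
policy_net = nn.Linear(state_dim, 2 * action_dim)   # outputs mean and log_std
q_net = nn.Linear(state_dim + action_dim, 1)         # stand-in for Q(s, a)

states = torch.randn(64, state_dim)                  # dummy batch of states
mean, log_std = policy_net(states).chunk(2, dim=-1)
dist = torch.distributions.Normal(mean, log_std.exp())

# rsample() draws a = mean + std * eps with eps ~ N(0, I), so gradients flow
# back into mean/std (the re-parameterization trick); sample() would block them.
actions = dist.rsample()
log_prob = dist.log_prob(actions).sum(dim=-1, keepdim=True)

q_value = q_net(torch.cat([states, actions], dim=-1))
actor_loss = (alpha * log_prob - q_value).mean()     # maximize Q plus entropy bonus
actor_loss.backward()                                # gradient reaches policy_net via `actions`
```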
Last updated on 2025-05-07.