It's nice that we can compute the policy gradient using the "reward-to-go" instead of the sum of rewards over the whole trajectory. This writeup expands on this point, which many lecture slides do not cover in detail. It is sometimes referred to as the "policy gradient theorem".
For an $n$-step trajectory $\tau = [s_1, a_1, \ldots, s_n, a_n, s_{n+1}]$, the state transition $p(s_{t+1} \mid a_t, s_t)$ is provided by the environment, and the action transition $\pi_\theta(a_t \mid s_t)$ is provided by the policy $\pi_\theta$. The discounted reward is

$$r_\gamma(\tau) = \sum_{t=t_0}^{n} \gamma^{t - t_0}\, r(a_t, s_{t+1}), \quad \text{with } t_0 = 1.$$
The emphasis on $t_0 = 1$ is to highlight that the discount factor depends on the gap $t - t_0$; it's not just some $-1$ offset. This matters when writing value functions whose starting timestep is some $t_0 = k > 1$.
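As a concrete aside (not part of the original derivation), here is a minimal sketch of the discounted reward in code; the function name and the 0-based indexing are my own choices.

```python
import numpy as np

def discounted_reward(rewards, gamma, t0=0):
    """r_gamma for the tail of a trajectory starting at index t0.

    rewards[t] holds r(a_t, s_{t+1}); the discount exponent is the gap
    t - t0, matching the definition above (0-indexed here instead of 1).
    """
    tail = np.asarray(rewards[t0:], dtype=float)
    return float(np.sum(gamma ** np.arange(len(tail)) * tail))

# gamma = 0.9, three-step trajectory
print(discounted_reward([1.0, 2.0, 3.0], gamma=0.9))        # 1 + 0.9*2 + 0.81*3
print(discounted_reward([1.0, 2.0, 3.0], gamma=0.9, t0=1))  # 2 + 0.9*3
```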
The goal of RL is to maximize the expected reward of the policy $\pi_\theta$,

$$R = \mathbb{E}_{p_\theta(\tau)}\big[r_\gamma(\tau)\big].$$
2. V, Q, A
Given a policy $\pi_\theta$, the expected reward of the $n$-step trajectory when expanded out is

$$R = \mathbb{E}_{p_\theta(\tau)}\big[r_\gamma(\tau)\big] = \int p(s_1) \prod_{t=1}^{n} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid a_t, s_t) \cdot \sum_{t=1}^{n} \gamma^{t-1} r(a_t, s_{t+1})\; d\tau,$$
and in particular, the expected reward can be written as the expectation of $V(s_1)$ over the initial state,

$$R = \mathbb{E}_{p_\theta(\tau)}\big[r_\gamma(\tau)\big] = \mathbb{E}_{p(s_1)}\big[V(s_1)\big].$$
The value function $V(s_k)$ admits a recursive definition. First decompose the weighting probability in the expansion above. We omit the subscript $\theta$ when it's understood that the trajectory is induced by the learned policy,

$$p(\tau) = p(s_1, a_1, \ldots, s_k) \cdot p(a_k, s_{k+1}, a_{k+1}, \ldots, s_{n+1} \mid s_1, a_1, \ldots, s_k) = p(s_1, a_1, \ldots, s_k) \cdot p(\tau \mid s_k).$$
The last equality uses the Markov property. The notation $p(\tau \mid s_k)$ is a shorthand for the long expression and should read "probability of the remaining trajectory given $s_k$".
We use this recursive relation on $p(\tau \mid s_k)$ to break down $V(s_k)$ analogously,

$$
\begin{aligned}
V(s_k) &= \mathbb{E}_{p(\tau \mid s_k)}\Big[\sum_{t=k}^{n} \gamma^{t-k}\, r(a_t, s_{t+1})\Big] \\
&= \mathbb{E}_{\pi(a_k \mid s_k)}\, \mathbb{E}_{p(s_{k+1} \mid a_k, s_k)}\, \mathbb{E}_{p(\tau \mid s_{k+1})}\Big[r(a_k, s_{k+1}) + \gamma \sum_{t=k+1}^{n} \gamma^{t-(k+1)} r(a_t, s_{t+1})\Big] \\
&= \mathbb{E}_{\pi(a_k \mid s_k)}\, \mathbb{E}_{p(s_{k+1} \mid a_k, s_k)}\Big[r(a_k, s_{k+1}) + \gamma\, \mathbb{E}_{p(\tau \mid s_{k+1})}\sum_{t=k+1}^{n} \gamma^{t-(k+1)} r(a_t, s_{t+1})\Big] \\
&= \mathbb{E}_{\pi(a_k \mid s_k)}\, \mathbb{E}_{p(s_{k+1} \mid a_k, s_k)}\big[r(a_k, s_{k+1}) + \gamma V(s_{k+1})\big] \\
&= \mathbb{E}_{\pi(a_k \mid s_k)}\big[r(a_k, s_{k+1}) + \gamma V(s_{k+1})\big] \quad \text{if the env is deterministic.}
\end{aligned}
$$
So far $V(s_k)$ is defined to be timestep-aware. For long- or infinite-horizon games, the subscript $k$ may be dropped.
The Q-function $Q(s_k, a_k)$ is the inner term of $\mathbb{E}_{\pi(a_k \mid s_k)}[\cdots]$ in $V(s_k)$, with the action $a_k$ held fixed,

$$
\begin{aligned}
Q(s_k, a_k) &= \mathbb{E}_{p(s_{k+1} \mid a_k, s_k)}\, \mathbb{E}_{p(\tau \mid s_{k+1})}\Big[\sum_{t=k}^{n} \gamma^{t-k}\, r(a_t, s_{t+1})\Big] \\
&= \mathbb{E}_{p(s_{k+1} \mid a_k, s_k)}\big[r(a_k, s_{k+1}) + \gamma V(s_{k+1})\big] \\
&= r(a_k, s_{k+1}) + \gamma V(s_{k+1}) \quad \text{if the env is deterministic.}
\end{aligned}
$$
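To make the recursion concrete, here is a minimal sketch (my own illustration, not from the writeup), assuming a deterministic environment and a single sampled trajectory so the expectations collapse to the observed values:

```python
import numpy as np

def value_and_q_along_trajectory(rewards, gamma):
    """Backward recursion V[k] = r_k + gamma * V[k+1] along one trajectory.

    With a deterministic env and the single action actually taken,
    Q(s_k, a_k) = r_k + gamma * V(s_{k+1}), which coincides with V(s_k) here.
    """
    n = len(rewards)
    V = np.zeros(n + 1)              # V[n] = 0: no reward after the last step
    for k in reversed(range(n)):
        V[k] = rewards[k] + gamma * V[k + 1]
    Q = np.asarray(rewards) + gamma * V[1:]
    return V, Q

V, Q = value_and_q_along_trajectory([1.0, 0.0, 2.0], gamma=0.9)
print(V)   # [2.62, 1.8, 2.0, 0.0]
print(Q)   # equals V[:-1] on this single path
```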
Here we can see that the definitions $R, V, Q$ are just expectations of rewards along the Markov chain $\tau = [s_1, a_1, \ldots, s_n, a_n, s_{n+1}]$, obtained by fixing the conditioning variables one at a time from left to right: $R$ averages over everything, $V(s_1)$ fixes $s_1$, and $Q(s_1, a_1)$ fixes $s_1$ and $a_1$.
The Fisher score has zero expectation, $\mathbb{E}_{p_\theta(x)}\big[\nabla_\theta \log p_\theta(x)\big] = 0$. In particular, $\mathbb{E}_{p_\theta(x)}\big[\nabla_\theta \log p_\theta(x) \cdot k\big] = 0$ for any constant $k$.
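A quick numerical check of this identity (my own sketch): for a Gaussian $p_\theta(x) = \mathcal{N}(x; \theta, 1)$ the score w.r.t. $\theta$ is $x - \theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, k = 1.5, 7.0

x = rng.normal(loc=theta, scale=1.0, size=1_000_000)
score = x - theta            # d/dtheta log N(x; theta, 1)

print(np.mean(score))        # ~0: the score has zero expectation
print(np.mean(score * k))    # ~0: multiplying by a constant changes nothing
```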
[Lemma 2: Sequential structure of RL]
Consider a Markov chain $[x, y, z]$ with conditional independence $p(z \mid y, x) = p(z \mid y)$. We drop the subscript $\theta$ on probabilities when it's unambiguous. Then

$$\mathbb{E}_{p(x,y,z)}\big[\nabla_\theta \log p_\theta(y \mid x) \cdot f(x)\big] = 0,
\qquad
\mathbb{E}_{p(x,y,z)}\big[\nabla_\theta \log p_\theta(z \mid y) \cdot f(x, y)\big] = 0.$$
Both hold because, given the conditioning, $f(\cdot)$ becomes a constant. The critical part is that the Fisher score $\nabla_\theta \log(\cdot)$ is taken w.r.t. the conditional distribution; if it were $\nabla_\theta \log p(z)$, the identity would not hold.
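The same kind of Monte-Carlo check works here (again my own sketch, with a hypothetical chain where $p_\theta(y \mid x) = \mathcal{N}(y; \theta x, 1)$):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, N = 0.7, 1_000_000

x = rng.normal(size=N)                      # p(x)
y = rng.normal(loc=theta * x, scale=1.0)    # p_theta(y | x) = N(theta*x, 1)
z = rng.normal(loc=y, scale=1.0)            # p(z | y): Markov, no direct x dependence

score = (y - theta * x) * x                 # d/dtheta log p_theta(y | x)

f_past = np.sin(x) + x**2                   # function of the "past" only
print(np.mean(score * f_past))              # ~0, as the lemma states

g_future = z**2                             # function of the "future"
print(np.mean(score * g_future))            # generally nonzero: the identity needs
                                            # f to depend only on the conditioning side
```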
[Lemma 4: Optimal control variate for variance reduction] We want to reduce the variance of the Monte-Carlo estimator for the random variable $\nabla_\theta \log p_\theta(x) \cdot f(x)$. We do so by subtracting a baseline constant $k$, since that doesn't change the expectation.
For simplicity assume $\theta \in \mathbb{R}^1$, and for compactness abbreviate the score $\nabla_\theta \log p_\theta(x)$ as the scalar $s(x)$. We optimize for the $k$ that provides the most variance reduction,

$$
\begin{aligned}
\text{old var} &= \mathbb{E}_x\big[s(x)^2 f(x)^2\big] - \mathbb{E}_x\big[s(x) f(x)\big]^2 \\
\text{new var} &= \mathbb{E}_x\big[s(x)^2 (f(x) - k)^2\big] - \mathbb{E}_x\big[s(x)(f(x) - k)\big]^2 \\
&= \mathbb{E}_x\big[s(x)^2 \big(f(x)^2 - 2k f(x) + k^2\big)\big] - \mathbb{E}_x\big[s(x) f(x)\big]^2 \\
\frac{\partial}{\partial k}\big[\text{new var} - \text{old var}\big]
&= \frac{\partial}{\partial k}\, \mathbb{E}_x\big[s(x)^2 \big(k^2 - 2k f(x)\big)\big] \\
&= 2\, \mathbb{E}_x\big[s(x)^2 \big(k - f(x)\big)\big] \quad \text{set equal to } 0 \\
\Longrightarrow \quad k^* &= \frac{\mathbb{E}_x\big[s(x)^2 \cdot f(x)\big]}{\mathbb{E}_x\big[s(x)^2\big]}.
\end{aligned}
$$

(The second term of the variance is unchanged by the baseline because the score has zero expectation, so $\mathbb{E}_x[s(x)(f(x) - k)] = \mathbb{E}_x[s(x) f(x)]$.)
Note that the denominator is the Fisher information. The two random variables $s(x)^2$ and $f(x)$ are certainly not independent, but it's reasonable to assume they are uncorrelated, i.e., they share no simple linear dependence even if they have a complex relationship, like the 3rd row in this diagram.
It doesn't matter that neither $s(x)^2$ nor $f(x)$ has zero mean; covariance is translation-invariant. We assume their covariance is $0$, and thus $\mathbb{E}_x[s(x)^2 \cdot f(x)] = \mathbb{E}_x[s(x)^2] \cdot \mathbb{E}_x[f(x)]$. The optimal control variate becomes $k^* = \mathbb{E}_x[f(x)]$.
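A small empirical sanity check of the lemma (my own sketch, reusing the 1-D Gaussian score from before, with an $f$ chosen so that $\mathrm{cov}(s(x)^2, f(x)) = 0$ and hence $k^* \approx \mathbb{E}_x[f(x)]$):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(loc=theta, scale=1.0, size=1_000_000)

s = x - theta         # score of N(theta, 1) w.r.t. theta
f = 2.0 * x + 5.0     # cov(s^2, f) = 0 here, matching the assumption above

k_star = np.mean(s**2 * f) / np.mean(s**2)
print(k_star, np.mean(f))            # both ~8.0: k* ~= E[f]

print(np.var(s * f))                 # variance without a baseline (~72)
print(np.var(s * (f - k_star)))      # much smaller variance (~8)
```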
4. Reward-to-Go and Actor-Critic Architecture
Given samples of the Markov chain $\tau = [s_1, a_1, \ldots, s_n, a_n, s_{n+1}]$ and the discounted reward $r_\gamma(\tau)$, the gradient by REINFORCE is

$$\nabla_\theta R = \mathbb{E}_{p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau) \cdot r_\gamma(\tau)\big] \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log p_\theta\big(\tau^{(i)}\big) \cdot r_\gamma\big(\tau^{(i)}\big),$$

where $\tau^{(1)}, \ldots, \tau^{(N)}$ are sampled trajectories.
The reward $r_\gamma(\tau)$ is a sum over the entire trajectory. It treats $\tau$ as a whole and ignores the sequential, step-wise structure, which misses opportunities for variance reduction. Now break the Fisher score of the trajectory likelihood into steps,
$$
\begin{aligned}
\nabla_\theta \mathbb{E}_{p_\theta(\tau)}\big[r_\gamma(\tau)\big]
&= \mathbb{E}_{p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau) \cdot r_\gamma(\tau)\big] \\
&= \mathbb{E}_{p_\theta(\tau)}\Big[\Big(\sum_{k=1}^{n} \nabla_\theta \log \pi_\theta(a_k \mid s_k)\Big) \cdot \Big(\sum_{t=1}^{n} \gamma^{t-1} r(a_t, s_{t+1})\Big)\Big] \\
&= \mathbb{E}_{p_\theta(\tau)}\Big[\sum_{k=1}^{n} \sum_{t=1}^{n} \nabla_\theta \log \pi_\theta(a_k \mid s_k) \cdot \gamma^{t-1} r(a_t, s_{t+1})\Big],
\end{aligned}
$$

i.e., a sum over the cells of an $n \times n$ table whose rows are the action scores $\nabla_\theta \log \pi_\theta(a_k \mid s_k)$ and whose columns are the discounted rewards $r(a_1, s_2),\ \gamma\, r(a_2, s_3),\ \gamma^2 r(a_3, s_4), \ldots$
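Here is a minimal sketch of this estimator in code (my own illustration; it assumes a hypothetical 1-D Gaussian policy $\pi_\theta(a \mid s) = \mathcal{N}(a; \theta s, 1)$, whose score $\nabla_\theta \log \pi_\theta(a \mid s) = (a - \theta s)\, s$ has a closed form):

```python
import numpy as np

def reinforce_grad(trajectories, theta, gamma):
    """Whole-trajectory score times whole discounted reward, averaged over samples.

    Each trajectory is a (states, actions, rewards) triple of equal-length arrays.
    """
    grads = []
    for states, actions, rewards in trajectories:
        states, actions, rewards = map(np.asarray, (states, actions, rewards))
        # sum_t grad_theta log pi_theta(a_t | s_t) for the Gaussian policy
        score_sum = np.sum((actions - theta * states) * states)
        # r_gamma(tau) = sum_t gamma^(t-1) r(a_t, s_{t+1})  (0-indexed here)
        r_gamma = np.sum(gamma ** np.arange(len(rewards)) * rewards)
        grads.append(score_sum * r_gamma)
    return float(np.mean(grads))
```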
Write $\mathbb{E}_{p(x \mid y)}$ as $\mathbb{E}_{x \mid y}$.
Consider the $k$-th row, breaking $\mathbb{E}_\tau[\cdots]$ into $\mathbb{E}_{s_1, a_1 \ldots s_k}\, \mathbb{E}_{a_k, s_{k+1}, a_{k+1} \ldots s_{n+1} \mid s_k}[\cdots]$, and the reward into $r_\gamma(\tau) = \sum_{t=1}^{k-1} \cdots + \sum_{t=k}^{n} \cdots$. The first part of the reward depends only on the variables up to $s_k$, so its product with the row's score vanishes by Lemma 2, while the remaining part gives the reward-to-go,

$$
\begin{aligned}
\mathbb{E}_\tau\big[\nabla_\theta \log \pi_\theta(a_k \mid s_k) \cdot r_\gamma(\tau)\big]
&= \underbrace{\mathbb{E}_\tau\Big[\nabla_\theta \log \pi_\theta(a_k \mid s_k) \cdot \sum_{t=1}^{k-1} \gamma^{t-1} r(a_t, s_{t+1})\Big]}_{=\,0 \text{ by Lemma 2}}
 + \mathbb{E}_\tau\Big[\nabla_\theta \log \pi_\theta(a_k \mid s_k) \cdot \sum_{t=k}^{n} \gamma^{t-1} r(a_t, s_{t+1})\Big] \\
&= \gamma^{k-1}\, \mathbb{E}_{s_1, a_1 \ldots s_k}\, \mathbb{E}_{a_k \mid s_k}\big[\nabla_\theta \log \pi_\theta(a_k \mid s_k) \cdot Q(s_k, a_k)\big],
\end{aligned}
$$

and by Lemma 4 (under the uncorrelatedness assumption) the baseline $V(s_k) = \mathbb{E}_{a_k \mid s_k}[Q(s_k, a_k)]$ can be subtracted, replacing $Q(s_k, a_k)$ with the advantage $Q(s_k, a_k) - V(s_k)$.
To compute the policy gradient, we sum these contributions over all rows (one per action score), and we need estimates $\hat{Q}$ and $\hat{V}$. When these are approximated by learned functions, we arrive at the actor-critic architecture.
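A sketch of the resulting reward-to-go estimator with a baseline (again my own illustration, using the same hypothetical Gaussian policy; `value_fn` stands in for a learned critic $\hat{V}$, while the Monte-Carlo reward-to-go plays the role of $\hat{Q}$):

```python
import numpy as np

def reward_to_go_grad(trajectories, theta, gamma, value_fn):
    """Per-step score times (reward-to-go minus baseline): an advantage-weighted
    policy gradient in the actor-critic style."""
    grads = []
    for states, actions, rewards in trajectories:
        states, actions, rewards = map(np.asarray, (states, actions, rewards))
        n = len(rewards)
        # Q-hat: Monte-Carlo reward-to-go via backward recursion.
        q_hat, running = np.zeros(n), 0.0
        for t in reversed(range(n)):
            running = rewards[t] + gamma * running
            q_hat[t] = running
        advantage = q_hat - value_fn(states)           # V-hat as the baseline
        scores = (actions - theta * states) * states   # per-step Gaussian scores
        # gamma^(t-1) row weighting, matching the derivation above (0-indexed).
        grads.append(np.sum(gamma ** np.arange(n) * scores * advantage))
    return float(np.mean(grads))

# Usage with a placeholder critic that always predicts 0 (hypothetical).
traj = ([0.5, -0.2, 1.0], [0.3, 0.1, -0.4], [1.0, 0.0, 2.0])
print(reward_to_go_grad([traj], theta=0.1, gamma=0.9,
                        value_fn=lambda s: np.zeros(len(s))))
```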