Reward-to-go as variance reduction

2025-02

It's nice that we can compute the policy gradient using the "reward-to-go" instead of the whole-trajectory reward sum. This writeup expands on a step that many lecture slides do not cover in detail; the result is sometimes referred to as the "policy gradient theorem".

1. Setup

For an $n$-step trajectory $\tau = [s_1, a_1 \dots s_n, a_n, s_{n+1}]$, the state transition $p(s_{t+1} \cprob a_t, s_t)$ is provided by the environment, and the action distribution $\pi_{\theta}(a_{t} \cprob s_{t})$ is provided by the policy $\pi_{\theta}$. The discounted reward is
\begin{align} r_\gamma(\tau) = \sum_{t=t_0 \text{ with } t_0=1}^n \gamma^{t-t_0} \, r(a_t, s_{t+1}) .\end{align}
The emphasis on $t_0=1$ is to highlight that the discount factor depends on the gap $t - t_0$; it's not just some $-1$ offset. This is useful when writing value functions whose starting timestep is some $t_0 = k > 1$.
The goal of RL is to maximize the expected reward of the policy $\pi_{\theta}$,
\begin{align} R = \E_{p_{\theta}(\tau)} \left[ r_\gamma(\tau) \right] .\end{align}
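As a concrete sanity check of eqn. 1, here is a minimal Python sketch (the function name and the toy numbers are my own, not from any particular library) that computes the discounted return of a sampled reward sequence, optionally starting from a later timestep as in the value-function definition below:

```python
import numpy as np

def discounted_return(rewards, gamma, t0=0):
    """Discounted return sum_{t >= t0} gamma^(t - t0) * r_t for per-step rewards
    r_t = r(a_t, s_{t+1}); indices here are 0-based, unlike the 1-based t_0 in the text."""
    rewards = np.asarray(rewards[t0:], dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# Example: a 4-step trajectory with gamma = 0.9.
rewards = [1.0, 0.0, 2.0, 1.0]
print(discounted_return(rewards, gamma=0.9))        # full return r_gamma(tau)
print(discounted_return(rewards, gamma=0.9, t0=2))  # return starting from step k = 2
```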

2. V, Q, A

Given a policy $\pi_{\theta}$, the expected reward of the $n$-step trajectory, when expanded out, is
\begin{align} R = \E_{p_{\theta}(\tau)} \left[ r_\gamma(\tau) \right] = \E_{p_{\theta}(s_1, a_1, \dots s_n, a_n, s_{n+1} )} \left[ \sum_{t=1}^n \gamma^{t-1} \, r(a_t, s_{t+1}) \right].\end{align}
Let the value function of state $s_k$ at timestep $k$ be
\begin{align} V(s_k) = \E_{ p_{\theta}(\mycolor{blue}{a_k}, \dots s_n, a_n, s_{n+1} \cprob \mycolor{blue}{s_k}) } \left[ \sum_{t = \mycolor{blue}{k}}^n \gamma^{t-\mycolor{blue}{k}} \, r(a_t, s_{t+1}) \right] \label{eq:def_val} ,\end{align}
and in particular, the expected reward can be written as the expectation of $V(s_1)$ over the initial state,
\begin{align} R = \E_{p_{\theta}(\tau)} \left[ r_\gamma(\tau) \right] = \E_{p(s_1)} \left[ V(s_1) \right] .\end{align}
The value function $V(s_k)$ admits a recursive definition. First decompose the weighting probability in eqn. 4; we omit the subscript $\theta$ when it is understood that the trajectory is induced by the learned policy.
\begin{align} \underbrace{p(\mycolor{blue}{a_k}, \mycolor{gray}{ s_{k+1}, a_{k+1} \dots s_n, a_n, s_{n+1} } \cprob \mycolor{blue}{s_k})}_{p(\tau \cprob s_k)} &= p(a_{k},s_{k+1} \cprob s_k) \, p(a_{k+1}, \mycolor{gray}{ s_{k+2}, a_{k+2} \dots s_n, a_n, s_{n+1} } \cprob s_k, a_k, s_{k+1}) \nonumber \\ &= p(a_{k},s_{k+1} \cprob s_k) \, p(a_{k+1}, \mycolor{gray}{ s_{k+2}, a_{k+2} \dots s_n, a_n, s_{n+1} } \cprob s_{k+1}) \label{eq:step_markov} \\ p(\tau \cprob s_k) &= p(a_{k},s_{k+1} \cprob s_k) \, p(\tau \cprob s_{k+1}) \nonumber \\ &= \underbrace{\pi (a_k \cprob s_k)}_{\text{policy}} \, \underbrace{p(s_{k+1} \cprob a_k, s_k)}_{\text{env transition}} \, p(\tau \cprob s_{k+1}) .\end{align}
Eqn. 6 uses the Markov property. The notation $p(\tau \cprob s_k)$ is shorthand for the long expression and should be read as "probability of the remaining trajectory given $s_k$".
We use this recursive relation on $p(\tau \cprob s_k)$ to break down $V(s_k)$ analogously,
\begin{align} V(s_k) &= \E_{ p(\tau \cprob s_k) } \left[ \sum_{t = k}^n \gamma^{t-k} \, r(a_t, s_{t+1}) \right] \nonumber \\ &= \E_{\pi (a_k | s_k)} \E_{p(s_{k+1} | a_k, s_k)} \E_{p(\tau \cprob s_{k+1})} \left[ r(a_k, s_{k+1}) + \gamma \cdot \sum_{t = k+1}^n \gamma^{t-(k+1)} \, r(a_t, s_{t+1}) \right] \nonumber \\ &= \E_{\pi (a_k | s_k)} \E_{p(s_{k+1} | a_k, s_k)} \left[ r(a_k, s_{k+1}) + \gamma \cdot \E_{p(\tau \cprob s_{k+1})} \sum_{t = k+1}^n \gamma^{t-(k+1)} \, r(a_t, s_{t+1}) \right] \nonumber \\ V(s_k) &= \E_{\pi (a_k | s_k)} \E_{p(s_{k+1} | a_k, s_k)} \left[ r(a_k, s_{k+1}) + \gamma V(s_{k+1}) \right] \nonumber \\ &= \E_{\pi (a_k | s_k)} \left[ r(a_k, s_{k+1}) + \gamma V(s_{k+1}) \right] \qquad \text{if the env is deterministic}.\end{align}
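The recursion above translates directly into a backward dynamic-programming sweep. Below is a rough sketch on a made-up tabular MDP (the tensors `P`, `R`, and `pi` are arbitrary toy quantities, not from the writeup): it fills in the timestep-aware $V_k(s)$ from $k=n$ down to $k=1$ using $V(s_k)=\E_{a_k}\E_{s_{k+1}}[\,r(a_k,s_{k+1}) + \gamma V(s_{k+1})\,]$.

```python
import numpy as np

n, S, A, gamma = 5, 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s'] = p(s' | a, s), env transition
R = rng.normal(size=(A, S))                  # R[a, s'] = r(a, s'), per-step reward
pi = rng.dirichlet(np.ones(A), size=S)       # pi[s, a] = pi(a | s), a fixed policy

# V[k, s] = V_k(s); the value after the final transition, V_{n+1}, is 0.
V = np.zeros((n + 2, S))
for k in range(n, 0, -1):
    # Q_k(s, a) = sum_{s'} p(s' | a, s) * ( r(a, s') + gamma * V_{k+1}(s') )
    Q = np.einsum("sap,ap->sa", P, R + gamma * V[k + 1][None, :])
    V[k] = np.sum(pi * Q, axis=1)            # V_k(s) = E_{a ~ pi(.|s)} [ Q_k(s, a) ]

print(V[1])   # averaging V_1(s) over the start distribution gives the expected reward R
```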
So far $V(s_k)$ is defined to be timestep-aware; for long or infinite-horizon games the subscript $k$ may be dropped. The Q-function $Q(s_k, a_k)$ is the inner term of $\E_{\pi(a_k|s_k)} [\dots]$ in $V(s_k)$ when the action $a_k$ is taken,
\begin{align} Q(s_k, a_k) &= \E_{p(s_{k+1} | a_k, s_k)} \E_{p(\tau | s_{k+1})} \left[ \sum_{t = k}^n \gamma^{t-k} \, r(a_t, s_{t+1}) \right] \label{eq:Q_v1} \\ &= \E_{p(s_{k+1} | a_k, s_k)} \left[ r(a_k, s_{k+1}) + \gamma V(s_{k+1}) \right] \label{eq:Q_v2} \\ &= r(a_k, s_{k+1}) + \gamma V(s_{k+1}) \qquad \text{if the env is deterministic}.\end{align}
Here we can see that the definitions of $R, V, Q$ are just expectations of rewards along the Markov chain $\tau=[s_1, a_1, \dots s_n, a_n, s_{n+1}]$, obtained by fixing the conditioning variables one at a time from left to right,
\begin{align} R &= \E_{p(s_1)} \underbrace{ \E_{p(\tau \cprob s_1)} \left[ \sum_{t = 1}^n \gamma^{t-1} \, r(a_t, s_{t+1}) \right] }_{V(s_1)} = \underbrace{ \E_{p(s_1)} \underbrace{ \E_{\pi (a_1 | s_1)} \underbrace{ \E_{p(s_2 | a_1, s_1)} \E_{p(\tau \cprob s_2)} \left[ \sum_{t = 1}^n \gamma^{t-1} \, r(a_t, s_{t+1}) \right]}_{Q(s_1, a_1)} }_{V(s_1)} }_{R} \nonumber \\ &= \underbrace{ \E_{p(s_1)} \underbrace{ \E_{\pi (a_1 | s_1)} \underbrace{ \E_{p(s_2 | a_1, s_1)} \left[ r(a_1, s_2) + \gamma \underbrace{\E_{p(\tau \cprob s_2)} \sum_{t = 2}^n \gamma^{t-2} \, r(a_t, s_{t+1}) }_{V(s_2)} \right] }_{Q(s_1, a_1)} }_{V(s_1)} }_{R} \label{eq:RVQ_nesting} .\end{align}
Finally, the advantage function is defined as $A(s_k, a_k) = Q(s_k, a_k) - V(s_k)$.
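These same quantities can also be estimated by plain Monte Carlo, mirroring the nesting in eqn. 12: hold the conditioning variables fixed and average discounted returns over sampled continuations. A sketch on the same kind of made-up tabular MDP as above (all quantities invented for illustration):

```python
import numpy as np

n, S, A, gamma = 5, 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))   # p(s' | a, s)
R = rng.normal(size=(A, S))                  # r(a, s')
pi = rng.dirichlet(np.ones(A), size=S)       # pi(a | s)

def rollout_return(s, k=1, a=None):
    """Sample the remaining trajectory from step k (optionally fixing a_k = a)
    and return sum_{t=k..n} gamma^(t-k) * r(a_t, s_{t+1})."""
    total = 0.0
    for t in range(k, n + 1):
        a_t = a if (t == k and a is not None) else rng.choice(A, p=pi[s])
        s_next = rng.choice(S, p=P[s, a_t])
        total += gamma ** (t - k) * R[a_t, s_next]
        s = s_next
    return total

s1, a1, m = 0, 1, 20000
V_hat = np.mean([rollout_return(s1) for _ in range(m)])          # estimate of V(s_1)
Q_hat = np.mean([rollout_return(s1, a=a1) for _ in range(m)])    # estimate of Q(s_1, a_1)
print(V_hat, Q_hat, Q_hat - V_hat)                               # advantage A(s_1, a_1)
```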

3. Policy Gradient

Lemma 1: the first moment (expectation) of the Fisher score is $0$,
\begin{align} \E_{p_{\theta}(x)} \left[ \grad_{\theta} \log p_{\theta}(x) \right] = \cint{x} p_{\theta}(x) \cdot \grad_{\theta} \log p_{\theta}(x) = \cint{x} \grad_{\theta} p_{\theta}(x) = \grad_{\theta} \cint{x} p_{\theta}(x) = 0. \label{eq:fisher_score_expectation}\end{align}
In particular, $\E_{p_{\theta}(x)} \left[ \grad_{\theta} \log p_{\theta}(x) \cdot k \right] = 0$ for any constant $k$.
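A quick numerical check of Lemma 1, as a sketch with an arbitrary softmax-parameterized categorical $p_\theta(x)$ (for which the score has the closed form $\grad_\theta \log p_\theta(x) = e_x - p$): the sampled scores average to approximately zero.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=4)                       # logits of a categorical p_theta(x)
p = np.exp(theta - theta.max()); p /= p.sum()    # softmax probabilities

def score(x):
    """grad_theta log p_theta(x) for softmax logits: one_hot(x) - p."""
    g = -p.copy()
    g[x] += 1.0
    return g

xs = rng.choice(4, size=200_000, p=p)
print(np.mean([score(x) for x in xs], axis=0))   # ~ [0, 0, 0, 0]
```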
Lemmas 2 and 3: sequential structure of RL. Consider a Markov chain $[x, y, z]$ with conditional independence $p(z \cprob y, x) = p(z \cprob y)$; we drop the subscript $\theta$ on the probabilities when it's unambiguous. The first identity below is Lemma 2, the second is Lemma 3.
\begin{align} \E_{p(x, y)} \left[ \grad_{\theta} \log p(y \cprob x) \cdot f(x) \right] &= \E_{p(x)} \underbrace{ \E_{p(y \cprob x)} \left[ \grad_{\theta} \log p(y \cprob x) \cdot f(x) \right]}_{=0} = 0. \\ \E_{p(x, y, z)} \left[ \grad_{\theta} \log p(z \cprob y) \cdot f(x, y) \right] &= \E_{p(x, y)} \E_{p(z \cprob y)} \left[ \grad_{\theta} \log p(z \cprob y) \cdot f(x, y) \right] = 0.\end{align}
Both hold because, given the conditioning, $f(\dots)$ becomes a constant. The critical part is that the Fisher score $\grad_{\theta} \log \dots$ is taken w.r.t. the conditional distribution; with $\grad_{\theta} \log p(z)$ instead, the argument breaks down.
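The first identity (Lemma 2) can also be checked numerically. In the sketch below (a made-up two-valued $x$, a softmax-parameterized $p_\theta(y \cprob x)$, and an arbitrary $f(x)$), the score of the conditional multiplied by $f(x)$ averages to zero:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=(2, 3))      # logits of p_theta(y | x), x in {0, 1}, y in {0, 1, 2}
m = 200_000

x = rng.integers(0, 2, size=m)                              # x ~ p(x), uniform here
p = np.exp(theta[x]); p /= p.sum(axis=1, keepdims=True)     # p_theta(. | x) per sample
y = (p.cumsum(axis=1) > rng.random((m, 1))).argmax(axis=1)  # y ~ p_theta(. | x)

score = -p                                                  # active rows of grad log p(y|x)
score[np.arange(m), y] += 1.0                               # one_hot(y) - p_theta(. | x)
f = 3.0 + 2.0 * x                                           # arbitrary f that depends on x only

print(np.mean(score * f[:, None], axis=0))                  # ~ [0, 0, 0]
```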
Lemma 4: optimal control variate for variance reduction. We want to reduce the variance of the Monte-Carlo estimator of the random variable $\grad_{\theta} \log p_{\theta}(x) \cdot f(x)$. We do so by subtracting a baseline constant $k$, since this does not change the expectation:
\begin{align} \E_{p_{\theta}(x)} \left[ \grad_{\theta} \log p_{\theta}(x) \cdot f(x) \right] = \E_{p_{\theta}(x)} \left[ \grad_{\theta} \log p_{\theta}(x) \cdot \left( f(x) - k \right) \right] .\end{align}
For simplicity assume $\theta \in \R^1$, and for compactness abbreviate the score $\grad_{\theta} \log p_{\theta}(x)$ as the scalar $s(x)$. We optimize for the $k$ that provides the most variance reduction,
\begin{align} \text{old var} &= \E_x \left[ s(x)^2 f(x)^2 \right] - \E_x \left[ s(x) f(x) \right]^2 \\ \text{new var} &= \E_x \left[ s(x)^2(f(x) - k)^2 \right] - \E_x \left[ s(x)(f(x) - k) \right]^2 \nonumber \\ &= \E_x \left[ s(x)^2 \left( f(x)^2 -2k f(x) + k^2 \right) \right] - \E_x \left[ s(x)f(x) \right]^2 \\ \text{for } \pdv{}{k} [ \text{new var} - \text{old var} ] &= \pdv{}{k} \E_x \left[ s(x)^2 \left( k^2 -2k f(x) \right) \right] \nonumber \\ &= 2 \cdot \E_x \left[ s(x)^2 \left( k - f(x) \right) \right] \text{ to be equal to } 0 \nonumber \\ k^* &= \frac{\E_x[s(x)^2 \cdot f(x)] }{\E_x[s(x)^2]} .\end{align}
Note that the denominator is the Fisher information. The two random variables $s(x)^2$ and $f(x)$ are certainly not independent, but it's reasonable to assume they are uncorrelated, i.e. they share no linear dependence even though they may have a more complex, nonlinear relationship.
It doesn't matter that neither $s(x)^2$ nor $f(x)$ has zero mean; covariance is translation-invariant. Assuming their covariance is $0$, we get $\E_x[s(x)^2 \cdot f(x)] = \E_x[s(x)^2] \cdot \E_x[f(x)]$, and the optimal control variate becomes $k^* = \E_x[f(x)]$.
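Lemma 4 is easy to verify empirically. The sketch below uses a made-up 1-D Gaussian family $p_\theta = \mathcal{N}(\theta, 1)$ (so $s(x) = x - \theta$) and an arbitrary $f$; in this particular toy example $s(x)^2$ and $f(x)$ happen to be correlated, so $k^*$ and $\E_x[f(x)]$ differ, but both baselines reduce the variance and $k^*$ reduces it the most.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, m = 0.5, 500_000
x = rng.normal(theta, 1.0, size=m)

s = x - theta                  # score of N(theta, 1): d/dtheta log p_theta(x) = x - theta
f = x ** 2 + 3.0               # arbitrary f(x)

k_star = np.mean(s ** 2 * f) / np.mean(s ** 2)      # optimal control variate (Lemma 4)
print("k* =", k_star, "  E[f] =", np.mean(f))
print("var, no baseline :", np.var(s * f))
print("var, k = E[f]    :", np.var(s * (f - np.mean(f))))
print("var, k = k*      :", np.var(s * (f - k_star)))
```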

4. Reward-to-Go and Actor-Critic Architecture

Given samples of the Markov chain $\tau = [s_1, a_1 \dots s_n, a_n, s_{n+1}]$ and the discounted reward $r_{\gamma}(\tau)$, the REINFORCE gradient is
\begin{align} \grad_{\theta} \E_{p_{\theta}(\tau)} [\, r_\gamma(\tau) \,] &= \E_{p_{\theta}(\tau)} \left[\, \grad_{\theta} \log p_{\theta}(\tau) \cdot r_{\gamma}(\tau) \, \right] \nonumber \\ &= \E_{p_{\theta}(\tau)} \left[ \grad_{\theta} \log p_{\theta}(\tau) \cdot \left( \sum_{t=1}^n \gamma^{t-1} \, r(a_t, s_{t+1}) \right) \right].\end{align}
The reward $r_{\gamma}(\tau)$ is a sum over the entire trajectory: it treats $\tau$ as a whole and ignores the sequential, step-wise structure, missing opportunities for variance reduction.
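For reference, a rough sketch of the naive estimator in eqn. 20 on a made-up tabular softmax policy (all names and the toy environment are invented for illustration; for softmax logits the per-step score is again $e_{a} - \pi_\theta(\cdot \cprob s)$). Every step's score is weighted by the same whole-trajectory return:

```python
import numpy as np

rng = np.random.default_rng(0)
n, S, A, gamma = 5, 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # toy env transition p(s' | a, s)
R = rng.normal(size=(A, S))                  # toy reward r(a, s')
theta = np.zeros((S, A))                     # tabular softmax policy parameters

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def sample_trajectory():
    s = rng.integers(S)
    traj = []
    for _ in range(n):
        p = policy(s)
        a = rng.choice(A, p=p)
        s_next = rng.choice(S, p=P[s, a])
        traj.append((s, a, R[a, s_next], p))
        s = s_next
    return traj

def grad_sample_full_return(traj):
    """One sample of grad log p_theta(tau) * r_gamma(tau)."""
    ret = sum(gamma ** t * r for t, (_, _, r, _) in enumerate(traj))  # r_gamma(tau)
    g = np.zeros_like(theta)
    for s, a, _, p in traj:
        score = -p.copy(); score[a] += 1.0   # grad w.r.t. theta[s]: one_hot(a) - pi(.|s)
        g[s] += score * ret                  # same whole-trajectory weight at every step
    return g

grad = np.mean([grad_sample_full_return(sample_trajectory()) for _ in range(2000)], axis=0)
print(grad)
```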
Now break the Fisher score of the trajectory likelihood into steps,
\begin{align} \grad_{\theta} \log p_{\theta}(\tau) &= \cancel{ \grad_{\theta} \left[ \log p(s_1) + \sum_{t=1}^n \log p(s_{t+1} \cprob a_t, s_t) \right] } + \grad_{\theta} \sum_{t=1}^n \log \pi_{\theta}(a_t \cprob s_t) .\end{align}
Substitute it in,
\begin{align} \grad_{\theta} \E_{p_{\theta}(\tau)} \left[ r_\gamma(\tau) \right] &= \E_{p_{\theta}(\tau)} \left[ \grad_{\theta} \log p_{\theta}(\tau) \cdot r_\gamma(\tau) \right] \nonumber \\ &= \E_{p_{\theta}(\tau)} \left[ \left( \sum_{t=1}^n \grad_{\theta} \log \pi_{\theta}(a_t \cprob s_t) \right) \cdot \left( \sum_{t=1}^n \gamma^{t-1} \, r(a_t, s_{t+1}) \right) \right] \nonumber \\ &= \E_{p_{\theta}(\tau)} \left[ \begin{array}{c|cccc} \text{sum over cells} & r(a_1, s_2) & + \gamma \, r(a_2, s_3) & + \gamma^2 \, r(a_3, s_4) & + \dots \\ \hline \phantom{+} \grad_{\theta} \log \pi_{\theta}(a_1 \cprob s_1) & \ddots \\ +\grad_{\theta} \log \pi_{\theta}(a_2 \cprob s_2) \\ +\grad_{\theta} \log \pi_{\theta}(a_3 \cprob s_3) \\ \vdots \\ \end{array} \right] .\end{align}
Write $\E_{p(x|y)}$ as $\E_{x|y}$. Consider the $k$-th row, breaking $\E_{\tau} [\dots]$ into $\E_{s_1, a_1 \dots s_k} \E_{a_k, s_{k+1}, a_{k+1} \dots s_{n+1} \cprob s_k} [ \dots ]$ and the reward into $r_{\gamma}(\tau) = \sum_{t=1}^{k-1} \dots + \sum_{t=k}^{n} \dots$,
\begin{align} \text{$k$-th row} &= \E_{\tau}[ \grad_{\theta} \log \pi_{\theta} (a_k \cprob s_k) \cdot r_{\gamma}(\tau) ] \nonumber \\ &= \E_{s_1, a_1 \dots s_k} \E_{a_k, s_{k+1}, a_{k+1} \dots s_{n+1} | s_k} \left[ \grad_{\theta} \log \pi_{\theta} (a_k \cprob s_k) \cdot \left( \sum_{t=1}^{k-1} \gamma^{t-1} r(a_t, s_{t+1}) + \sum_{t=k}^{n} \gamma^{t-1} r(a_t, s_{t+1}) \right) \right] \nonumber \\ &= \bigg\{ \cancel{ \E_{s_1, a_1 \dots s_k} \E_{\mycolor{blue}{a_k | s_k}} \E_{s_{k+1}, a_{k+1} \dots s_{n+1} | s_k,a_k} \left[ \grad_{\theta} \log \pi_{\theta} (\mycolor{blue}{a_k \cprob s_k}) \cdot \left( \sum_{t=1}^{k-1} \gamma^{t-1} r(a_t, s_{t+1}) \right) \right] } \nonumber \\ & + \E_{s_1, a_1 \dots s_k} \E_{a_k, s_{k+1}, a_{k+1} \dots s_{n+1} | s_k} \left[ \grad_{\theta} \log \pi_{\theta} (a_k \cprob s_k) \cdot \left( \sum_{t=k}^{n} \gamma^{t-1} r(a_t, s_{t+1}) \right) \right] \bigg\} .\end{align}
The first part is $0$ due to Lemma 3: after the innermost $\E_{s_{k+1}, a_{k+1} \dots s_{n+1} | s_k, a_k}$ is marginalized out, what remains is the score of $\pi_\theta(a_k \cprob s_k)$ multiplied by a function of only the earlier states and actions.
Reorganize the second part, noting that $\gamma^{t-1} = \gamma^{t-k} \cdot \gamma^{k-1}$,
\begin{align} \text{$k$-th row} &= \E_{s_1, a_1 \dots, s_k}\E_{a_k, s_{k+1}, a_{k+1} \dots s_{n+1} | s_k} \left[ \grad_{\theta} \log \pi_{\theta} (a_k \cprob s_k) \cdot \left( \sum_{t=k}^{n} \gamma^{t-1} r(a_t, s_{t+1}) \right) \right] \nonumber \\ &= \E_{s_1, a_1 \dots s_k} \E_{a_k | s_k} \left[ \grad_{\theta} \log \pi_{\theta} (a_k \cprob s_k) \cdot \gamma^{k-1} \, \E_{s_{k+1}, a_{k+1} \dots s_{n+1} | s_k, a_k} \left[ \sum_{t=k}^{n} \gamma^{t-k} \, r(a_t, s_{t+1}) \right] \right] \nonumber \\ &= \E_{s_1, a_1 \dots s_k} \E_{a_k | s_k} \left[ \grad_{\theta} \log \pi_{\theta} (a_k \cprob s_k) \cdot \gamma^{k-1} Q(s_k, a_k) \right] \nonumber \\ &= \E_{s_1, a_1 \dots s_{k-1}, a_{k-1}} \E_{s_k | s_{k-1}, a_{k-1}} \E_{a_k | s_k} \left[ \grad_{\theta} \log \pi_{\theta} (a_k \cprob s_k) \cdot \gamma^{k-1} Q(s_k, a_k) \right] \label{eq:law_tot_exp} \\ &= \E_{s_k} \E_{a_k | s_k} \left[ \grad_{\theta} \log \pi_{\theta} (a_k \cprob s_k) \cdot \gamma^{k-1} Q(s_k, a_k) \right] .\end{align}
The transition after eq. 24 uses the law of total expectation, i.e. $\E_a \E_{b|a} [f(b)] = \E_b[f(b)]$.
The quantity $\gamma^{k-1} Q(s_k, a_k)$ is the "reward-to-go". The optimal control variate, according to Lemma 4, is
\begin{align} \E_{a_k | s_k} \left[\, \gamma^{k-1} Q(s_k, a_k) \,\right] = \gamma^{k-1} V(s_k),\end{align}
and thus the optimally variance-reduced $k$-th row is
\begin{align} \gamma^{k-1} \E_{s_k} \E_{a_k | s_k} \left[ \grad_{\theta} \log \pi_{\theta} (a_k \cprob s_k) \cdot \bigg( \underbrace{ Q(s_k, a_k) - V(s_k) }_{A(s_k, a_k)} \bigg) \right].\end{align}
To compute the policy gradient, we sum over the rows (one per action score), and we need estimates $\hat{Q}$ and $\hat{V}$. When these are approximated by learned functions, we arrive at the actor-critic architecture.
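Concretely, a minimal sketch of the resulting per-step weights: $\hat{Q}$ is the sampled reward-to-go and $\hat{V}$ is a stand-in critic (a learned network in a real actor-critic; here just an arbitrary table so the sketch stays self-contained). Each action score $\grad_\theta \log \pi_\theta(a_k \cprob s_k)$ would then be multiplied by `gamma ** k * adv[k]` instead of the whole-trajectory return:

```python
import numpy as np

gamma = 0.9
rng = np.random.default_rng(0)
V_hat = rng.normal(size=3)        # stand-in critic V_hat(s) for 3 states; learned in practice

def rewards_to_go(rewards):
    """q_hat[k] = sum_{t >= k} gamma^(t - k) * r_t, via a single backward scan."""
    q = np.zeros(len(rewards))
    running = 0.0
    for k in reversed(range(len(rewards))):
        running = rewards[k] + gamma * running
        q[k] = running
    return q

# One sampled trajectory (0-indexed steps), with made-up states and rewards.
states  = np.array([0, 2, 1, 1])
rewards = np.array([1.0, 0.0, 2.0, 1.0])

q_hat = rewards_to_go(rewards)                     # Q_hat(s_k, a_k), the sampled reward-to-go
adv   = q_hat - V_hat[states]                      # A_hat(s_k, a_k) = Q_hat - V_hat
weights = gamma ** np.arange(len(states)) * adv    # gamma^k * A_hat multiplies each score
print(weights)
```

Training $\hat{V}$ (for example by regressing it onto `q_hat`) and plugging these weights into the per-step scores gives the usual actor-critic update.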
