It's nice that we can compute the policy gradient using the "reward-to-go" instead of the sum of rewards over the whole trajectory. This writeup expands on this point, which many lecture slides do not cover in detail. It is sometimes referred to as the "policy gradient theorem".
For an $n$-step trajectory $\tau = [s_1, a_1, \ldots, s_n, a_n, s_{n+1}]$, the state transition $p(s_{t+1} \mid a_t, s_t)$ is provided by the environment, and the action transition $\pi_\theta(a_t \mid s_t)$ is provided by the policy $\pi_\theta$. The discounted reward is

$$r_\gamma(\tau) = \sum_{t=t_0}^{n} \gamma^{t - t_0}\, r(a_t, s_{t+1}), \quad \text{with } t_0 = 1.$$
The emphasis on $t_0 = 1$ is to highlight that the discount factor depends on the gap $t - t_0$; it's not just some $-1$ offset. This matters when writing value functions whose starting timestep is some $t_0 = k > 1$.
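As a concrete aside (not part of the original derivation), here is a minimal sketch of the discounted reward in code; the function name and the 0-based indexing are my own choices.

```python
import numpy as np

def discounted_reward(rewards, gamma, t0=0):
    """r_gamma for the tail of a trajectory starting at index t0.

    rewards[t] holds r(a_t, s_{t+1}); the discount exponent is the gap
    t - t0, matching the definition above (0-indexed here instead of 1).
    """
    tail = np.asarray(rewards[t0:], dtype=float)
    return float(np.sum(gamma ** np.arange(len(tail)) * tail))

# gamma = 0.9, three-step trajectory
print(discounted_reward([1.0, 2.0, 3.0], gamma=0.9))        # 1 + 0.9*2 + 0.81*3
print(discounted_reward([1.0, 2.0, 3.0], gamma=0.9, t0=1))  # 2 + 0.9*3
```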
The goal of RL is to maximize the expected reward of the policy $\pi_\theta$,

$$R = \mathbb{E}_{p_\theta(\tau)}\big[r_\gamma(\tau)\big].$$
2. V, Q, A
Given a policy $\pi_\theta$, the expected reward of the $n$-step trajectory when expanded out is

$$R = \mathbb{E}_{p_\theta(\tau)}\big[r_\gamma(\tau)\big] = \int p(s_1) \prod_{t=1}^{n} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid a_t, s_t) \cdot \sum_{t=1}^{n} \gamma^{t-1} r(a_t, s_{t+1})\; d\tau,$$
and in particular, the expected reward can be written as the expectation of $V(s_1)$ over the initial state,

$$R = \mathbb{E}_{p_\theta(\tau)}\big[r_\gamma(\tau)\big] = \mathbb{E}_{p(s_1)}\big[V(s_1)\big].$$
The value function $V(s_k)$ admits a recursive definition. First decompose the weighting probability in the expansion above. We omit the subscript $\theta$ when it's understood that the trajectory is induced by the learned policy,

$$p(\tau) = p(s_1, a_1, \ldots, s_k) \cdot p(a_k, s_{k+1}, a_{k+1}, \ldots, s_{n+1} \mid s_1, a_1, \ldots, s_k) = p(s_1, a_1, \ldots, s_k) \cdot p(\tau \mid s_k).$$
The last equality uses the Markov property. The notation $p(\tau \mid s_k)$ is a shorthand for the long expression and should read "probability of the remaining trajectory given $s_k$".
We use this recursive relation on $p(\tau \mid s_k)$ to break down $V(s_k)$ analogously,

$$
\begin{aligned}
V(s_k) &= \mathbb{E}_{p(\tau \mid s_k)}\Big[\sum_{t=k}^{n} \gamma^{t-k}\, r(a_t, s_{t+1})\Big] \\
&= \mathbb{E}_{\pi(a_k \mid s_k)}\, \mathbb{E}_{p(s_{k+1} \mid a_k, s_k)}\, \mathbb{E}_{p(\tau \mid s_{k+1})}\Big[r(a_k, s_{k+1}) + \gamma \sum_{t=k+1}^{n} \gamma^{t-(k+1)} r(a_t, s_{t+1})\Big] \\
&= \mathbb{E}_{\pi(a_k \mid s_k)}\, \mathbb{E}_{p(s_{k+1} \mid a_k, s_k)}\Big[r(a_k, s_{k+1}) + \gamma\, \mathbb{E}_{p(\tau \mid s_{k+1})}\sum_{t=k+1}^{n} \gamma^{t-(k+1)} r(a_t, s_{t+1})\Big] \\
&= \mathbb{E}_{\pi(a_k \mid s_k)}\, \mathbb{E}_{p(s_{k+1} \mid a_k, s_k)}\big[r(a_k, s_{k+1}) + \gamma V(s_{k+1})\big] \\
&= \mathbb{E}_{\pi(a_k \mid s_k)}\big[r(a_k, s_{k+1}) + \gamma V(s_{k+1})\big] \quad \text{if the env is deterministic.}
\end{aligned}
$$
So far $V(s_k)$ is defined to be timestep-aware. For long- or infinite-horizon games, the subscript $k$ may be dropped.
The Q-function $Q(s_k, a_k)$ is the inner term of $\mathbb{E}_{\pi(a_k \mid s_k)}[\cdots]$ in $V(s_k)$, with the action $a_k$ held fixed,

$$
\begin{aligned}
Q(s_k, a_k) &= \mathbb{E}_{p(s_{k+1} \mid a_k, s_k)}\, \mathbb{E}_{p(\tau \mid s_{k+1})}\Big[\sum_{t=k}^{n} \gamma^{t-k}\, r(a_t, s_{t+1})\Big] \\
&= \mathbb{E}_{p(s_{k+1} \mid a_k, s_k)}\big[r(a_k, s_{k+1}) + \gamma V(s_{k+1})\big] \\
&= r(a_k, s_{k+1}) + \gamma V(s_{k+1}) \quad \text{if the env is deterministic.}
\end{aligned}
$$
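To make the recursion concrete, here is a minimal sketch (my own illustration, not from the writeup), assuming a deterministic environment and a single sampled trajectory so the expectations collapse to the observed values:

```python
import numpy as np

def value_and_q_along_trajectory(rewards, gamma):
    """Backward recursion V[k] = r_k + gamma * V[k+1] along one trajectory.

    With a deterministic env and the single action actually taken,
    Q(s_k, a_k) = r_k + gamma * V(s_{k+1}), which coincides with V(s_k) here.
    """
    n = len(rewards)
    V = np.zeros(n + 1)              # V[n] = 0: no reward after the last step
    for k in reversed(range(n)):
        V[k] = rewards[k] + gamma * V[k + 1]
    Q = np.asarray(rewards) + gamma * V[1:]
    return V, Q

V, Q = value_and_q_along_trajectory([1.0, 0.0, 2.0], gamma=0.9)
print(V)   # [2.62, 1.8, 2.0, 0.0]
print(Q)   # equals V[:-1] on this single path
```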
Here we can see that the definitions $R, V, Q$ are just expectations of rewards along the Markov chain $\tau = [s_1, a_1, \ldots, s_n, a_n, s_{n+1}]$, obtained by fixing the conditioning variables one at a time from left to right: $R$ averages over everything, $V(s_1)$ fixes $s_1$, and $Q(s_1, a_1)$ fixes $s_1$ and $a_1$.
The Fisher score has zero expectation, $\mathbb{E}_{p_\theta(x)}\big[\nabla_\theta \log p_\theta(x)\big] = 0$. In particular, $\mathbb{E}_{p_\theta(x)}\big[\nabla_\theta \log p_\theta(x) \cdot k\big] = 0$ for any constant $k$.
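A quick numerical check of this identity (my own sketch): for a Gaussian $p_\theta(x) = \mathcal{N}(x; \theta, 1)$ the score w.r.t. $\theta$ is $x - \theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, k = 1.5, 7.0

x = rng.normal(loc=theta, scale=1.0, size=1_000_000)
score = x - theta            # d/dtheta log N(x; theta, 1)

print(np.mean(score))        # ~0: the score has zero expectation
print(np.mean(score * k))    # ~0: multiplying by a constant changes nothing
```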
[Lemma 2: Sequential structure of RL]
Consider a Markov chain $[x, y, z]$ with conditional independence $p(z \mid y, x) = p(z \mid y)$. We drop the subscript $\theta$ on probabilities when it's unambiguous. Then

$$\mathbb{E}_{p(x,y,z)}\big[\nabla_\theta \log p_\theta(y \mid x) \cdot f(x)\big] = 0,
\qquad
\mathbb{E}_{p(x,y,z)}\big[\nabla_\theta \log p_\theta(z \mid y) \cdot f(x, y)\big] = 0.$$
Both hold because, given the conditioning, $f(\cdot)$ becomes a constant. The critical part is that the Fisher score $\nabla_\theta \log(\cdot)$ is taken w.r.t. the conditional distribution; if it were $\nabla_\theta \log p(z)$, the identity would not hold.
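The same kind of Monte-Carlo check works here (again my own sketch, with a hypothetical chain where $p_\theta(y \mid x) = \mathcal{N}(y; \theta x, 1)$):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, N = 0.7, 1_000_000

x = rng.normal(size=N)                      # p(x)
y = rng.normal(loc=theta * x, scale=1.0)    # p_theta(y | x) = N(theta*x, 1)
z = rng.normal(loc=y, scale=1.0)            # p(z | y): Markov, no direct x dependence

score = (y - theta * x) * x                 # d/dtheta log p_theta(y | x)

f_past = np.sin(x) + x**2                   # function of the "past" only
print(np.mean(score * f_past))              # ~0, as the lemma states

g_future = z**2                             # function of the "future"
print(np.mean(score * g_future))            # generally nonzero: the identity needs
                                            # f to depend only on the conditioning side
```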
[Lemma 4: Optimal control variate for variance reduction] We want to reduce the variance of the Monte-Carlo estimator for the random variable $\nabla_\theta \log p_\theta(x) \cdot f(x)$. We do so by subtracting a baseline constant $k$, since that doesn't change the expectation.
For simplicity assume $\theta \in \mathbb{R}^1$, and for compactness abbreviate the score $\nabla_\theta \log p_\theta(x)$ as the scalar $s(x)$. We optimize for the $k$ that provides the most variance reduction,

$$
\begin{aligned}
\text{old var} &= \mathbb{E}_x\big[s(x)^2 f(x)^2\big] - \mathbb{E}_x\big[s(x) f(x)\big]^2 \\
\text{new var} &= \mathbb{E}_x\big[s(x)^2 (f(x) - k)^2\big] - \mathbb{E}_x\big[s(x)(f(x) - k)\big]^2 \\
&= \mathbb{E}_x\big[s(x)^2 \big(f(x)^2 - 2k f(x) + k^2\big)\big] - \mathbb{E}_x\big[s(x) f(x)\big]^2 \\
\frac{\partial}{\partial k}\big[\text{new var} - \text{old var}\big]
&= \frac{\partial}{\partial k}\, \mathbb{E}_x\big[s(x)^2 \big(k^2 - 2k f(x)\big)\big] \\
&= 2\, \mathbb{E}_x\big[s(x)^2 \big(k - f(x)\big)\big] \quad \text{set equal to } 0 \\
\Longrightarrow \quad k^* &= \frac{\mathbb{E}_x\big[s(x)^2 \cdot f(x)\big]}{\mathbb{E}_x\big[s(x)^2\big]}.
\end{aligned}
$$

(The second term of the variance is unchanged by the baseline because the score has zero expectation, so $\mathbb{E}_x[s(x)(f(x) - k)] = \mathbb{E}_x[s(x) f(x)]$.)
Note that the denominator is the Fisher information. The two random variables $s(x)^2$ and $f(x)$ are certainly not independent, but it's reasonable to assume they are uncorrelated, i.e., they share no simple linear dependence even if they have a complex relationship, like the 3rd row in this diagram.
It doesn't matter that neither $s(x)^2$ nor $f(x)$ has zero mean; covariance is translation-invariant. We assume their covariance is $0$, and thus $\mathbb{E}_x[s(x)^2 \cdot f(x)] = \mathbb{E}_x[s(x)^2] \cdot \mathbb{E}_x[f(x)]$. The optimal control variate becomes $k^* = \mathbb{E}_x[f(x)]$.
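A small empirical sanity check of the lemma (my own sketch, reusing the 1-D Gaussian score from before, with an $f$ chosen so that $\mathrm{cov}(s(x)^2, f(x)) = 0$ and hence $k^* \approx \mathbb{E}_x[f(x)]$):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(loc=theta, scale=1.0, size=1_000_000)

s = x - theta         # score of N(theta, 1) w.r.t. theta
f = 2.0 * x + 5.0     # cov(s^2, f) = 0 here, matching the assumption above

k_star = np.mean(s**2 * f) / np.mean(s**2)
print(k_star, np.mean(f))            # both ~8.0: k* ~= E[f]

print(np.var(s * f))                 # variance without a baseline (~72)
print(np.var(s * (f - k_star)))      # much smaller variance (~8)
```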
4. Reward-to-Go and Actor-Critic Architecture
Given samples of the Markov chain $\tau = [s_1, a_1, \ldots, s_n, a_n, s_{n+1}]$ and the discounted reward $r_\gamma(\tau)$, the gradient by REINFORCE is

$$\nabla_\theta R = \mathbb{E}_{p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau) \cdot r_\gamma(\tau)\big] \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log p_\theta\big(\tau^{(i)}\big) \cdot r_\gamma\big(\tau^{(i)}\big),$$

where $\tau^{(1)}, \ldots, \tau^{(N)}$ are sampled trajectories.
The reward $r_\gamma(\tau)$ is a sum over the entire trajectory. It treats $\tau$ as a whole and ignores the sequential, step-wise structure, which misses opportunities for variance reduction. Now break the Fisher score of the trajectory likelihood into steps,
$$
\begin{aligned}
\nabla_\theta \mathbb{E}_{p_\theta(\tau)}\big[r_\gamma(\tau)\big]
&= \mathbb{E}_{p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau) \cdot r_\gamma(\tau)\big] \\
&= \mathbb{E}_{p_\theta(\tau)}\Big[\Big(\sum_{k=1}^{n} \nabla_\theta \log \pi_\theta(a_k \mid s_k)\Big) \cdot \Big(\sum_{t=1}^{n} \gamma^{t-1} r(a_t, s_{t+1})\Big)\Big] \\
&= \mathbb{E}_{p_\theta(\tau)}\Big[\sum_{k=1}^{n} \sum_{t=1}^{n} \nabla_\theta \log \pi_\theta(a_k \mid s_k) \cdot \gamma^{t-1} r(a_t, s_{t+1})\Big],
\end{aligned}
$$

i.e., a sum over the cells of an $n \times n$ table whose rows are the action scores $\nabla_\theta \log \pi_\theta(a_k \mid s_k)$ and whose columns are the discounted rewards $r(a_1, s_2),\ \gamma\, r(a_2, s_3),\ \gamma^2 r(a_3, s_4), \ldots$
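Here is a minimal sketch of this estimator in code (my own illustration; it assumes a hypothetical 1-D Gaussian policy $\pi_\theta(a \mid s) = \mathcal{N}(a; \theta s, 1)$, whose score $\nabla_\theta \log \pi_\theta(a \mid s) = (a - \theta s)\, s$ has a closed form):

```python
import numpy as np

def reinforce_grad(trajectories, theta, gamma):
    """Whole-trajectory score times whole discounted reward, averaged over samples.

    Each trajectory is a (states, actions, rewards) triple of equal-length arrays.
    """
    grads = []
    for states, actions, rewards in trajectories:
        states, actions, rewards = map(np.asarray, (states, actions, rewards))
        # sum_t grad_theta log pi_theta(a_t | s_t) for the Gaussian policy
        score_sum = np.sum((actions - theta * states) * states)
        # r_gamma(tau) = sum_t gamma^(t-1) r(a_t, s_{t+1})  (0-indexed here)
        r_gamma = np.sum(gamma ** np.arange(len(rewards)) * rewards)
        grads.append(score_sum * r_gamma)
    return float(np.mean(grads))
```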
Write $\mathbb{E}_{p(x \mid y)}$ as $\mathbb{E}_{x \mid y}$.
Consider the $k$-th row, breaking $\mathbb{E}_\tau[\cdots]$ into $\mathbb{E}_{s_1, a_1 \ldots s_k}\, \mathbb{E}_{a_k, s_{k+1}, a_{k+1} \ldots s_{n+1} \mid s_k}[\cdots]$, and the reward into $r_\gamma(\tau) = \sum_{t=1}^{k-1} \cdots + \sum_{t=k}^{n} \cdots$. The first part of the reward depends only on the variables up to $s_k$, so its product with the row's score vanishes by Lemma 2, while the remaining part gives the reward-to-go,

$$
\begin{aligned}
\mathbb{E}_\tau\big[\nabla_\theta \log \pi_\theta(a_k \mid s_k) \cdot r_\gamma(\tau)\big]
&= \underbrace{\mathbb{E}_\tau\Big[\nabla_\theta \log \pi_\theta(a_k \mid s_k) \cdot \sum_{t=1}^{k-1} \gamma^{t-1} r(a_t, s_{t+1})\Big]}_{=\,0 \text{ by Lemma 2}}
 + \mathbb{E}_\tau\Big[\nabla_\theta \log \pi_\theta(a_k \mid s_k) \cdot \sum_{t=k}^{n} \gamma^{t-1} r(a_t, s_{t+1})\Big] \\
&= \gamma^{k-1}\, \mathbb{E}_{s_1, a_1 \ldots s_k}\, \mathbb{E}_{a_k \mid s_k}\big[\nabla_\theta \log \pi_\theta(a_k \mid s_k) \cdot Q(s_k, a_k)\big],
\end{aligned}
$$

and by Lemma 4 (under the uncorrelatedness assumption) the baseline $V(s_k) = \mathbb{E}_{a_k \mid s_k}[Q(s_k, a_k)]$ can be subtracted, replacing $Q(s_k, a_k)$ with the advantage $Q(s_k, a_k) - V(s_k)$.
To compute the policy gradient, we sum these contributions over all rows (one per action score), and we need estimates $\hat{Q}$ and $\hat{V}$. When these are approximated by learned functions, we arrive at the actor-critic architecture.
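A sketch of the resulting reward-to-go estimator with a baseline (again my own illustration, using the same hypothetical Gaussian policy; `value_fn` stands in for a learned critic $\hat{V}$, while the Monte-Carlo reward-to-go plays the role of $\hat{Q}$):

```python
import numpy as np

def reward_to_go_grad(trajectories, theta, gamma, value_fn):
    """Per-step score times (reward-to-go minus baseline): an advantage-weighted
    policy gradient in the actor-critic style."""
    grads = []
    for states, actions, rewards in trajectories:
        states, actions, rewards = map(np.asarray, (states, actions, rewards))
        n = len(rewards)
        # Q-hat: Monte-Carlo reward-to-go via backward recursion.
        q_hat, running = np.zeros(n), 0.0
        for t in reversed(range(n)):
            running = rewards[t] + gamma * running
            q_hat[t] = running
        advantage = q_hat - value_fn(states)           # V-hat as the baseline
        scores = (actions - theta * states) * states   # per-step Gaussian scores
        # gamma^(t-1) row weighting, matching the derivation above (0-indexed).
        grads.append(np.sum(gamma ** np.arange(n) * scores * advantage))
    return float(np.mean(grads))

# Usage with a placeholder critic that always predicts 0 (hypothetical).
traj = ([0.5, -0.2, 1.0], [0.3, 0.1, -0.4], [1.0, 0.0, 2.0])
print(reward_to_go_grad([traj], theta=0.1, gamma=0.9,
                        value_fn=lambda s: np.zeros(len(s))))
```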