Policy Gradient

Policy Gradients and Baseline Functions

Policy Gradient

Parameterize a stationary Markov stochastic policy as $\pi_{\theta}(a \mid s)$. To maximize the policy's expected return, define the objective function:

$$
\begin{aligned}
J(\theta) &= \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi_{\theta}(\cdot \mid s_{0})} \mathcal{E}_{r_{1} \sim r(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi_{\theta}(\cdot \mid s_{1})} \mathcal{E}_{r_{2} \sim r(\cdot \mid s_{1},\ a_{1})} \cdots \mathcal{E}_{r_{\mathrm{T} + 1} \sim r(\cdot \mid s_{\mathrm{T}},\ a_{\mathrm{T}})} \left[ \sum_{t = 0}^{\mathrm{T}} \gamma^{t} r_{t + 1} \right] \\[7mm]
&= \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi_{\theta}(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi_{\theta}(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{\mathrm{T}} \sim p(\cdot \mid s_{\mathrm{T} - 1},\ a_{\mathrm{T} - 1})} \mathcal{E}_{a_{\mathrm{T}} \sim \pi_{\theta}(\cdot \mid s_{\mathrm{T}})} \left[ \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{R}(s_{t},\ a_{t}) \right] \\[7mm]
&= \sum_{s_{0}} \sum_{a_{0}} \sum_{s_{1}} \sum_{a_{1}} \cdots \sum_{s_{\mathrm{T}}} \sum_{a_{\mathrm{T}}} \left[ b_{0}(s_{0}) \prod_{t = 0}^{\mathrm{T}} \pi_{\theta}(a_{t} \mid s_{t}) \prod_{t = 1}^{\mathrm{T}} p(s_{t} \mid s_{t - 1},\ a_{t - 1}) \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{R}(s_{t},\ a_{t}) \right]
\end{aligned}
$$

When differentiating the policy's conditional probability $\pi_{\theta}(a_{t} \mid s_{t})$ with respect to the policy parameters $\theta$, it is convenient to rewrite the gradient as:

$$
\nabla_{\theta} \pi_{\theta}(a_{t} \mid s_{t}) = \pi_{\theta}(a_{t} \mid s_{t}) \frac{\nabla_{\theta} \pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta}(a_{t} \mid s_{t})} = \pi_{\theta}(a_{t} \mid s_{t}) \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t})
$$

Therefore the gradient of the product of policy probabilities in the objective is:

$$
\nabla_{\theta} \prod_{t = 0}^{\mathrm{T}} \pi_{\theta}(a_{t} \mid s_{t}) = \prod_{t = 0}^{\mathrm{T}} \pi_{\theta}(a_{t} \mid s_{t}) \sum_{t = 0}^{\mathrm{T}} \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t})
$$
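This identity is easy to sanity-check numerically. Below is a minimal NumPy sketch (the tabular softmax policy and the hand-picked list of state-action pairs are illustrative assumptions, not anything from the derivation) that compares $\prod_{t} \pi_{\theta} \cdot \sum_{t} \nabla_{\theta} \ln \pi_{\theta}$ against a central finite-difference gradient of the product:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def traj_prob(theta, traj):
    """Product of pi_theta(a_t | s_t) along a fixed (s, a) trajectory."""
    pi = softmax(theta)
    return np.prod([pi[s, a] for s, a in traj])

def score(theta, s, a):
    """Analytic grad_theta ln pi_theta(a | s) for a tabular softmax policy."""
    pi = softmax(theta)
    g = np.zeros_like(theta)
    g[s] = -pi[s]
    g[s, a] += 1.0
    return g

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
theta = rng.normal(size=(n_states, n_actions))
traj = [(0, 2), (1, 0), (3, 1), (2, 2)]          # fixed (s_t, a_t) pairs, illustrative

# analytic form: prod_t pi(a_t|s_t) * sum_t grad ln pi(a_t|s_t)
analytic = traj_prob(theta, traj) * sum(score(theta, s, a) for s, a in traj)

# central finite differences on every component of theta
eps, numeric = 1e-6, np.zeros_like(theta)
for i in range(n_states):
    for j in range(n_actions):
        d = np.zeros_like(theta); d[i, j] = eps
        numeric[i, j] = (traj_prob(theta + d, traj) - traj_prob(theta - d, traj)) / (2 * eps)

print(np.abs(analytic - numeric).max())          # max abs error, should be ~1e-9 or smaller
```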

Differentiating the objective with respect to the policy parameters $\theta$ gives (it really will not fit on one line, so the conditioning distributions in the expectation subscripts are omitted for now 😭):

$$
\begin{aligned}
\nabla_{\theta} J(\theta) &= \sum_{s_{0}} \sum_{a_{0}} \sum_{s_{1}} \sum_{a_{1}} \cdots \sum_{s_{\mathrm{T}}} \sum_{a_{\mathrm{T}}} \left[ b_{0}(s_{0}) \left( \nabla_{\theta} \prod_{t = 0}^{\mathrm{T}} \pi_{\theta}(a_{t} \mid s_{t}) \right) \prod_{t = 1}^{\mathrm{T}} p(s_{t} \mid s_{t - 1},\ a_{t - 1}) \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{R}(s_{t},\ a_{t}) \right] \\[7mm]
&= \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \left[ \left( \sum_{t = 0}^{\mathrm{T}} \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \right) \cdot \left( \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{R}(s_{t},\ a_{t}) \right) \right]
\end{aligned}
$$

Consider the expectation of a single product factor $\nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \mathcal{R}(s_{\tau},\ a_{\tau})$ from the gradient above when $t > \tau$:

$$
\mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \mathcal{R}(s_{\tau},\ a_{\tau}) \Big] = \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \mathcal{R}(s_{\tau},\ a_{\tau}) \Big]
$$

where the expectations over the variables after time $t$ integrate out because the term does not depend on them, and:

$$
\begin{aligned}
&\mathcal{E}_{a_{t} \sim \pi_{\theta}(\cdot \mid s_{t})} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \mathcal{R}(s_{\tau},\ a_{\tau}) \Big] = \mathcal{R}(s_{\tau},\ a_{\tau}) \mathcal{E}_{a_{t} \sim \pi_{\theta}(\cdot \mid s_{t})} \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \\[5mm]
= &\mathcal{R}(s_{\tau},\ a_{\tau}) \sum_{a_{t}} \pi_{\theta}(a_{t} \mid s_{t}) \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) = \mathcal{R}(s_{\tau},\ a_{\tau}) \nabla_{\theta} \underset{1}{\underbrace{\sum_{a_{t}} \pi_{\theta}(a_{t} \mid s_{t})}} = 0
\end{aligned}
$$
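The key fact used here, that the expected score $\mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \nabla_{\theta} \ln \pi_{\theta}(a \mid s)$ vanishes, can also be checked concretely. A tiny sketch, assuming a single-state softmax policy with random logits (an illustrative setup only):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(1)
n_actions = 5
theta = rng.normal(size=n_actions)       # logits of pi_theta(. | s) for one fixed state
pi = softmax(theta)

# grad_theta ln pi(a) for a softmax policy: e_a - pi; row a holds grad ln pi(a)
scores = np.eye(n_actions) - pi

# E_{a ~ pi}[ grad ln pi(a) ] = sum_a pi(a) * grad ln pi(a)
expected_score = pi @ scores
print(np.abs(expected_score).max())      # ~1e-16: the expected score is (numerically) zero
```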

Therefore the expectation of every product factor with $t > \tau$ is 0, and the policy gradient simplifies to:

$$
\begin{aligned}
\nabla_{\theta} J(\theta) &= \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \left[ \sum_{t = 0}^{\mathrm{T}} \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \sum_{\tau = t}^{\mathrm{T}} \gamma^{\tau} \mathcal{R}(s_{\tau},\ a_{\tau}) \right] \\[7mm]
&= \sum_{t = 0}^{\mathrm{T}} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \bigg[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \cdot \gamma^{t} \underset{q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t})}{\underbrace{\mathcal{E}_{s_{t + 1}} \mathcal{E}_{a_{t + 1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \sum_{\tau = t}^{\mathrm{T}} \gamma^{\tau - t} \mathcal{R}(s_{\tau},\ a_{\tau})}} \bigg] \\[10mm]
&= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \Big]
\end{aligned}
$$
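A sample-based (REINFORCE-style) estimator of the first line above replaces the expectations with sampled episodes: each episode contributes $\sum_{t} \gamma^{t} \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t})\, G_{t}$ with the observed reward-to-go $G_{t} = \sum_{\tau = t}^{\mathrm{T}} \gamma^{\tau - t} \mathcal{R}(s_{\tau},\ a_{\tau})$. The sketch below assumes a small randomly generated tabular MDP and a softmax policy, both purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, T, gamma = 3, 2, 5, 0.9

# toy MDP (illustrative, randomly generated): p[s, a] is a distribution over next states
p = rng.dirichlet(np.ones(nS), size=(nS, nA))
R = rng.normal(size=(nS, nA))                    # reward R(s, a)
b0 = np.ones(nS) / nS                            # initial state distribution
theta = rng.normal(size=(nS, nA))                # softmax policy logits

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def score(pi, s, a):
    """grad_theta ln pi(a | s) for a tabular softmax policy with probabilities pi."""
    g = np.zeros_like(pi)
    g[s] = -pi[s]
    g[s, a] += 1.0
    return g

def reinforce_gradient(theta, n_episodes=20000):
    """Monte Carlo estimate of grad J: sum_t gamma^t * grad ln pi(a_t|s_t) * G_t."""
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(n_episodes):
        s = rng.choice(nS, p=b0)
        traj = []
        for t in range(T + 1):                   # t = 0, ..., T
            a = rng.choice(nA, p=pi[s])
            traj.append((s, a))
            s = rng.choice(nS, p=p[s, a])
        G, returns = 0.0, [0.0] * len(traj)      # reward-to-go, built backwards
        for t in reversed(range(len(traj))):
            s_t, a_t = traj[t]
            G = R[s_t, a_t] + gamma * G
            returns[t] = G
        for t, (s_t, a_t) in enumerate(traj):
            grad += gamma**t * score(pi, s_t, a_t) * returns[t]
    return grad / n_episodes

print(reinforce_gradient(theta))
```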

Furthermore, since:

$$
\begin{aligned}
\mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \Big[ f(s_{t}) \Big] &= \sum_{s_{0}} \sum_{a_{0}} \sum_{s_{1}} \sum_{a_{1}} \cdots \sum_{s_{t}} b_{0}(s_{0}) \prod_{\tau = 0}^{t - 1} \pi_{\theta}(a_{\tau} \mid s_{\tau}) \prod_{\tau = 0}^{t - 1} p(s_{\tau + 1} \mid s_{\tau},\ a_{\tau}) f(s_{t}) \\[7mm]
&= \sum_{s_{t}} f(s_{t}) \sum_{s_{0}} \sum_{a_{0}} \sum_{s_{1}} \sum_{a_{1}} \cdots \sum_{s_{t - 1}} b_{0}(s_{0}) \prod_{\tau = 0}^{t - 1} \pi_{\theta}(a_{\tau} \mid s_{\tau}) \prod_{\tau = 0}^{t - 1} p(s_{\tau + 1} \mid s_{\tau},\ a_{\tau}) \\[7mm]
&= \sum_{s_{t}} f(s_{t}) b_{t}(s_{t}) = \mathcal{E}_{s_{t} \sim b_{t}(\cdot)} \Big[ f(s_{t}) \Big]
\end{aligned}
$$

where $b_{t}(\cdot)$ is the marginal distribution of the state $s_{t}$ under the initial state distribution $b_{0}(\cdot)$ and the policy $\pi_{\theta}$, the policy gradient can be rewritten as:

$$
\nabla_{\theta} J(\theta) = \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{t} \sim b_{t}(\cdot)} \mathcal{E}_{a_{t} \sim \pi_{\theta}(\cdot \mid s_{t})} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \Big]
$$
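In a tabular MDP this gradient can also be computed exactly: a forward recursion gives the marginals $b_{t}$, a backward recursion gives $q_{\pi_{\theta}}^{(t)}$, and the formula above is then evaluated directly. The sketch below reuses the same illustrative toy MDP and seed as the Monte Carlo sketch earlier, so with enough sampled episodes there the two results should roughly agree:

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, T, gamma = 3, 2, 5, 0.9

p = rng.dirichlet(np.ones(nS), size=(nS, nA))    # p[s, a, s'] (illustrative toy MDP)
R = rng.normal(size=(nS, nA))
b0 = np.ones(nS) / nS
theta = rng.normal(size=(nS, nA))

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

pi = softmax(theta)

# forward recursion: b_{t+1}(s') = sum_{s,a} b_t(s) pi(a|s) p(s'|s,a)
b = [b0]
for t in range(T):
    b.append(np.einsum('s,sa,sax->x', b[t], pi, p))

# backward recursion for the finite-horizon q^{(t)}(s, a)
q = [None] * (T + 1)
q[T] = R.copy()
for t in reversed(range(T)):
    v_next = (pi * q[t + 1]).sum(axis=1)          # v^{(t+1)}(s')
    q[t] = R + gamma * p @ v_next                 # p @ v: sum_{s'} p(s'|s,a) v(s')

# exact gradient: sum_t gamma^t sum_s b_t(s) sum_a pi(a|s) grad ln pi(a|s) q^{(t)}(s,a)
grad = np.zeros_like(theta)
for t in range(T + 1):
    w = b[t][:, None] * pi * q[t]                 # weight of each (s, a) term
    grad += gamma**t * (w - w.sum(axis=1, keepdims=True) * pi)

print(grad)
```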

Extending to the infinite-horizon setting:

$$
\begin{aligned}
\nabla_{\theta} J(\theta) &= \sum_{t = 0}^{\infty} \gamma^{t} \mathcal{E}_{s_{t} \sim b_{t}(\cdot)} \mathcal{E}_{a_{t} \sim \pi_{\theta}(\cdot \mid s_{t})} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) q_{\pi_{\theta}}(s_{t},\ a_{t}) \Big] \\[7mm]
&= \sum_{t = 0}^{\infty} \gamma^{t} \sum_{s} b_{t}(s) \sum_{a} \pi_{\theta}(a \mid s) \Big[ \nabla_{\theta} \ln \pi_{\theta}(a \mid s) q_{\pi_{\theta}}(s,\ a) \Big] \\[7mm]
&= \frac{1}{1 - \gamma} \sum_{s} \underset{\nu_{\pi_{\theta}}(s)}{\underbrace{\left( (1 - \gamma) \sum_{t = 0}^{\infty} \gamma^{t} b_{t}(s) \right)}} \sum_{a} \pi_{\theta}(a \mid s) \Big[ \nabla_{\theta} \ln \pi_{\theta}(a \mid s) q_{\pi_{\theta}}(s,\ a) \Big] \\[10mm]
&= \frac{1}{1 - \gamma} \sum_{s} \sum_{a} \underset{\rho_{\pi_{\theta}}(s,\ a)}{\underbrace{\nu_{\pi_{\theta}}(s) \pi_{\theta}(a \mid s)}} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a \mid s) q_{\pi_{\theta}}(s,\ a) \Big] \\[10mm]
&\propto \mathcal{E}_{s \sim \nu_{\pi_{\theta}}(\cdot)} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a \mid s) q_{\pi_{\theta}}(s,\ a) \Big] = \mathcal{E}_{(s,\ a) \sim \rho_{\pi_{\theta}}(\cdot,\ \cdot)} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a \mid s) q_{\pi_{\theta}}(s,\ a) \Big]
\end{aligned}
$$

where $\nu_{\pi_{\theta}}(\cdot)$ and $\rho_{\pi_{\theta}}(\cdot,\ \cdot)$ are, respectively, the state visitation distribution and the occupancy measure under the initial state distribution $b_{0}(\cdot)$ and the policy $\pi_{\theta}$, and they satisfy:

$$
\begin{aligned}
\nu_{\pi}(s) &= (1 - \gamma) \sum_{t = 0}^{\infty} \gamma^{t} b_{t}(s) = (1 - \gamma) b_{0}(s) + (1 - \gamma) \gamma \sum_{t = 0}^{\infty} \gamma^{t} b_{t + 1}(s) \\[7mm]
&= (1 - \gamma) b_{0}(s) + (1 - \gamma) \gamma \sum_{t = 0}^{\infty} \gamma^{t} \sum_{s'} b_{t}(s') p_{\pi}^{(1)}(s \mid s') \\[7mm]
&= (1 - \gamma) b_{0}(s) + (1 - \gamma) \gamma \sum_{t = 0}^{\infty} \gamma^{t} \sum_{s'} b_{t}(s') \sum_{a'} p(s \mid s',\ a') \pi(a' \mid s') \\[7mm]
&= (1 - \gamma) b_{0}(s) + \gamma \sum_{s'} \underset{\nu_{\pi}(s')}{\underbrace{(1 - \gamma) \sum_{t = 0}^{\infty} \gamma^{t} b_{t}(s')}} \sum_{a'} p(s \mid s',\ a') \pi(a' \mid s') \\[11mm]
&= (1 - \gamma) b_{0}(s) + \gamma \sum_{s'} \nu_{\pi}(s') \sum_{a'} p(s \mid s',\ a') \pi(a' \mid s')
\end{aligned}
$$
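This fixed-point characterization is easy to check numerically on an illustrative random tabular MDP: the sketch below computes $\nu_{\pi}$ once as a (truncated) discounted sum of the marginals $b_{t}$ and once by solving the linear equation above, and compares the two:

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma = 4, 3, 0.9

p = rng.dirichlet(np.ones(nS), size=(nS, nA))      # p[s, a, s'] (illustrative)
pi = rng.dirichlet(np.ones(nA), size=nS)           # a fixed stochastic policy pi[s, a]
b0 = rng.dirichlet(np.ones(nS))

# state-to-state kernel under pi: P[s, s'] = sum_a pi(a|s) p(s'|s, a)
P = np.einsum('sa,sax->sx', pi, p)

# (1) truncated discounted sum of the marginals b_t
nu_sum, b_t = np.zeros(nS), b0.copy()
for t in range(2000):
    nu_sum += (1 - gamma) * gamma**t * b_t
    b_t = P.T @ b_t                                # b_{t+1}(s') = sum_s b_t(s) P[s, s']

# (2) solve the fixed-point equation nu = (1 - gamma) b0 + gamma P^T nu
nu_fix = np.linalg.solve(np.eye(nS) - gamma * P.T, (1 - gamma) * b0)

print(np.abs(nu_sum - nu_fix).max(), nu_fix.sum())  # ~0 and 1.0
```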

By an argument similar to the one for the Bellman expectation equation, the mapping on the right-hand side is a contraction, so a solution of this equation exists and is unique. The normalization factor $1 - \gamma$ ensures that the state visitation distribution is a proper probability distribution:

$$
\sum_{s} \nu_{\pi_{\theta}}(s) = (1 - \gamma) \sum_{t = 0}^{\infty} \gamma^{t} \sum_{s} b_{t}(s) = (1 - \gamma) \sum_{t = 0}^{\infty} \gamma^{t} = (1 - \gamma) \frac{1}{1 - \gamma} = 1
$$

If the initial state distribution $b_{0}(\cdot)$ is the stationary distribution $\iota_{\pi_{\theta}}(\cdot)$ under $\pi_{\theta}$, i.e.:

$$
\iota_{\pi_{\theta}}(s_{t + 1}) = \sum_{s_{t}} \sum_{a_{t}} \iota_{\pi_{\theta}}(s_{t}) \pi_{\theta}(a_{t} \mid s_{t}) p(s_{t + 1} \mid s_{t},\ a_{t}) = \sum_{s_{t}} \iota_{\pi_{\theta}}(s_{t}) \sum_{a_{t}} \pi_{\theta}(a_{t} \mid s_{t}) p(s_{t + 1} \mid s_{t},\ a_{t})
$$
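For a tabular MDP, a stationary distribution of the state kernel $p_{\pi_{\theta}}^{(1)}(s' \mid s) = \sum_{a} \pi_{\theta}(a \mid s)\, p(s' \mid s,\ a)$ can be approximated by simple power iteration, as in the sketch below (random kernel and policy, illustrative only; the chain here has strictly positive transition probabilities, so the iteration converges):

```python
import numpy as np

rng = np.random.default_rng(4)
nS, nA = 4, 3

p = rng.dirichlet(np.ones(nS), size=(nS, nA))      # p[s, a, s'] (illustrative)
pi = rng.dirichlet(np.ones(nA), size=nS)           # fixed policy pi[s, a]

# state kernel under pi, then power iteration for its stationary distribution
P = np.einsum('sa,sax->sx', pi, p)
iota = np.ones(nS) / nS
for _ in range(10000):
    iota = P.T @ iota

print(np.abs(iota - P.T @ iota).max(), iota.sum())  # fixed point of the kernel, sums to 1
```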

then under the stationary distribution:

$$
\begin{aligned}
&\mathcal{E}_{s_{t} \sim \iota_{\pi_{\theta}}(\cdot)} \mathcal{E}_{a_{t} \sim \pi_{\theta}(\cdot \mid s_{t})} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \Big[ f(s_{t + 1}) \Big] = \sum_{s_{t}} \sum_{a_{t}} \sum_{s_{t + 1}} \iota_{\pi_{\theta}}(s_{t}) \pi_{\theta}(a_{t} \mid s_{t}) p(s_{t + 1} \mid s_{t},\ a_{t}) f(s_{t + 1}) \\[7mm]
= &\sum_{s_{t + 1}} f(s_{t + 1}) \sum_{s_{t}} \sum_{a_{t}} \iota_{\pi_{\theta}}(s_{t}) \pi_{\theta}(a_{t} \mid s_{t}) p(s_{t + 1} \mid s_{t},\ a_{t}) = \sum_{s_{t + 1}} \iota_{\pi_{\theta}}(s_{t + 1}) f(s_{t + 1}) = \mathcal{E}_{s_{t + 1} \sim \iota_{\pi_{\theta}}(\cdot)} \Big[ f(s_{t + 1}) \Big]
\end{aligned}
$$

Combining this with the stationary-distribution assumption, the policy gradient can be rewritten as:

$$
\begin{aligned}
\nabla_{\theta} J(\theta) &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{t} \sim b_{t}(\cdot)} \mathcal{E}_{a_{t} \sim \pi_{\theta}(\cdot \mid s_{t})} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \Big] \\[7mm]
&= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{t} \sim \iota_{\pi_{\theta}}(\cdot)} \mathcal{E}_{a_{t} \sim \pi_{\theta}(\cdot \mid s_{t})} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \Big]
\end{aligned}
$$

In the infinite-horizon setting it becomes:

$$
\nabla_{\theta} J(\theta) = \frac{1}{1 - \gamma} \mathcal{E}_{s \sim \iota_{\pi_{\theta}}(\cdot)} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a \mid s) q_{\pi_{\theta}}(s,\ a) \Big] \propto \mathcal{E}_{s \sim \iota_{\pi_{\theta}}(\cdot)} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a \mid s) q_{\pi_{\theta}}(s,\ a) \Big]
$$

Policy Gradient with a Baseline

Using a baseline function $b$, rewrite the policy gradient as:

$$
\nabla_{\theta} J(\theta) \triangleq \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \left[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \Big( q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) - b \Big) \right]
$$

Under the stationary-distribution assumption and in the infinite-horizon setting:

$$
\nabla_{\theta} J(\theta) \triangleq \mathcal{E}_{s \sim \iota_{\pi_{\theta}}(\cdot)} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \Big[ \Big( q_{\pi_{\theta}}(s,\ a) - b \Big) \nabla_{\theta} \ln \pi_{\theta}(a \mid s) \Big]
$$

Here the baseline $b$ is not a function of the action $a$ (or $a_{t}$), so:

$$
\mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} b \nabla_{\theta} \ln \pi_{\theta}(a \mid s) = b \sum_{a} \pi_{\theta}(a \mid s) \nabla_{\theta} \ln \pi_{\theta}(a \mid s) = b \nabla_{\theta} \sum_{a} \pi_{\theta}(a \mid s) = 0
$$

Therefore adding the baseline leaves the policy gradient unchanged, and the sample-based estimate remains unbiased; however, the variance of the estimate does depend on the baseline:

$$
\begin{aligned}
\mathrm{Var} &= \sum_{i = 1}^{d} \mathcal{D}_{a \sim \pi_{\theta}(\cdot \mid s)} \left[ \Big( q_{\pi_{\theta}}(s,\ a) - b \Big) \frac{\partial}{\partial \theta_{i}} \ln \pi_{\theta}(a \mid s) \right] \\[5mm]
&= \sum_{i = 1}^{d} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \left[ \Big( q_{\pi_{\theta}}(s,\ a) - b \Big)^{2} \left( \frac{\partial}{\partial \theta_{i}} \ln \pi_{\theta}(a \mid s) \right)^{2} \right] - \sum_{i = 1}^{d} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)}^{2} \left[ \Big( q_{\pi_{\theta}}(s,\ a) - b \Big) \frac{\partial}{\partial \theta_{i}} \ln \pi_{\theta}(a \mid s) \right] \\[5mm]
&= \sum_{i = 1}^{d} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \left[ \Big( q_{\pi_{\theta}}(s,\ a) - b \Big)^{2} \left( \frac{\partial}{\partial \theta_{i}} \ln \pi_{\theta}(a \mid s) \right)^{2} \right] - \sum_{i = 1}^{d} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)}^{2} \left[ q_{\pi_{\theta}}(s,\ a) \frac{\partial}{\partial \theta_{i}} \ln \pi_{\theta}(a \mid s) \right]
\end{aligned}
$$

The first and second derivatives of the variance with respect to the baseline are:

$$
\begin{gathered}
\frac{d}{db} \mathrm{Var} = 2 \sum_{i = 1}^{d} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \left[ \Big( b - q_{\pi_{\theta}}(s,\ a) \Big) \left( \frac{\partial}{\partial \theta_{i}} \ln \pi_{\theta}(a \mid s) \right)^{2} \right] \\[7mm]
\frac{d^{2}}{db^{2}} \mathrm{Var} = 2 \sum_{i = 1}^{d} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \left( \frac{\partial}{\partial \theta_{i}} \ln \pi_{\theta}(a \mid s) \right)^{2} \ge 0
\end{gathered}
$$

Setting the first derivative to zero gives the optimal baseline:

$$
b^{\star} = \sum_{i = 1}^{d} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \left[ q_{\pi_{\theta}}(s,\ a) \left( \frac{\partial}{\partial \theta_{i}} \ln \pi_{\theta}(a \mid s) \right)^{2} \right] \bigg/ \sum_{i = 1}^{d} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \left( \frac{\partial}{\partial \theta_{i}} \ln \pi_{\theta}(a \mid s) \right)^{2}
$$

In practice one can take the baseline to be $b = v_{\pi_{\theta}}(s)$, which already reduces the variance of the policy gradient estimate.
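The effect of the baseline on the variance can be checked directly for a single state with a softmax policy. The sketch below (random logits and action values, all illustrative) computes the exact per-component variance of the single-sample estimator under $b = 0$, $b = v_{\pi_{\theta}}(s)$, and the optimal $b^{\star}$ from the formula above; by construction the minimum is attained at $b^{\star}$, and $b = v_{\pi_{\theta}}(s)$ is typically close to it:

```python
import numpy as np

rng = np.random.default_rng(5)
nA = 4
theta = rng.normal(size=nA)                 # logits of pi(. | s) for one fixed state
q = rng.normal(size=nA) + 3.0               # action values q(s, a), shifted to be positive

pi = np.exp(theta - theta.max())
pi /= pi.sum()
scores = np.eye(nA) - pi                    # row a: grad_theta ln pi(a | s)

def grad_variance(b):
    """Exact sum over components i of Var_a[(q(s,a) - b) * d/d theta_i ln pi(a|s)]."""
    g = (q - b)[:, None] * scores           # per-action single-sample gradient
    mean = pi @ g                           # expectation over a ~ pi
    second = pi @ (g**2)                    # expectation of the squared components
    return (second - mean**2).sum()

v = pi @ q                                  # v(s) = E_a q(s, a)
weights = (scores**2).sum(axis=1)           # sum_i (d/d theta_i ln pi(a|s))^2 per action
b_star = (pi @ (q * weights)) / (pi @ weights)

# compare the variance of the single-sample estimator under different baselines
for name, b in [("b = 0", 0.0), ("b = v(s)", v), ("b = b*", b_star)]:
    print(f"{name:10s} variance = {grad_variance(b):.4f}")
```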

💡 Another way to understand the role of the baseline: when a single sample is used for a stochastic gradient update under the original policy gradient:

$$
\theta \leftarrow \theta + \alpha \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t})
$$

Suppose, for example, that all action values are positive. Even when the value of $a_{t}$ is below the average value of all actions (e.g. $v_{\pi_{\theta}}^{(t)}(s_{t})$), the update above still tends to increase $\pi_{\theta}(a_{t} \mid s_{t})$ and correspondingly decrease the sampling probabilities of other, better actions. After adding the baseline $v_{\pi_{\theta}}^{(t)}(s_{t})$:

$$
\theta \leftarrow \theta + \alpha \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \Big[ q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) - v_{\pi_{\theta}}^{(t)}(s_{t}) \Big] = \theta + \alpha \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \Big[ q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) - \mathcal{E}_{a_{t} \sim \pi_{\theta}(\cdot \mid s_{t})} q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \Big]
$$

If $q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) < v_{\pi_{\theta}}^{(t)}(s_{t})$, the update tends to decrease $\pi_{\theta}(a_{t} \mid s_{t})$ and increase the sampling probabilities of other, better actions. Compared with the original policy gradient without a baseline, this updates the policy more effectively under stochastic gradients.
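The following toy calculation (one state, three actions with hand-picked positive values, and an assumed learning rate, all illustrative) shows exactly this effect: a single update on the worst action increases its probability without the baseline and decreases it with the baseline:

```python
import numpy as np

nA, alpha = 3, 0.5
theta0 = np.zeros(nA)                          # uniform initial policy
q = np.array([5.0, 3.0, 1.0])                  # all action values positive (illustrative)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

pi0 = softmax(theta0)
v = pi0 @ q                                    # v(s) = 3.0, the average action value
a_t = 2                                        # sampled the worst action (q = 1 < v)
score = -pi0.copy()                            # grad_theta ln pi(a_t | s)
score[a_t] += 1.0

theta_no_baseline = theta0 + alpha * score * q[a_t]
theta_baseline = theta0 + alpha * score * (q[a_t] - v)

print("pi(a_t|s) before:          ", pi0[a_t])
print("pi(a_t|s) without baseline:", softmax(theta_no_baseline)[a_t])  # goes up
print("pi(a_t|s) with baseline:   ", softmax(theta_baseline)[a_t])     # goes down
```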

