TRPO and PPO


Trust Region Policy Optimization (TRPO)

Since actor-critic methods update the parameters directly with the policy gradient, a large step size can make the policy significantly worse. To guarantee monotonic improvement of performance during optimization, i.e. $J(\theta') \ge J(\theta)$, rewrite the objective $J(\theta) = \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \Big[ v_{\pi_{\theta}}^{(0)} (s_{0}) \Big]$ as:

$$
\begin{aligned} J(\theta) &= \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi_{\theta'}(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi_{\theta'}(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{\mathrm{T}} \sim p(\cdot \mid s_{\mathrm{T} - 1},\ a_{\mathrm{T} - 1})} \mathcal{E}_{a_{\mathrm{T}} \sim \pi_{\theta'}(\cdot \mid s_{\mathrm{T}})} \left[ \sum_{t = 0}^{\mathrm{T}} \gamma^{t} v_{\pi_{\theta}}^{(t)}(s_{t}) - \sum_{t = 1}^{\mathrm{T}} \gamma^{t} v_{\pi_{\theta}}^{(t)}(s_{t}) \right] \\[7mm] &= -\mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi_{\theta'}(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi_{\theta'}(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{\mathrm{T}} \sim p(\cdot \mid s_{\mathrm{T} - 1},\ a_{\mathrm{T} - 1})} \mathcal{E}_{a_{\mathrm{T}} \sim \pi_{\theta'}(\cdot \mid s_{\mathrm{T}})} \left[ \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \Big( \gamma v_{\pi_{\theta}}^{(t + 1)}(s_{t + 1}) - v_{\pi_{\theta}}^{(t)}(s_{t}) \Big) \right] \end{aligned}
$$
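Both equalities rest on the telescoping structure of the sums; as a short check, under the convention that the value beyond the horizon is zero ($v_{\pi_{\theta}}^{(\mathrm{T} + 1)}(s_{\mathrm{T} + 1}) \equiv 0$),

$$
\sum_{t = 0}^{\mathrm{T}} \gamma^{t} \Big( \gamma v_{\pi_{\theta}}^{(t + 1)}(s_{t + 1}) - v_{\pi_{\theta}}^{(t)}(s_{t}) \Big) = \sum_{t = 1}^{\mathrm{T} + 1} \gamma^{t} v_{\pi_{\theta}}^{(t)}(s_{t}) - \sum_{t = 0}^{\mathrm{T}} \gamma^{t} v_{\pi_{\theta}}^{(t)}(s_{t}) = - v_{\pi_{\theta}}^{(0)}(s_{0})
$$

so taking the expectation over $s_{0} \sim b_{0}(\cdot)$ and negating recovers $J(\theta)$.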

The trajectories in the expression above are generated by $\pi_{\theta'}$. Computing the difference of the objective before and after the parameter update:

$$
\begin{aligned} J(\theta') - J(\theta) &= \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \left[ \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{R}(s_{t},\ a_{t}) \right] + \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \left[ \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \Big( \gamma v_{\pi_{\theta}}^{(t + 1)}(s_{t + 1}) - v_{\pi_{\theta}}^{(t)}(s_{t}) \Big) \right] \\[7mm] &= \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \left[ \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \Big( \mathcal{R}(s_{t},\ a_{t}) + \gamma v_{\pi_{\theta}}^{(t + 1)}(s_{t + 1}) - v_{\pi_{\theta}}^{(t)}(s_{t}) \Big) \right] \\[7mm] &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \mathcal{E}_{s_{t + 1}} \Big[ \mathcal{R}(s_{t},\ a_{t}) + \gamma v_{\pi_{\theta}}^{(t + 1)}(s_{t + 1}) - v_{\pi_{\theta}}^{(t)}(s_{t}) \Big] \\[7mm] &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \Big[ \underset{q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t})}{\underbrace{\mathcal{R}(s_{t},\ a_{t}) + \gamma \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} v_{\pi_{\theta}}^{(t + 1)}(s_{t + 1})}} - v_{\pi_{\theta}}^{(t)}(s_{t}) \Big] \\[7mm] &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \Big[ q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) - v_{\pi_{\theta}}^{(t)}(s_{t}) \Big] = \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \Big[ d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \Big] \end{aligned}
$$

Evaluating the expression above would require generating trajectories with the new policy $\pi_{\theta'}$ before the objective can be computed and optimized, which is hard to do in practice. When $\pi_{\theta'}$ and $\pi_{\theta}$ are very close, however, the difference can be approximated as:

$$
\begin{aligned} L(\theta' \mid \theta) &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi_{\theta'}(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi_{\theta'}(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{t} \sim p(\cdot \mid s_{t - 1},\ a_{t - 1})} \mathcal{E}_{a_{t} \sim \pi_{\theta'}(\cdot \mid s_{t})} \Big[ d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \Big] \\[7mm] &\approx \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi_{\theta}(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi_{\theta}(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{t} \sim p(\cdot \mid s_{t - 1},\ a_{t - 1})} \mathcal{E}_{a_{t} \sim \pi_{\theta'}(\cdot \mid s_{t})} \Big[ d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \Big] \\[7mm] &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi_{\theta}(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi_{\theta}(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{t} \sim p(\cdot \mid s_{t - 1},\ a_{t - 1})} \mathcal{E}_{a_{t} \sim \pi_{\theta}(\cdot \mid s_{t})} \left[ \frac{\pi_{\theta'}(a_{t} \mid s_{t})}{\pi_{\theta}(a_{t} \mid s_{t})} d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \right] \end{aligned}
$$

The new policy $\pi_{\theta'}$ can now be estimated and optimized from data sampled with the old policy $\pi_{\theta}$, while the KL divergence is used to measure the distance between the two policies:

$$
\begin{aligned} d_{\mathrm{KL}}(\pi,\ \pi') &= \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{\mathrm{T}} \sim p(\cdot \mid s_{\mathrm{T} - 1},\ a_{\mathrm{T} - 1})} \mathcal{E}_{a_{\mathrm{T}} \sim \pi(\cdot \mid s_{\mathrm{T}})} \left[ \sum_{t = 0}^{\mathrm{T}} \gamma^{t} D_{\mathrm{KL}} \Big( \pi(\cdot \mid s_{t})\ \|\ \pi'(\cdot \mid s_{t}) \Big) \right] \\[7mm] &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{t} \sim p(\cdot \mid s_{t - 1},\ a_{t - 1})} \left[ D_{\mathrm{KL}} \Big( \pi(\cdot \mid s_{t})\ \|\ \pi'(\cdot \mid s_{t}) \Big) \right] \\[7mm] &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{t} \sim p(\cdot \mid s_{t - 1},\ a_{t - 1})} \mathcal{E}_{a_{t} \sim \pi(\cdot \mid s_{t})} \left[ \ln \frac{\pi(a_{t} \mid s_{t})}{\pi'(a_{t} \mid s_{t})} \right] \end{aligned}
$$

The optimization problem is then approximated as optimizing over sampled trajectories within a $\delta$-KL ball (the trust region):

$$
\max_{\theta'} L(\theta' \mid \theta) \approx \sum_{t = 0}^{\mathrm{T}} \gamma^{t} d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \frac{\pi_{\theta'}(a_{t} \mid s_{t})}{\pi_{\theta}(a_{t} \mid s_{t})} \quad \mathrm{s.t.}\ \ d_{\mathrm{KL}}(\pi_{\theta},\ \pi_{\theta'}) \approx \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \ln \frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta'}(a_{t} \mid s_{t})} \le \delta
$$
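As a concrete illustration, here is a minimal sketch of how the surrogate objective and the KL constraint might be estimated from a single trajectory collected with the old policy. The full TRPO update (natural gradient via conjugate gradient plus a backtracking line search) is omitted, and the function name and PyTorch usage are illustrative assumptions rather than a reference implementation.

```python
import torch

def trpo_surrogate_and_kl(logp_new, logp_old, advantages, gammas, delta=0.01):
    """Sample-based estimates of L(theta' | theta) and the discounted KL term.

    logp_new  : log pi_{theta'}(a_t | s_t), differentiable w.r.t. theta'
    logp_old  : log pi_{theta}(a_t | s_t), detached (old policy)
    advantages: estimated d^{(t)}(s_t, a_t) under the old policy
    gammas    : discount weights gamma ** t for each step
    """
    ratio = torch.exp(logp_new - logp_old)            # pi_theta' / pi_theta
    surrogate = (gammas * ratio * advantages).sum()   # objective to maximize
    kl_est = (gammas * (logp_old - logp_new)).sum()   # single-trajectory KL estimate
    within_trust_region = kl_est.item() <= delta
    return surrogate, kl_est, within_trust_region
```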

Since the advantage function in the objective is unknown, it has to be estimated; it can be approximated from sampled trajectories and the value function:

$$
\begin{aligned} d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) &= q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) - v_{\pi_{\theta}}^{(t)}(s_{t}) = \mathcal{E}_{r_{t + 1}} r_{t + 1} + \gamma \mathcal{E}_{s_{t + 1}} v_{\pi_{\theta}}^{(t + 1)}(s_{t + 1}) - v_{\pi_{\theta}}^{(t)}(s_{t}) \\[5mm] &= \mathcal{E}_{r_{t + 1}} \mathcal{E}_{s_{t + 1}} \Big[ r_{t + 1} + \gamma v_{\pi_{\theta}}^{(t + 1)}(s_{t + 1}) - v_{\pi_{\theta}}^{(t)}(s_{t}) \Big] \Leftarrow \Delta_{t}^{(1)} \\[5mm] &= \mathcal{E}_{r_{t + 1}} \mathcal{E}_{s_{t + 1}} \Big[ r_{t + 1} + \gamma \mathcal{E}_{a_{t + 1}} q_{\pi_{\theta}}^{(t + 1)}(s_{t + 1},\ a_{t + 1}) - v_{\pi_{\theta}}^{(t)}(s_{t}) \Big] \\[5mm] &= \mathcal{E}_{r_{t + 1}} \mathcal{E}_{s_{t + 1}} \Big[ r_{t + 1} + \gamma \mathcal{E}_{a_{t + 1}} \Big[ \mathcal{E}_{r_{t + 2}} r_{t + 2} + \gamma \mathcal{E}_{s_{t + 2}} v_{\pi_{\theta}}^{(t + 2)}(s_{t + 2}) \Big] - v_{\pi_{\theta}}^{(t)}(s_{t}) \Big] \\[5mm] &= \mathcal{E}_{r_{t + 1}} \mathcal{E}_{s_{t + 1}} \mathcal{E}_{a_{t + 1}} \mathcal{E}_{r_{t + 2}} \mathcal{E}_{s_{t + 2}} \Big[ r_{t + 1} + \gamma r_{t + 2} + \gamma^{2} v_{\pi_{\theta}}^{(t + 2)}(s_{t + 2}) - v_{\pi_{\theta}}^{(t)}(s_{t}) \Big] \Leftarrow \Delta_{t}^{(2)} \\[5mm] &= \cdots = \mathcal{E}_{r_{t + 1}} \mathcal{E}_{s_{t + 1}} \mathcal{E}_{a_{t + 1}} \mathcal{E}_{r_{t + 2}} \mathcal{E}_{s_{t + 2}} \cdots \mathcal{E}_{r_{t + k}} \mathcal{E}_{s_{t + k}} \left[ \sum_{\tau = 0}^{k - 1} \gamma^{\tau} r_{t + 1 + \tau} + \gamma^{k} v_{\pi_{\theta}}^{(t + k)}(s_{t + k}) - v_{\pi_{\theta}}^{(t)}(s_{t}) \right] \Leftarrow \Delta_{t}^{(k)} \end{aligned}
$$

In the infinite-horizon setting:

$$
d_{\pi_{\theta}}(s_{t},\ a_{t}) = q_{\pi_{\theta}}(s_{t},\ a_{t}) - v_{\pi_{\theta}}(s_{t}) \approx \delta_{t}^{(k)} = \sum_{\tau = 0}^{k - 1} \gamma^{\tau} r_{t + 1 + \tau} + \gamma^{k} v_{\pi_{\theta}}(s_{t + k}) - v_{\pi_{\theta}}(s_{t})
$$

Combining this with the TD($\lambda$) idea, the advantage function is estimated with temporal-difference terms over different numbers of steps:

$$
d_{\pi_{\theta}}(s_{t},\ a_{t}) \approx (1 - \lambda) \sum_{k = 1}^{\mathrm{T} - t - 1} \lambda^{k - 1} \delta_{t}^{(k)} + \lambda^{\mathrm{T} - t - 1} \delta_{t}^{(\mathrm{T} - t)} \overset{\mathrm{T} \to \infty}{\longrightarrow} (1 - \lambda) \sum_{k = 1}^{\infty} \lambda^{k - 1} \delta_{t}^{(k)}
$$

The value function $v_{\pi_{\theta}}(s_{t})$ used in this estimate can be approximated and learned with a V network $v_{w}(s)$.
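A minimal sketch of this estimator follows, assuming rewards and the predictions of a learned V network are available as NumPy arrays; it uses the usual backward recursion over one-step residuals, which is an equivalent form of the $\lambda$-weighted sum above (the helper name `gae_advantages` is an illustrative choice):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Lambda-weighted advantage estimates for one trajectory.

    rewards: r_1, ..., r_T            (length T)
    values : v_w(s_0), ..., v_w(s_T)  (length T + 1; last entry 0 if terminal)
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        # one-step TD residual: delta_t = r_{t+1} + gamma * v(s_{t+1}) - v(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        last = delta + gamma * lam * last
        adv[t] = last
    return adv
```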

Proximal Policy Optimization (PPO)

Because TRPO carries a trust-region constraint, its optimization procedure is rather involved. The PPO algorithm instead folds the constraint into the objective and solves the problem approximately.

PPO-Penalty

PPO-Penalty uses the method of Lagrange multipliers to move the KL-divergence constraint of the original TRPO algorithm into the objective:

$$
\begin{aligned} L(\theta' \mid \theta) &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \left[ \frac{\pi_{\theta'}(a_{t} \mid s_{t})}{\pi_{\theta}(a_{t} \mid s_{t})} d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \right] - \beta d_{\mathrm{KL}}(\pi_{\theta},\ \pi_{\theta'}) \\[7mm] &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \left[ \frac{\pi_{\theta'}(a_{t} \mid s_{t})}{\pi_{\theta}(a_{t} \mid s_{t})} d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) - \beta D_{\mathrm{KL}} \Big( \pi_{\theta}(\cdot \mid s_{t})\ \|\ \pi_{\theta'}(\cdot \mid s_{t}) \Big) \right] \end{aligned}
$$

Approximating with sampled trajectories:

$$
L(\theta' \mid \theta) \approx \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \left[ \frac{\pi_{\theta'}(a_{t} \mid s_{t})}{\pi_{\theta}(a_{t} \mid s_{t})} d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) - \beta D_{\mathrm{KL}} \Big( \pi_{\theta}(\cdot \mid s_{t})\ \|\ \pi_{\theta'}(\cdot \mid s_{t}) \Big) \right]
$$

The optimization problem then becomes the unconstrained $\max_{\theta'} L(\theta' \mid \theta)$. To limit the gap between the learned policy and the policy of the previous round, set:

$$
\beta_{k + 1} \leftarrow \left\{ \begin{matrix} \dfrac{\beta_{k}}{2} & d_{k} < \dfrac{2}{3} \epsilon \\[5mm] 2 \beta_{k} & d_{k} > \dfrac{3}{2} \epsilon \\[5mm] \beta_{k} & \dfrac{2}{3} \epsilon \le d_{k} \le \dfrac{3}{2} \epsilon \end{matrix} \right.
$$

where $d_{k}$ is the KL divergence between the policies of round $k$ and round $k + 1$, and $\epsilon$ is a preset hyperparameter.
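A minimal sketch of the penalized objective and the adaptive penalty coefficient, assuming the per-state KL terms are computed elsewhere from the two policy distributions; the function names and the PyTorch usage are illustrative assumptions:

```python
import torch

def ppo_penalty_loss(logp_new, logp_old, advantages, gammas, kl_per_state, beta):
    """Sampled PPO-Penalty objective L(theta' | theta), to be maximized.

    kl_per_state: D_KL(pi_theta(. | s_t) || pi_theta'(. | s_t)) at each step
    """
    ratio = torch.exp(logp_new - logp_old)   # pi_theta' / pi_theta
    return (gammas * (ratio * advantages - beta * kl_per_state)).sum()

def update_beta(beta, d_k, eps):
    """Adaptive KL penalty coefficient, following the rule above."""
    if d_k < 2.0 / 3.0 * eps:
        return beta / 2.0   # KL too small: relax the penalty
    if d_k > 3.0 / 2.0 * eps:
        return beta * 2.0   # KL too large: strengthen the penalty
    return beta             # within the band: keep beta unchanged
```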

PPO-Clip

Like PPO-Penalty, PPO-Clip also builds the constraint on the size of the policy update into the objective, but in a slightly different form:

$$
L(\theta' \mid \theta) = \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \min \left[ \frac{\pi_{\theta'}(a_{t} \mid s_{t})}{\pi_{\theta}(a_{t} \mid s_{t})} d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}),\ \operatorname{clip} \left( \frac{\pi_{\theta'}(a_{t} \mid s_{t})}{\pi_{\theta}(a_{t} \mid s_{t})},\ 1 - \epsilon,\ 1 + \epsilon \right) d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \right]
$$

where $\epsilon$ is a hyperparameter; with sampled trajectories this can be approximated as:

$$
L(\theta' \mid \theta) \approx \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \min \left[ \frac{\pi_{\theta'}(a_{t} \mid s_{t})}{\pi_{\theta}(a_{t} \mid s_{t})} d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}),\ \operatorname{clip} \left( \frac{\pi_{\theta'}(a_{t} \mid s_{t})}{\pi_{\theta}(a_{t} \mid s_{t})},\ 1 - \epsilon,\ 1 + \epsilon \right) d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \right]
$$

Here the clipping function $\operatorname{clip}(x,\ l,\ r) = \max(\min(x,\ r),\ l)$ restricts $x$ to the interval $[l,\ r]$, and the outer $\min$ imposes the following restrictions:

| Case | Restriction |
| --- | --- |
| $d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) > 0$ | When $\dfrac{\pi_{\theta'}(a_{t} \mid s_{t})}{\pi_{\theta}(a_{t} \mid s_{t})} > 1 + \epsilon$, the gradient of this term degenerates to $0$ |
| $d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) < 0$ | When $\dfrac{\pi_{\theta'}(a_{t} \mid s_{t})}{\pi_{\theta}(a_{t} \mid s_{t})} < 1 - \epsilon$, the gradient of this term degenerates to $0$ |
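As an illustration, a minimal sketch of the sampled clipped objective; the function name and the PyTorch usage are illustrative assumptions rather than a reference implementation, and the per-step discount weights follow the formula above:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, gammas, eps=0.2):
    """Sampled PPO-Clip objective (to be maximized).

    logp_new  : log pi_{theta'}(a_t | s_t)  (differentiable)
    logp_old  : log pi_{theta}(a_t | s_t)   (detached, old policy)
    advantages: estimated d^{(t)}(s_t, a_t) under the old policy
    gammas    : discount weights gamma ** t
    """
    ratio = torch.exp(logp_new - logp_old)                      # pi_theta' / pi_theta
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # the outer min removes the gradient once the ratio leaves [1 - eps, 1 + eps]
    # in the direction that would otherwise keep increasing the objective
    return (gammas * torch.min(unclipped, clipped)).sum()
```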
