TRPO and PPO

Trust Region Policy Optimization (TRPO)

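Because actor-critic methods update the parameters directly along the policy gradient, a large step size can make the policy significantly worse. To guarantee that performance improves monotonically during optimization, i.e. $J(\theta') \ge J(\theta)$, the objective $J(\theta) = \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \Big[ v_{\pi_{\theta}}^{(0)} (s_{0}) \Big]$ is rewritten as an expectation over trajectories of a second policy $\pi_{\theta'}$. The integrand depends only on $s_{0}$, so the extra expectations change nothing, and the second equality uses the convention $v_{\pi_{\theta}}^{(\mathrm{T} + 1)} \equiv 0$ beyond the horizon: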

\begin{aligned} J(\theta) &= \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi_{\theta'}(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi_{\theta'}(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{\mathrm{T}} \sim p(\cdot \mid s_{\mathrm{T} - 1},\ a_{\mathrm{T} - 1})} \mathcal{E}_{a_{\mathrm{T}} \sim \pi_{\theta'}(\cdot \mid s_{\mathrm{T}})} \left[ \sum_{t = 0}^{\mathrm{T}} \gamma^{t} v_{\pi_{\theta}}^{(t)}(s_{t}) - \sum_{t = 1}^{\mathrm{T}} \gamma^{t} v_{\pi_{\theta}}^{(t)}(s_{t}) \right] \\[7mm] &= -\mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi_{\theta'}(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi_{\theta'}(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{\mathrm{T}} \sim p(\cdot \mid s_{\mathrm{T} - 1},\ a_{\mathrm{T} - 1})} \mathcal{E}_{a_{\mathrm{T}} \sim \pi_{\theta'}(\cdot \mid s_{\mathrm{T}})} \left[ \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \Big( \gamma v_{\pi_{\theta}}^{(t + 1)}(s_{t + 1}) - v_{\pi_{\theta}}^{(t)}(s_{t}) \Big) \right] \end{aligned}

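The trajectories in this form are generated by $\pi_{\theta'}$. Since $J(\theta')$ is the expected discounted reward along the same trajectories, the difference of the objective before and after the parameter update is: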

\begin{aligned} J(\theta') - J(\theta) &= \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \left[ \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{R}(s_{t},\ a_{t}) \right] + \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \left[ \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \Big( \gamma v_{\pi_{\theta}}^{(t + 1)}(s_{t + 1}) - v_{\pi_{\theta}}^{(t)}(s_{t}) \Big) \right] \\[7mm] &= \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \left[ \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \Big( \mathcal{R}(s_{t},\ a_{t}) + \gamma v_{\pi_{\theta}}^{(t + 1)}(s_{t + 1}) - v_{\pi_{\theta}}^{(t)}(s_{t}) \Big) \right] \\[7mm] &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \mathcal{E}_{s_{t + 1}} \Big[ \mathcal{R}(s_{t},\ a_{t}) + \gamma v_{\pi_{\theta}}^{(t + 1)}(s_{t + 1}) - v_{\pi_{\theta}}^{(t)}(s_{t}) \Big] \\[7mm] &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \Big[ \underset{q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t})}{\underbrace{\mathcal{R}(s_{t},\ a_{t}) + \gamma \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} v_{\pi_{\theta}}^{(t + 1)}(s_{t + 1})}} - v_{\pi_{\theta}}^{(t)}(s_{t}) \Big] \\[7mm] &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \Big[ q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) - v_{\pi_{\theta}}^{(t)}(s_{t}) \Big] = \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots 
\mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \Big[ d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \Big] \end{aligned}
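The identity above can be checked numerically on a small example. The sketch below uses a hypothetical 2-state, 2-action MDP whose transition probabilities, rewards and policies are made up for illustration. It computes $J(\theta') - J(\theta)$ by backward induction and compares it with the discounted sum of expected advantages of $\pi_{\theta}$ under the state distribution of $\pi_{\theta'}$.

```python
GAMMA = 0.9
T = 2  # time steps t = 0, 1, ..., T

# Hypothetical MDP: every number below is an assumption made for this check.
P = [[[0.8, 0.2], [0.1, 0.9]],   # P[s][a][s2]: transition probabilities
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[1.0, 0.0], [0.5, 2.0]]     # R[s][a]: expected immediate reward
B0 = [0.6, 0.4]                  # initial state distribution b_0


def values(pi):
    """Backward induction for q[t][s][a] and v[t][s], with v[T + 1] = 0."""
    v = [[0.0, 0.0] for _ in range(T + 2)]
    q = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(T + 1)]
    for t in range(T, -1, -1):
        for s in range(2):
            for a in range(2):
                nxt = sum(P[s][a][s2] * v[t + 1][s2] for s2 in range(2))
                q[t][s][a] = R[s][a] + GAMMA * nxt
            v[t][s] = sum(pi[s][a] * q[t][s][a] for a in range(2))
    return q, v


def objective(pi):
    """J = E_{s0 ~ b0}[v^(0)(s0)]."""
    _, v = values(pi)
    return sum(B0[s] * v[0][s] for s in range(2))


def expected_advantage_sum(pi_new, pi_old):
    """sum_t gamma^t E_{(s_t, a_t) ~ pi_new}[q_old(s_t, a_t) - v_old(s_t)]."""
    q, v = values(pi_old)
    rho = B0[:]  # state distribution at time t under pi_new
    total = 0.0
    for t in range(T + 1):
        total += GAMMA ** t * sum(
            rho[s] * pi_new[s][a] * (q[t][s][a] - v[t][s])
            for s in range(2) for a in range(2))
        rho = [sum(rho[s] * pi_new[s][a] * P[s][a][s2]
                   for s in range(2) for a in range(2))
               for s2 in range(2)]
    return total


pi_old = [[0.5, 0.5], [0.5, 0.5]]  # pi[s][a]
pi_new = [[0.7, 0.3], [0.2, 0.8]]
lhs = objective(pi_new) - objective(pi_old)
rhs = expected_advantage_sum(pi_new, pi_old)
print(lhs, rhs)  # the two numbers agree up to floating-point error
```

The check relies on exact expectations, so it only works for MDPs small enough to enumerate; with sampled trajectories the two sides would agree only on average.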

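Evaluating this expression requires generating trajectories with the new policy $\pi_{\theta'}$ before the objective can be computed and optimized, which is hard to do in practice. When $\pi_{\theta'}$ is very close to $\pi_{\theta}$, the difference can instead be approximated by sampling the states with $\pi_{\theta}$ and correcting the last action with an importance weight: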

\begin{aligned} L(\theta' \mid \theta) &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi_{\theta'}(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi_{\theta'}(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{t} \sim p(\cdot \mid s_{t - 1},\ a_{t - 1})} \mathcal{E}_{a_{t} \sim \pi_{\theta'}(\cdot \mid s_{t})} \Big[ d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \Big] \\[7mm] &\approx \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi_{\theta}(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi_{\theta}(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{t} \sim p(\cdot \mid s_{t - 1},\ a_{t - 1})} \mathcal{E}_{a_{t} \sim \pi_{\theta'}(\cdot \mid s_{t})} \Big[ d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \Big] \\[7mm] &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi_{\theta}(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi_{\theta}(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{t} \sim p(\cdot \mid s_{t - 1},\ a_{t - 1})} \mathcal{E}_{a_{t} \sim \pi_{\theta}(\cdot \mid s_{t})} \left[ \frac{\pi_{\theta'}(a_{t} \mid s_{t})}{\pi_{\theta}(a_{t} \mid s_{t})} d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \right] \end{aligned}

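The new policy $\pi_{\theta'}$ can now be estimated and optimized from data sampled with the old policy $\pi_{\theta}$, while the distance between the two policies is measured by the KL divergence: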

\begin{aligned} d_{\mathrm{KL}}(\pi,\ \pi') &= \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{\mathrm{T}} \sim p(\cdot \mid s_{\mathrm{T} - 1},\ a_{\mathrm{T} - 1})} \mathcal{E}_{a_{\mathrm{T}} \sim \pi(\cdot \mid s_{\mathrm{T}})} \left[ \sum_{t = 0}^{\mathrm{T}} \gamma^{t} D_{\mathrm{KL}} \Big( \pi(\cdot \mid s_{t})\ \|\ \pi'(\cdot \mid s_{t}) \Big) \right] \\[7mm] &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{t} \sim p(\cdot \mid s_{t - 1},\ a_{t - 1})} \left[ D_{\mathrm{KL}} \Big( \pi(\cdot \mid s_{t})\ \|\ \pi'(\cdot \mid s_{t}) \Big) \right] \\[7mm] &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{t} \sim p(\cdot \mid s_{t - 1},\ a_{t - 1})} \mathcal{E}_{a_{t} \sim \pi(\cdot \mid s_{t})} \left[ \ln \frac{\pi(a_{t} \mid s_{t})}{\pi'(a_{t} \mid s_{t})} \right] \end{aligned}
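For discrete action spaces the per-state divergence inside this definition can be computed exactly. A minimal sketch, assuming policies are stored as dictionaries from a state to a list of action probabilities (a hypothetical representation chosen only for this example):

```python
import math


def kl_categorical(p, q):
    """D_KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0.0)


def discounted_kl(states, pi, pi_new, gamma):
    """sum_t gamma^t * D_KL(pi(.|s_t) || pi_new(.|s_t)) along one state sequence."""
    return sum(gamma ** t * kl_categorical(pi[s], pi_new[s])
               for t, s in enumerate(states))


# Hypothetical policies: they differ only in state 0.
pi = {0: [0.5, 0.5], 1: [0.5, 0.5]}
pi_new = {0: [0.9, 0.1], 1: [0.5, 0.5]}
states = [0, 1, 0]  # states visited by one trajectory sampled with pi

d = discounted_kl(states, pi, pi_new, gamma=0.9)
print(d)
```

The divergence is not symmetric: `kl_categorical(p, q)` and `kl_categorical(q, p)` generally differ, which is why the order of the arguments in $d_{\mathrm{KL}}(\pi,\ \pi')$ matters.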

The optimization problem is then solved approximately, using sampled trajectories, inside a KL ball of radius $\delta$ (the trust region):

\max_{\theta'} L(\theta' \mid \theta) \approx \sum_{t = 0}^{\mathrm{T}} \gamma^{t} d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \frac{\pi_{\theta'}(a_{t} \mid s_{t})}{\pi_{\theta}(a_{t} \mid s_{t})} \quad \mathrm{s.t.}\ \ d_{\mathrm{KL}}(\pi_{\theta},\ \pi_{\theta'}) \approx \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \ln \frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta'}(a_{t} \mid s_{t})} \le \delta
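Both sampled quantities can be computed from the log-probabilities of the actions actually taken. The sketch below is an illustration under assumptions: the function name and the list-based inputs are hypothetical, and the KL estimate follows the definition of $d_{\mathrm{KL}}$ given earlier, i.e. the log of the old probability over the new one for actions sampled from the old policy.

```python
import math


def surrogate_and_kl(logp_old, logp_new, advantages, gamma):
    """Single-trajectory estimates of the surrogate objective and of d_KL.

    logp_old[t], logp_new[t]: log pi_theta(a_t|s_t) and log pi_theta'(a_t|s_t)
    advantages[t]: estimate of the advantage of (s_t, a_t) under pi_theta
    """
    surrogate = 0.0
    kl_estimate = 0.0
    for t, (lp_old, lp_new, adv) in enumerate(zip(logp_old, logp_new, advantages)):
        ratio = math.exp(lp_new - lp_old)          # importance weight
        surrogate += gamma ** t * ratio * adv
        kl_estimate += gamma ** t * (lp_old - lp_new)
    return surrogate, kl_estimate


# Hypothetical three-step trajectory.
logp_old = [math.log(0.5), math.log(0.4), math.log(0.6)]
logp_new = [math.log(0.6), math.log(0.3), math.log(0.6)]
advantages = [1.0, -0.5, 0.2]

s, kl = surrogate_and_kl(logp_old, logp_new, advantages, gamma=0.9)
within_trust_region = kl <= 0.05  # 0.05 stands in for the radius delta
```

A single-trajectory KL estimate can be negative even though the true divergence never is, so practical code would average it over many trajectories.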

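The advantage function in the objective is unknown and has to be estimated; it can be approximated from sampled trajectories and the value function: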

\begin{aligned} d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) &= q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) - v_{\pi_{\theta}}^{(t)}(s_{t}) = \mathcal{E}_{r_{t + 1}} r_{t + 1} + \gamma \mathcal{E}_{s_{t + 1}} v_{\pi_{\theta}}^{(t + 1)}(s_{t + 1}) - v_{\pi_{\theta}}^{(t)}(s_{t}) \\[5mm] &= \mathcal{E}_{r_{t + 1}} \mathcal{E}_{s_{t + 1}} \Big[ r_{t + 1} + \gamma v_{\pi_{\theta}}^{(t + 1)}(s_{t + 1}) - v_{\pi_{\theta}}^{(t)}(s_{t}) \Big] \Leftarrow \Delta_{t}^{(1)} \\[5mm] &= \mathcal{E}_{r_{t + 1}} \mathcal{E}_{s_{t + 1}} \Big[ r_{t + 1} + \gamma \mathcal{E}_{a_{t + 1}} q_{\pi_{\theta}}^{(t + 1)}(s_{t + 1},\ a_{t + 1}) - v_{\pi_{\theta}}^{(t)}(s_{t}) \Big] \\[5mm] &= \mathcal{E}_{r_{t + 1}} \mathcal{E}_{s_{t + 1}} \Big[ r_{t + 1} + \gamma \mathcal{E}_{a_{t + 1}} \Big[ \mathcal{E}_{r_{t + 2}} r_{t + 2} + \gamma \mathcal{E}_{s_{t + 2}} v_{\pi_{\theta}}^{(t + 2)}(s_{t + 2}) \Big] - v_{\pi_{\theta}}^{(t)}(s_{t}) \Big] \\[5mm] &= \mathcal{E}_{r_{t + 1}} \mathcal{E}_{s_{t + 1}} \mathcal{E}_{a_{t + 1}} \mathcal{E}_{r_{t + 2}} \mathcal{E}_{s_{t + 2}} \Big[ r_{t + 1} + \gamma r_{t + 2} + \gamma^{2} v_{\pi_{\theta}}^{(t + 2)}(s_{t + 2}) - v_{\pi_{\theta}}^{(t)}(s_{t}) \Big] \Leftarrow \Delta_{t}^{(2)} \\[5mm] &= \cdots = \mathcal{E}_{r_{t + 1}} \mathcal{E}_{s_{t + 1}} \mathcal{E}_{a_{t + 1}} \mathcal{E}_{r_{t + 2}} \mathcal{E}_{s_{t + 2}} \cdots \mathcal{E}_{r_{t + k}} \mathcal{E}_{s_{t + k}} \left[ \sum_{\tau = 0}^{k - 1} \gamma^{\tau} r_{t + 1 + \tau} + \gamma^{k} v_{\pi_{\theta}}^{(t + k)}(s_{t + k}) - v_{\pi_{\theta}}^{(t)}(s_{t}) \right] \Leftarrow \Delta_{t}^{(k)} \end{aligned}

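Under an infinite horizon the value functions no longer depend on the time index, and a single sampled trajectory gives the $k$-step estimate $\delta_{t}^{(k)}$, the counterpart of $\Delta_{t}^{(k)}$: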

d_{\pi_{\theta}}(s_{t},\ a_{t}) = q_{\pi_{\theta}}(s_{t},\ a_{t}) - v_{\pi_{\theta}}(s_{t}) \approx \delta_{t}^{(k)} = \sum_{\tau = 0}^{k - 1} \gamma^{\tau} r_{t + 1 + \tau} + \gamma^{k} v_{\pi_{\theta}}(s_{t + k}) - v_{\pi_{\theta}}(s_{t})
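The $k$-step estimate translates directly into code. The sketch below assumes a hypothetical storage layout in which `rewards[i]` holds $r_{i + 1}$ (the reward received after acting at step $i$) and `values[i]` holds the value estimate of $s_{i}$.

```python
def k_step_advantage(rewards, values, t, k, gamma):
    """k-step advantage estimate: discounted rewards plus bootstrap minus baseline."""
    ret = sum(gamma ** tau * rewards[t + tau] for tau in range(k))
    return ret + gamma ** k * values[t + k] - values[t]


# Hypothetical three-step trajectory; the last state is terminal (value 0).
rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]
gamma = 0.5

one_step = [k_step_advantage(rewards, values, t, 1, gamma) for t in range(3)]
two_step = k_step_advantage(rewards, values, 0, 2, gamma)

# The k-step estimate equals the discounted sum of one-step TD errors.
assert abs(two_step - (one_step[0] + gamma * one_step[1])) < 1e-12
```

Small $k$ leans on the value estimate (low variance, more bias), while large $k$ leans on the sampled rewards (less bias, higher variance).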

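Combined with the TD(λ) method, the advantage is estimated from a weighted mix of the temporal-difference terms of different step lengths: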

d_{\pi_{\theta}}(s_{t},\ a_{t}) \approx (1 - \lambda) \sum_{k = 1}^{\mathrm{T} - t - 1} \lambda^{k - 1} \delta_{t}^{(k)} + \lambda^{\mathrm{T} - t - 1} \delta_{t}^{(\mathrm{T} - t)} \overset{\mathrm{T} \to \infty}{\longrightarrow} (1 - \lambda) \sum_{k = 1}^{\infty} \lambda^{k - 1} \delta_{t}^{(k)}

The value function $v_{\pi_{\theta}}(s_{t})$ used in the estimate can be approximated and learned by a value network $v_{w}(s)$.
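The λ-weighted estimate can be computed either directly from the formula or with a backward recursion over one-step TD errors, $A_{t} = \delta_{t}^{(1)} + \gamma \lambda A_{t + 1}$, which is the form usually called generalized advantage estimation. The sketch below implements both and checks that they agree. It assumes a hypothetical layout where `rewards[i]` holds $r_{i + 1}$ and `values[i]` holds the value estimate of $s_{i}$, and it takes the number of remaining steps to be `len(rewards) - t`.

```python
def k_step(rewards, values, t, k, gamma):
    """k-step advantage estimate starting at time t."""
    ret = sum(gamma ** tau * rewards[t + tau] for tau in range(k))
    return ret + gamma ** k * values[t + k] - values[t]


def lambda_advantage_direct(rewards, values, t, gamma, lam):
    """(1 - lam) * sum_{k<n} lam^(k-1) * delta^(k) + lam^(n-1) * delta^(n)."""
    n = len(rewards) - t  # longest available step length from time t
    mix = (1.0 - lam) * sum(
        lam ** (k - 1) * k_step(rewards, values, t, k, gamma)
        for k in range(1, n))
    return mix + lam ** (n - 1) * k_step(rewards, values, t, n, gamma)


def lambda_advantages_recursive(rewards, values, gamma, lam):
    """Same estimates for every t, computed backwards in one pass."""
    adv = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        td = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = td + gamma * lam * running
        adv[t] = running
    return adv


# Hypothetical trajectory and value estimates.
rewards = [1.0, 0.0, 2.0, 1.0]
values = [0.3, 0.8, 0.1, 0.6, 0.0]
gamma, lam = 0.9, 0.95

recursive = lambda_advantages_recursive(rewards, values, gamma, lam)
direct = [lambda_advantage_direct(rewards, values, t, gamma, lam)
          for t in range(len(rewards))]
```

The recursion costs one pass over the trajectory, whereas the direct formula recomputes every $k$-step estimate, so the recursive form is the one normally used.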

Proximal Policy Optimization (PPO)

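Because TRPO carries a trust-region constraint, its optimization procedure is relatively complex. PPO instead moves the constraint into the objective and solves the problem approximately.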

PPO-Penalty

PPO-Penalty uses the method of Lagrange multipliers to move the KL-divergence constraint of the original TRPO algorithm into the objective:

\begin{aligned} L(\theta' \mid \theta) &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \left[ \frac{\pi_{\theta'}(a_{t} \mid s_{t})}{\pi_{\theta}(a_{t} \mid s_{t})} d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \right] - \beta d_{\mathrm{KL}}(\pi_{\theta},\ \pi_{\theta'}) \\[7mm] &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \left[ \frac{\pi_{\theta'}(a_{t} \mid s_{t})}{\pi_{\theta}(a_{t} \mid s_{t})} d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) - \beta D_{\mathrm{KL}} \Big( \pi_{\theta}(\cdot \mid s_{t})\ \|\ \pi_{\theta'}(\cdot \mid s_{t}) \Big) \right] \end{aligned}

With sampled trajectories this is approximated as:

L(\theta' \mid \theta) \approx \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \left[ \frac{\pi_{\theta'}(a_{t} \mid s_{t})}{\pi_{\theta}(a_{t} \mid s_{t})} d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) - \beta D_{\mathrm{KL}} \Big( \pi_{\theta}(\cdot \mid s_{t})\ \|\ \pi_{\theta'}(\cdot \mid s_{t}) \Big) \right]
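A minimal sketch of this sampled objective for discrete actions follows. The tuple layout of `steps` and the function names are assumptions made for illustration; the per-state KL term is computed exactly from the two action distributions.

```python
import math


def kl_categorical(p, q):
    """D_KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0.0)


def ppo_penalty_objective(steps, gamma, beta):
    """Sampled PPO-Penalty objective for one trajectory.

    steps[t] = (probs_old, probs_new, action, advantage), where probs_old and
    probs_new are the action distributions of the old and new policy at s_t.
    """
    total = 0.0
    for t, (p_old, p_new, a, adv) in enumerate(steps):
        ratio = p_new[a] / p_old[a]
        penalty = beta * kl_categorical(p_old, p_new)
        total += gamma ** t * (ratio * adv - penalty)
    return total


# Hypothetical two-step trajectory.
steps = [
    ([0.5, 0.5], [0.6, 0.4], 0, 1.0),
    ([0.5, 0.5], [0.6, 0.4], 1, -0.5),
]
without_penalty = ppo_penalty_objective(steps, gamma=0.9, beta=0.0)
with_penalty = ppo_penalty_objective(steps, gamma=0.9, beta=1.0)
```

Because the KL term is non-negative, a larger $\beta$ can only lower the objective for a given new policy, which is what pushes the maximizer back toward $\pi_{\theta}$.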

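The optimization problem becomes the unconstrained $\max_{\theta'} L(\theta' \mid \theta)$. To limit the gap between the learned policy and the policy of the previous round, the penalty coefficient $\beta$ is adapted after each round: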

\beta_{k + 1} \leftarrow \left\{ \begin{matrix} \dfrac{\beta_{k}}{2} & d_{k} < \dfrac{2}{3} \epsilon \\[5mm] 2 \beta_{k} & d_{k} > \dfrac{3}{2} \epsilon \\[5mm] \beta_{k} & \dfrac{2}{3} \epsilon \le d_{k} \le \dfrac{3}{2} \epsilon \end{matrix} \right.

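where $d_{k}$ is the KL divergence between the policies of round $k$ and round $k + 1$, and $\epsilon$ is a preset hyperparameter that acts as the target value for this divergence.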
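The update rule for $\beta$ is a three-way branch on the measured divergence. A minimal sketch, with a hypothetical function name and with $\epsilon$ read as the target KL value:

```python
def update_beta(beta, d_k, eps):
    """Adapt the KL penalty coefficient after one round of policy updates."""
    if d_k < 2.0 * eps / 3.0:
        return beta / 2.0   # policy moved too little: weaken the penalty
    if d_k > 3.0 * eps / 2.0:
        return beta * 2.0   # policy moved too much: strengthen the penalty
    return beta             # divergence is near the target: keep beta


beta = 1.0
for d_k in [0.001, 0.001, 0.05, 0.01]:  # hypothetical measured divergences
    beta = update_beta(beta, d_k, eps=0.01)
print(beta)  # 1.0 -> 0.5 -> 0.25 -> 0.5 -> 0.5
```

Halving and doubling keeps $\beta$ adjusting geometrically, so a poor initial value is corrected within a few rounds.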

PPO-Clip

Like PPO-Penalty, PPO-Clip also builds the limit on the size of the policy update into the objective, but the limit takes a different form:

L(\theta' \mid \theta) = \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \min \left[ \frac{\pi_{\theta'}(a_{t} \mid s_{t})}{\pi_{\theta}(a_{t} \mid s_{t})} d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}),\ \operatorname{clip} \left( \frac{\pi_{\theta'}(a_{t} \mid s_{t})}{\pi_{\theta}(a_{t} \mid s_{t})},\ 1 - \epsilon,\ 1 + \epsilon \right) d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \right]

where $\epsilon$ is a hyperparameter. With sampled trajectories this is approximated as:

L(\theta' \mid \theta) \approx \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \min \left[ \frac{\pi_{\theta'}(a_{t} \mid s_{t})}{\pi_{\theta}(a_{t} \mid s_{t})} d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}),\ \operatorname{clip} \left( \frac{\pi_{\theta'}(a_{t} \mid s_{t})}{\pi_{\theta}(a_{t} \mid s_{t})},\ 1 - \epsilon,\ 1 + \epsilon \right) d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \right]

Here the clipping function $\operatorname{clip}(x,\ l,\ r) = \max(\min(x,\ r),\ l)$ restricts $x$ to the interval $[l,\ r]$, and the outer $\min$ has the following effect:

Case	Restriction
$d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) > 0$	When $\dfrac{\pi_{\theta'}(a_{t} \mid s_{t})}{\pi_{\theta}(a_{t} \mid s_{t})} > 1 + \epsilon$, the gradient of this term becomes 0
$d_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) < 0$	When $\dfrac{\pi_{\theta'}(a_{t} \mid s_{t})}{\pi_{\theta}(a_{t} \mid s_{t})} < 1 - \epsilon$, the gradient of this term becomes 0
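The two cases in the table can be verified on a single term of the clipped objective. The sketch below uses hypothetical function names and a finite-difference derivative with respect to the probability ratio, which stands in for the gradient with respect to $\theta'$.

```python
def clip(x, lo, hi):
    """Restrict x to the interval [lo, hi]."""
    return max(min(x, hi), lo)


def clipped_term(ratio, adv, eps):
    """One term of the PPO-Clip objective for a given ratio and advantage."""
    return min(ratio * adv, clip(ratio, 1.0 - eps, 1.0 + eps) * adv)


def ppo_clip_objective(ratios, advantages, gamma, eps):
    """Sampled PPO-Clip objective for one trajectory."""
    return sum(gamma ** t * clipped_term(r, a, eps)
               for t, (r, a) in enumerate(zip(ratios, advantages)))


def ratio_derivative(ratio, adv, eps, h=1e-6):
    """Central finite difference of clipped_term with respect to the ratio."""
    return (clipped_term(ratio + h, adv, eps)
            - clipped_term(ratio - h, adv, eps)) / (2.0 * h)


eps = 0.2
# Positive advantage, ratio above 1 + eps: the term is flat in the ratio.
flat_pos = ratio_derivative(1.5, 1.0, eps)
# Negative advantage, ratio below 1 - eps: the term is flat in the ratio.
flat_neg = ratio_derivative(0.5, -1.0, eps)
# In the opposite situations the unclipped term is selected and still has slope.
slope_pos = ratio_derivative(0.5, 1.0, eps)
slope_neg = ratio_derivative(1.5, -1.0, eps)
```

The clipping therefore only removes the incentive to move further in the direction that already improved the objective; a ratio that has drifted the wrong way still receives a gradient that pulls it back.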

http://example.com/2024/07/19/TRPO&PPO/

Author

木辛

Posted on

July 19, 2024
