Soft Actor-Critic

SAC (Soft Actor-Critic)

Soft Policy Gradient

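Once the policy is parameterized, the gradient of the entropy regularization term can be computed with the same approach used for the policy gradient: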

\begin{aligned} \nabla_{\theta} \Omega(\theta) &= -\nabla_{\theta} \sum_{s_{0}} \sum_{a_{0}} \sum_{s_{1}} \sum_{a_{1}} \cdots \sum_{s_{\mathrm{T}}} \sum_{a_{\mathrm{T}}} \left[ b_{0}(s_{0}) \prod_{t = 0}^{\mathrm{T}} \pi_{\theta}(a_{t} \mid s_{t}) \prod_{t = 1}^{\mathrm{T}} p(s_{t} \mid s_{t - 1},\ a_{t - 1}) \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \ln \pi_{\theta}(a_{t} \mid s_{t}) \right] \\[7mm] &= -\sum_{s_{0}} \sum_{a_{0}} \sum_{s_{1}} \sum_{a_{1}} \cdots \sum_{s_{\mathrm{T}}} \sum_{a_{\mathrm{T}}} \left[ b_{0}(s_{0}) \left( \nabla_{\theta} \prod_{t = 0}^{\mathrm{T}} \pi_{\theta}(a_{t} \mid s_{t}) \right) \prod_{t = 1}^{\mathrm{T}} p(s_{t} \mid s_{t - 1},\ a_{t - 1}) \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \ln \pi_{\theta}(a_{t} \mid s_{t}) \right] \\[7mm] &\quad - \sum_{s_{0}} \sum_{a_{0}} \sum_{s_{1}} \sum_{a_{1}} \cdots \sum_{s_{\mathrm{T}}} \sum_{a_{\mathrm{T}}} \left[ b_{0}(s_{0}) \prod_{t = 0}^{\mathrm{T}} \pi_{\theta}(a_{t} \mid s_{t}) \prod_{t = 1}^{\mathrm{T}} p(s_{t} \mid s_{t - 1},\ a_{t - 1}) \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \right] \\[7mm] &= -\mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \left[ \left( \sum_{t = 0}^{\mathrm{T}} \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \right) \cdot \left( \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \ln \pi_{\theta}(a_{t} \mid s_{t}) \right) + \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \right] \end{aligned}

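The second term inside the expectation vanishes, because the score function $\nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t})$ has zero mean under the policy: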

\sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \underset{0}{\underbrace{\mathcal{E}_{a_{t}} \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t})}} = 0
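The zero-mean property of the score function can be checked numerically. The sketch below assumes a tabular softmax policy over three actions in a single state (an illustrative choice, not something fixed by the derivation); the function names `softmax`, `score`, and `expected_score` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def score(logits, a):
    """Gradient of ln pi(a) w.r.t. the logits of a softmax policy: onehot(a) - pi."""
    pi = softmax(logits)
    return [(1.0 if k == a else 0.0) - pi[k] for k in range(len(logits))]

def expected_score(logits):
    """Exact expectation of the score over a ~ pi (assumed discrete action set)."""
    pi = softmax(logits)
    n = len(logits)
    total = [0.0] * n
    for a in range(n):
        g = score(logits, a)
        for k in range(n):
            total[k] += pi[a] * g[k]
    return total

# Every component is zero up to floating-point rounding.
print(expected_score([0.3, -1.2, 2.0]))
```

The same cancellation holds for any differentiable policy, since $\sum_{a} \nabla_{\theta} \pi_{\theta}(a \mid s) = \nabla_{\theta} 1 = 0$.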

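By an argument similar to the one used for the policy gradient, the cross term $\nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \ln \pi_{\theta}(a_{\tau} \mid s_{\tau})$ has zero expectation whenever $t > \tau$, since $\ln \pi_{\theta}(a_{\tau} \mid s_{\tau})$ is already determined before $a_{t}$ is sampled: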

\mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \ln \pi_{\theta}(a_{\tau} \mid s_{\tau}) \Big] = \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \Big[ \ln \pi_{\theta}(a_{\tau} \mid s_{\tau}) \underset{0}{\underbrace{\mathcal{E}_{a_{t}} \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t})}} \Big] = 0

Only the terms with $\tau \ge t$ survive, so the gradient of the entropy regularization term simplifies to:

\nabla_{\theta} \Omega(\theta) = -\sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \left[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \mathcal{E}_{s_{t + 1}} \mathcal{E}_{a_{t + 1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \sum_{\tau = t}^{\mathrm{T}} \gamma^{\tau - t} \ln \pi_{\theta}(a_{\tau} \mid s_{\tau}) \right]

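Combined with the policy gradient of $J(\theta)$, the gradient of the overall objective $J_{\mathcal{H}}(\theta) = J(\theta) + \alpha \Omega(\theta)$ can be written as: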

\begin{aligned} \nabla_{\theta} J_{\mathcal{H}}(\theta) &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \left[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \mathcal{E}_{s_{t + 1}} \mathcal{E}_{a_{t + 1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \sum_{\tau = t}^{\mathrm{T}} \gamma^{\tau - t} \mathcal{R}(s_{\tau},\ a_{\tau}) \right] \\[7mm] &- \alpha \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \left[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \mathcal{E}_{s_{t + 1}} \mathcal{E}_{a_{t + 1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \sum_{\tau = t}^{\mathrm{T}} \gamma^{\tau - t} \ln \pi_{\theta}(a_{\tau} \mid s_{\tau}) \right] \\[7mm] &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \left[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \Big( q_{\mathcal{H}}^{(t)}(s_{t},\ a_{t}) - \alpha \ln \pi_{\theta}(a_{t} \mid s_{t}) \Big) \right] \\[7mm] &\overset{\mathrm{T} \to \infty}{\longrightarrow} \frac{1}{1 - \gamma} \mathcal{E}_{s \sim \nu_{\pi_{\theta}}(\cdot)} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \left[ \nabla_{\theta} \ln \pi_{\theta}(a \mid s) \Big( q_{\mathcal{H}}(s,\ a) - \alpha \ln \pi_{\theta}(a \mid s) \Big) \right] \end{aligned}
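As a sanity check on the soft policy gradient, the sketch below compares the score-function form against finite differences in the simplest possible setting: a single state and horizon $\mathrm{T} = 0$, so that $q_{\mathcal{H}}^{(0)}(s,\ a) = \mathcal{R}(s,\ a)$. The softmax parameterization, the reward vector, and all function names are assumptions made for illustration only.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def soft_objective(logits, rewards, alpha):
    """J_H = E_a[ r(a) - alpha * ln pi(a) ] for a one-step, single-state problem."""
    pi = softmax(logits)
    return sum(p * (r - alpha * math.log(p)) for p, r in zip(pi, rewards))

def soft_policy_gradient(logits, rewards, alpha):
    """Exact E_a[ grad ln pi(a) * (r(a) - alpha * ln pi(a)) ] w.r.t. the logits."""
    pi = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for a in range(n):
        weight = rewards[a] - alpha * math.log(pi[a])
        for k in range(n):
            score_k = (1.0 if k == a else 0.0) - pi[k]
            grad[k] += pi[a] * score_k * weight
    return grad

def finite_difference(logits, rewards, alpha, eps=1e-6):
    """Central-difference gradient of soft_objective, used only as a reference."""
    grad = []
    for k in range(len(logits)):
        up, down = list(logits), list(logits)
        up[k] += eps
        down[k] -= eps
        diff = soft_objective(up, rewards, alpha) - soft_objective(down, rewards, alpha)
        grad.append(diff / (2 * eps))
    return grad

theta = [0.2, -0.5, 1.0]      # hypothetical logits
rewards = [1.0, 0.0, 0.5]     # hypothetical rewards
print(soft_policy_gradient(theta, rewards, alpha=0.3))
print(finite_difference(theta, rewards, alpha=0.3))
```

The two printed vectors should agree closely, which reflects that the extra $-\alpha \sum_{a} \nabla_{\theta} \pi_{\theta}(a \mid s)$ term produced by differentiating the entropy is exactly zero.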

To learn an approximation of $q_{\mathcal{H}}(s,\ a)$, SAC uses a critic network $q_{w}(s,\ a)$ and updates its parameters by minimizing the squared TD error:

\mathcal{L}_{q}(w) = \frac{1}{2n} \sum_{i = 1}^{n} \Big[ q_{w}(s_{i},\ a_{i}) - r_{i} - \gamma \Big( q_{w}(s_{i}',\ a_{i}') - \alpha \ln \pi_{\theta}(a_{i}' \mid s_{i}') \Big) \Big]^{2}

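Here the action $a_{i}' \sim \pi_{\theta}(\cdot \mid s_{i}')$ is sampled from the current policy. To decouple the bootstrapped target from the network being trained and to curb overestimation, two critics can be used, each with its own target network. The loss is then applied to each critic $w \in \{w_{1},\ w_{2}\}$, with $w_{j}^{-}$ denoting the target-network parameters: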

\mathcal{L}_{q}(w) = \frac{1}{2n} \sum_{i = 1}^{n} \Big[ q_{w}(s_{i},\ a_{i}) - r_{i} - \gamma \Big( \min_{j \in \{1,\ 2\}} q_{w_{j}^{-}}(s_{i}',\ a_{i}') - \alpha \ln \pi_{\theta}(a_{i}' \mid s_{i}') \Big) \Big]^{2}
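The target inside the squared bracket can be sketched in a few lines. The snippet below works on plain Python floats rather than network outputs; the function names and the default values of `gamma` and `alpha` are illustrative assumptions, and terminal-state handling is left out because the formula above does not include it.

```python
def soft_td_target(r, q1_next, q2_next, logp_next, gamma=0.99, alpha=0.2):
    """r + gamma * (min of the two target critics - alpha * ln pi(a'|s'))."""
    return r + gamma * (min(q1_next, q2_next) - alpha * logp_next)

def critic_loss(q_pred, targets):
    """Mean squared TD error with the 1/(2n) normalization used in the text."""
    n = len(q_pred)
    return sum((q - y) ** 2 for q, y in zip(q_pred, targets)) / (2 * n)

# One hypothetical transition: reward 1.0, target critics 2.0 and 1.5,
# log-probability of the freshly sampled next action -1.0.
y = soft_td_target(1.0, 2.0, 1.5, -1.0, gamma=0.9, alpha=0.2)
print(y)
print(critic_loss([2.0], [y]))
```

Taking the minimum of the two target critics biases the target downward, which is the mechanism that counteracts overestimation. In a real implementation the target would also be treated as a constant when differentiating the loss.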

Adaptive Temperature Coefficient

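Because the original SAC algorithm is sensitive to the temperature coefficient $\alpha$, the optimization problem is rewritten in constrained form under finite-horizon planning, with $\mathcal{H}$ a lower bound on the expected policy entropy at every time step: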

\begin{gathered} \max_{\pi^{(0)},\ \pi^{(1)},\ \cdots,\ \pi^{(\mathrm{T})}} J(\pi) = \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi^{(0)}(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi^{(1)}(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{\mathrm{T}} \sim p(\cdot \mid s_{\mathrm{T} - 1},\ a_{\mathrm{T} - 1})} \mathcal{E}_{a_{\mathrm{T}} \sim \pi^{(\mathrm{T})}(\cdot \mid s_{\mathrm{T}})} \left[ \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{R}(s_{t},\ a_{t}) \right] \\[7mm] \mathrm{s.t.} \quad \forall\ t : \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi^{(0)}(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi^{(1)}(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{t} \sim p(\cdot \mid s_{t - 1},\ a_{t - 1})} \mathcal{E}_{a_{t} \sim \pi^{(t)}(\cdot \mid s_{t})} \Big[ -\ln \pi^{(t)}(a_{t} \mid s_{t}) \Big] \ge \mathcal{H} \end{gathered}

The maximum of the objective can be decomposed recursively:

\max_{\pi^{(0)},\ \pi^{(1)},\ \cdots,\ \pi^{(\mathrm{T})}} J(\pi) = \max_{\pi^{(0)}} \left[ \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \Big[ \mathcal{R}(s_{0},\ a_{0}) \Big] + \gamma \max_{\pi^{(1)}} \left[ \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \Big[ \mathcal{R}(s_{1},\ a_{1}) \Big] + \gamma \max_{\pi^{(2)}} (\cdots) \right] \right]

Each maximization must satisfy its corresponding constraint. The innermost maximization is:

\max_{\pi^{(\mathrm{T})}} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \Big[ \mathcal{R}(s_{\mathrm{T}},\ a_{\mathrm{T}}) \Big] \quad \mathrm{s.t.}\quad \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \Big[ -\ln \pi^{(\mathrm{T})}(a_{\mathrm{T}} \mid s_{\mathrm{T}}) \Big] \ge \mathcal{H}

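The objective is linear in $\pi^{(\mathrm{T})}$ and the entropy constraint defines a convex feasible set, so the constrained maximization can be converted into its dual problem, with dual variable $\alpha^{(\mathrm{T})} \ge 0$: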

\min_{\alpha^{(\mathrm{T})}} \max_{\pi^{(\mathrm{T})}} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \Big[ \mathcal{R}(s_{\mathrm{T}},\ a_{\mathrm{T}}) - \alpha^{(\mathrm{T})} \ln \pi^{(\mathrm{T})}(a_{\mathrm{T}} \mid s_{\mathrm{T}}) - \alpha^{(\mathrm{T})} \mathcal{H} \Big]

Here the maximizer $\pi_{\star}^{(\mathrm{T})}$ of the inner $\max_{\pi^{(\mathrm{T})}}$ is the soft-optimal policy under temperature $\alpha^{(\mathrm{T})}$:

\begin{aligned} \pi_{\star}^{(\mathrm{T})} &= \argmax_{\pi^{(\mathrm{T})}} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \Big[ \mathcal{R}(s_{\mathrm{T}},\ a_{\mathrm{T}}) - \alpha^{(\mathrm{T})} \ln \pi^{(\mathrm{T})}(a_{\mathrm{T}} \mid s_{\mathrm{T}}) - \alpha^{(\mathrm{T})} \mathcal{H} \Big] \\[5mm] &= \argmax_{\pi^{(\mathrm{T})}(\cdot \mid s_{\mathrm{T}})} \mathcal{E}_{a_{\mathrm{T}} \sim \pi^{(\mathrm{T})}(\cdot \mid s_{\mathrm{T}})} \Big[ q_{\star}^{(\mathrm{T})}(s_{\mathrm{T}},\ a_{\mathrm{T}}) - \alpha^{(\mathrm{T})} \ln \pi^{(\mathrm{T})}(a_{\mathrm{T}} \mid s_{\mathrm{T}}) \Big] \propto \exp \left[ \frac{1}{\alpha^{(\mathrm{T})}} q_{\star}^{(\mathrm{T})}(s_{\mathrm{T}},\ a_{\mathrm{T}}) \right] \end{aligned}
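The closed form of the soft-optimal policy can be verified numerically for a discrete action set. The sketch below assumes a single state with three actions and hypothetical values for $q_{\star}^{(\mathrm{T})}$; it checks that the Boltzmann policy attains the value $\alpha \ln \sum_{a} \exp(q / \alpha)$ and that randomly drawn policies do not exceed it. All names are illustrative.

```python
import math
import random

def boltzmann(q, alpha):
    """pi(a) proportional to exp(q(a) / alpha), computed stably."""
    m = max(q)
    exps = [math.exp((x - m) / alpha) for x in q]
    z = sum(exps)
    return [e / z for e in exps]

def soft_value(pi, q, alpha):
    """E_{a ~ pi}[ q(a) - alpha * ln pi(a) ]."""
    return sum(p * (x - alpha * math.log(p)) for p, x in zip(pi, q) if p > 0)

def log_sum_exp_value(q, alpha):
    """alpha * ln sum_a exp(q(a) / alpha), the optimum of soft_value."""
    m = max(q)
    return m + alpha * math.log(sum(math.exp((x - m) / alpha) for x in q))

q = [1.0, 0.5, -0.3]   # hypothetical action values
alpha = 0.5
best = soft_value(boltzmann(q, alpha), q, alpha)
print(best, log_sum_exp_value(q, alpha))

# Random competitors never beat the Boltzmann policy.
rng = random.Random(0)
for _ in range(1000):
    w = [rng.random() + 1e-9 for _ in q]
    pi = [x / sum(w) for x in w]
    assert soft_value(pi, q, alpha) <= best + 1e-12
```

A smaller $\alpha$ concentrates the policy on the greedy action, while a larger $\alpha$ pushes it toward the uniform distribution.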

Note that $\pi_{\star}^{(\mathrm{T})}$ is itself a function of $\alpha^{(\mathrm{T})}$. Substituting it back into the dual problem gives the optimal temperature coefficient:

\begin{aligned} \alpha_{\star}^{(\mathrm{T})} &= \argmin_{\alpha^{(\mathrm{T})}} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \Big[ \mathcal{R}(s_{\mathrm{T}},\ a_{\mathrm{T}}) - \alpha^{(\mathrm{T})} \ln \pi_{\star}^{(\mathrm{T})}(a_{\mathrm{T}} \mid s_{\mathrm{T}}) - \alpha^{(\mathrm{T})} \mathcal{H} \Big] \\[5mm] &= \argmin_{\alpha^{(\mathrm{T})}} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}} \sim \pi_{\star}^{(\mathrm{T})}(\cdot \mid s_{\mathrm{T}})} \Big[ - \alpha^{(\mathrm{T})} \ln \pi_{\star}^{(\mathrm{T})}(a_{\mathrm{T}} \mid s_{\mathrm{T}}) - \alpha^{(\mathrm{T})} \mathcal{H} \Big] \end{aligned}

Substituting the optimal coefficient $\alpha_{\star}^{(\mathrm{T})}$ and its corresponding soft-optimal policy into the second-innermost maximization gives:

\max_{\pi^{(\mathrm{T} - 1)}} \bigg\{ \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T} - 1}} \mathcal{E}_{a_{\mathrm{T} - 1}} \Big[ \mathcal{R}(s_{\mathrm{T} - 1},\ a_{\mathrm{T} - 1}) + \gamma \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \Big[ \mathcal{R}(s_{\mathrm{T}},\ a_{\mathrm{T}}) - \alpha_{\star}^{(\mathrm{T})} \ln \pi_{\star}^{(\mathrm{T})}(a_{\mathrm{T}} \mid s_{\mathrm{T}}) \Big] \Big] - \gamma \alpha_{\star}^{(\mathrm{T})} \mathcal{H} \bigg\}

The soft-optimal action-value function satisfies:

q_{\star}^{(t)}(s_{t},\ a_{t}) = \mathcal{R}(s_{t},\ a_{t}) +\gamma \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi_{\star}^{(t + 1)}(\cdot \mid s_{t + 1})} \Big[ q_{\star}^{(t + 1)}(s_{t + 1},\ a_{t + 1}) - \alpha_{\star}^{(t + 1)} \ln \pi_{\star}^{(t + 1)}(a_{t + 1} \mid s_{t + 1}) \Big]

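Together with the terminal condition $q_{\star}^{(\mathrm{T})}(s_{\mathrm{T}},\ a_{\mathrm{T}}) = \mathcal{R}(s_{\mathrm{T}},\ a_{\mathrm{T}})$, and after dropping the constant $-\gamma \alpha_{\star}^{(\mathrm{T})} \mathcal{H}$, which does not depend on $\pi^{(\mathrm{T} - 1)}$, the second-innermost maximization becomes: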

\max_{\pi^{(\mathrm{T} - 1)}} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T} - 1}} \mathcal{E}_{a_{\mathrm{T} - 1}} \Big[ q_{\star}^{(\mathrm{T} - 1)}(s_{\mathrm{T} - 1},\ a_{\mathrm{T} - 1}) \Big] \quad \mathrm{s.t.} \quad \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T} - 1}} \mathcal{E}_{a_{\mathrm{T} - 1}} \Big[ -\ln \pi^{(\mathrm{T} - 1)}(a_{\mathrm{T} - 1} \mid s_{\mathrm{T} - 1}) \Big] \ge \mathcal{H}

Applying the same procedure yields:

\begin{gathered} \pi_{\star}^{(\mathrm{T} - 1)} = \argmax_{\pi^{(\mathrm{T} - 1)}(\cdot \mid s_{\mathrm{T} - 1})} \mathcal{E}_{a_{\mathrm{T} - 1} \sim \pi^{(\mathrm{T} - 1)}(\cdot \mid s_{\mathrm{T} - 1})} \Big[ q_{\star}^{(\mathrm{T} - 1)}(s_{\mathrm{T} - 1},\ a_{\mathrm{T} - 1}) - \alpha^{(\mathrm{T} - 1)} \ln \pi^{(\mathrm{T} - 1)}(a_{\mathrm{T} - 1} \mid s_{\mathrm{T} - 1}) \Big] \\[5mm] \alpha_{\star}^{(\mathrm{T} - 1)} = \argmin_{\alpha^{(\mathrm{T} - 1)}} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T} - 1}} \mathcal{E}_{a_{\mathrm{T} - 1} \sim \pi_{\star}^{(\mathrm{T} - 1)}(\cdot \mid s_{\mathrm{T} - 1})} \Big[ - \alpha^{(\mathrm{T} - 1)} \ln \pi_{\star}^{(\mathrm{T} - 1)}(a_{\mathrm{T} - 1} \mid s_{\mathrm{T} - 1}) - \alpha^{(\mathrm{T} - 1)} \mathcal{H} \Big] \end{gathered}

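Repeating this procedure yields the optimal temperature coefficient $\alpha_{\star}^{(t)}$ and the corresponding soft-optimal policy $\pi_{\star}^{(t)}$ at every planning step. Under a parameterized policy and the convexity assumption, the truncated objective for the temperature coefficient can be solved by dual gradient descent: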

J(\alpha) = \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \Big[ -\alpha \ln \pi_{\theta}(a \mid s) - \alpha \mathcal{H} \Big]
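Dual gradient descent on $J(\alpha)$ can be illustrated on a toy problem. The sketch below assumes a single state with fixed, hypothetical action values, and it replaces the parameterized policy update with the exact Boltzmann policy for the current $\alpha$; practical SAC implementations instead use sampled log-probabilities and alternate gradient steps on the policy and on $\alpha$. The names `tune_temperature`, `lr`, and `steps` are illustrative.

```python
import math

def boltzmann(q, alpha):
    """Soft-optimal policy for temperature alpha: pi(a) proportional to exp(q(a) / alpha)."""
    m = max(q)
    exps = [math.exp((x - m) / alpha) for x in q]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(pi):
    return -sum(p * math.log(p) for p in pi if p > 0)

def tune_temperature(q, target_entropy, alpha=1.0, lr=0.1, steps=2000):
    """Alternate the inner max (exact here) with a gradient step on J(alpha)."""
    for _ in range(steps):
        pi = boltzmann(q, alpha)
        # dJ/dalpha with pi held fixed: E[-ln pi] - H = entropy(pi) - H
        grad = entropy(pi) - target_entropy
        # Projected step keeps the dual variable positive.
        alpha = max(alpha - lr * grad, 1e-6)
    return alpha

q = [1.0, 0.5, 0.0]          # hypothetical action values
alpha = tune_temperature(q, target_entropy=0.9)
print(alpha, entropy(boltzmann(q, alpha)))
```

When the policy entropy is below the target the gradient is negative, so $\alpha$ grows and the policy becomes more random; when the entropy exceeds the target, $\alpha$ shrinks. The iteration settles where the entropy constraint is met with equality.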


http://example.com/2024/07/20/SAC/

Author

木辛

Posted on

July 20, 2024
