Soft Actor-Critic

SAC (Soft Actor-Critic)

Soft Policy Gradient

After parameterizing the policy, the gradient of the entropy regularization term can be computed in the same way as the policy gradient:

$$
\begin{aligned}
\nabla_{\theta} \Omega(\theta) &= -\nabla_{\theta} \sum_{s_{0}} \sum_{a_{0}} \sum_{s_{1}} \sum_{a_{1}} \cdots \sum_{s_{\mathrm{T}}} \sum_{a_{\mathrm{T}}} \left[ b_{0}(s_{0}) \prod_{t = 0}^{\mathrm{T}} \pi_{\theta}(a_{t} \mid s_{t}) \prod_{t = 1}^{\mathrm{T}} p(s_{t} \mid s_{t - 1},\ a_{t - 1}) \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \ln \pi_{\theta}(a_{t} \mid s_{t}) \right] \\[7mm]
&= -\sum_{s_{0}} \sum_{a_{0}} \sum_{s_{1}} \sum_{a_{1}} \cdots \sum_{s_{\mathrm{T}}} \sum_{a_{\mathrm{T}}} \left[ b_{0}(s_{0}) \left( \nabla_{\theta} \prod_{t = 0}^{\mathrm{T}} \pi_{\theta}(a_{t} \mid s_{t}) \right) \prod_{t = 1}^{\mathrm{T}} p(s_{t} \mid s_{t - 1},\ a_{t - 1}) \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \ln \pi_{\theta}(a_{t} \mid s_{t}) \right] \\[7mm]
&\quad - \sum_{s_{0}} \sum_{a_{0}} \sum_{s_{1}} \sum_{a_{1}} \cdots \sum_{s_{\mathrm{T}}} \sum_{a_{\mathrm{T}}} \left[ b_{0}(s_{0}) \prod_{t = 0}^{\mathrm{T}} \pi_{\theta}(a_{t} \mid s_{t}) \prod_{t = 1}^{\mathrm{T}} p(s_{t} \mid s_{t - 1},\ a_{t - 1}) \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \right] \\[7mm]
&= -\mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \left[ \left( \sum_{t = 0}^{\mathrm{T}} \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \right) \cdot \left( \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \ln \pi_{\theta}(a_{t} \mid s_{t}) \right) + \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \right]
\end{aligned}
$$

The second part of this gradient simplifies to 0:

$$
\sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \underset{0}{\underbrace{\mathcal{E}_{a_{t}} \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t})}} = 0
$$
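
This relies on the identity $\mathcal{E}_{a_{t}}\big[\nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t})\big] = \sum_{a_{t}} \nabla_{\theta} \pi_{\theta}(a_{t} \mid s_{t}) = \nabla_{\theta} 1 = 0$ (written for a discrete action space; the continuous case replaces the sum with an integral). Below is a quick numerical check of this identity, a minimal sketch in PyTorch where a 4-logit softmax policy plays the role of $\pi_{\theta}$:

```python
import torch

# Softmax policy over 4 actions; the logits play the role of theta.
theta = torch.randn(4, requires_grad=True)
pi = torch.softmax(theta, dim=0)

# E_{a ~ pi}[ grad_theta ln pi(a) ] = sum_a pi(a) * grad_theta ln pi(a)
expected_score = torch.zeros_like(theta)
for a in range(4):
    grad_log_pi = torch.autograd.grad(torch.log(pi[a]), theta, retain_graph=True)[0]
    expected_score += pi[a].detach() * grad_log_pi

print(expected_score)  # all entries are ~0, up to floating-point error
```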

Using the same technique as in the policy gradient derivation, the cross term $\nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \ln \pi_{\theta}(a_{\tau} \mid s_{\tau})$ can be shown to have zero expectation whenever $t > \tau$:

$$
\mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \ln \pi_{\theta}(a_{\tau} \mid s_{\tau}) \Big] = \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \Big[ \ln \pi_{\theta}(a_{\tau} \mid s_{\tau}) \underset{0}{\underbrace{\mathcal{E}_{a_{t}} \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t})}} \Big] = 0
$$

The gradient of the entropy regularization term therefore simplifies to:

$$
\nabla_{\theta} \Omega(\theta) = -\sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \left[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \mathcal{E}_{s_{t + 1}} \mathcal{E}_{a_{t + 1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \sum_{\tau = t}^{\mathrm{T}} \gamma^{\tau - t} \ln \pi_{\theta}(a_{\tau} \mid s_{\tau}) \right]
$$

Combining this with the policy gradient of $J(\theta)$, the gradient of the overall objective can be written as:

$$
\begin{aligned}
\nabla_{\theta} J_{\mathcal{H}}(\theta) &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \left[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \mathcal{E}_{s_{t + 1}} \mathcal{E}_{a_{t + 1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \sum_{\tau = t}^{\mathrm{T}} \gamma^{\tau - t} \mathcal{R}(s_{\tau},\ a_{\tau}) \right] \\[7mm]
&\quad - \alpha \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \left[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \mathcal{E}_{s_{t + 1}} \mathcal{E}_{a_{t + 1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \sum_{\tau = t}^{\mathrm{T}} \gamma^{\tau - t} \ln \pi_{\theta}(a_{\tau} \mid s_{\tau}) \right] \\[7mm]
&= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \left[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \Big( q_{\mathcal{H}}^{(t)}(s_{t},\ a_{t}) - \alpha \ln \pi_{\theta}(a_{t} \mid s_{t}) \Big) \right] \\[7mm]
&\overset{\mathrm{T} \to \infty}{\longrightarrow} \frac{1}{1 - \gamma} \mathcal{E}_{s \sim \nu_{\pi_{\theta}}(\cdot)} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \left[ \nabla_{\theta} \ln \pi_{\theta}(a \mid s) \Big( q_{\mathcal{H}}(s,\ a) - \alpha \ln \pi_{\theta}(a \mid s) \Big) \right]
\end{aligned}
$$
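
In implementation terms this is a score-function estimator: sample $(s,\ a)$ from the current policy, weight $\nabla_{\theta} \ln \pi_{\theta}(a \mid s)$ by $q_{\mathcal{H}}(s,\ a) - \alpha \ln \pi_{\theta}(a \mid s)$ treated as a constant, and average. The following is a minimal sketch for a discrete action space, where `policy_net` and `soft_q` are hypothetical stand-ins for the actor and for an estimate of $q_{\mathcal{H}}$; the $1/(1 - \gamma)$ factor and the state distribution $\nu_{\pi_{\theta}}$ are assumed to be absorbed into how the batch is sampled:

```python
import torch

def actor_surrogate_loss(policy_net, soft_q, states, actions, alpha):
    """Surrogate loss whose gradient matches E[ grad log pi * (q_H - alpha * log pi) ]."""
    logits = policy_net(states)                    # (B, num_actions)
    log_pi = torch.log_softmax(logits, dim=-1)     # log pi(. | s)
    log_pi_a = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                          # the weight is treated as a constant
        weight = soft_q(states, actions) - alpha * log_pi_a
    # Minimizing this loss performs gradient ascent on J_H
    return -(log_pi_a * weight).mean()
```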

To learn an approximation of $q_{\mathcal{H}}(s,\ a)$, SAC uses a critic network $q_{w}(s,\ a)$ whose parameters are updated by minimizing a TD-error loss:

$$
\mathcal{L}_{q}(w) = \frac{1}{2n} \sum_{i = 1}^{n} \Big[ q_{w}(s_{i},\ a_{i}) - r_{i} - \gamma \Big( q_{w}(s_{i}',\ a_{i}') - \alpha \ln \pi_{\theta}(a_{i}' \mid s_{i}') \Big) \Big]^{2}
$$

where the action $a_{i}' \sim \pi_{\theta}(\cdot \mid s_{i}')$. To break the bootstrapping loop and curb overestimation, two critics together with target networks can be used:

$$
\mathcal{L}_{q}(w) = \frac{1}{2n} \sum_{i = 1}^{n} \Big[ q_{w}(s_{i},\ a_{i}) - r_{i} - \gamma \Big( \min_{j \in \{1,\ 2\}} q_{w_{j}^{-}}(s_{i}',\ a_{i}') - \alpha \ln \pi_{\theta}(a_{i}' \mid s_{i}') \Big) \Big]^{2}
$$
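
The following is a minimal PyTorch sketch of this clipped double-Q target; names such as `actor.sample` and the `done` mask are illustrative assumptions rather than part of the derivation, and in practice both critics are usually regressed onto the same target:

```python
import torch
import torch.nn.functional as F

def critic_loss(q1, q2, q1_targ, q2_targ, actor, batch, gamma, alpha):
    s, a, r, s_next, done = batch                  # tensors of shape (B, ...) / (B,)
    with torch.no_grad():
        a_next, log_pi_next = actor.sample(s_next)             # a' ~ pi_theta(.|s'), ln pi_theta(a'|s')
        q_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        target = r + gamma * (1.0 - done) * (q_next - alpha * log_pi_next)
    # 1/(2n) * sum of squared TD errors, one term per critic
    return 0.5 * F.mse_loss(q1(s, a), target) + 0.5 * F.mse_loss(q2(s, a), target)
```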

Adaptive Temperature Coefficient

Since the original SAC algorithm is rather sensitive to the temperature coefficient $\alpha$, the optimization problem is rewritten in a constrained form under a finite planning horizon:

$$
\begin{gathered}
\max_{\pi^{(0)},\ \pi^{(1)},\ \cdots,\ \pi^{(\mathrm{T})}} J(\pi) = \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi^{(0)}(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi^{(1)}(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{\mathrm{T}} \sim p(\cdot \mid s_{\mathrm{T} - 1},\ a_{\mathrm{T} - 1})} \mathcal{E}_{a_{\mathrm{T}} \sim \pi^{(\mathrm{T})}(\cdot \mid s_{\mathrm{T}})} \left[ \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{R}(s_{t},\ a_{t}) \right] \\[7mm]
\mathrm{s.t.} \quad \forall\ t : \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi^{(0)}(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi^{(1)}(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{t} \sim p(\cdot \mid s_{t - 1},\ a_{t - 1})} \mathcal{E}_{a_{t} \sim \pi^{(t)}(\cdot \mid s_{t})} \Big[ -\ln \pi^{(t)}(a_{t} \mid s_{t}) \Big] \ge \mathcal{H}
\end{gathered}
$$

The maximum of the objective function can be decomposed recursively (the reward at each step does not depend on the policies of later steps, so the maximizations can be nested):

$$
\max_{\pi^{(0)},\ \pi^{(1)},\ \cdots,\ \pi^{(\mathrm{T})}} J(\pi) = \max_{\pi^{(0)}} \left[ \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \Big[ \mathcal{R}(s_{0},\ a_{0}) \Big] + \gamma \max_{\pi^{(1)}} \left[ \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \Big[ \mathcal{R}(s_{1},\ a_{1}) \Big] + \gamma \max_{\pi^{(2)}} (\cdots) \right] \right]
$$

where the maximization at each level is subject to its corresponding constraint; the innermost maximization is:

$$
\max_{\pi^{(\mathrm{T})}} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \Big[ \mathcal{R}(s_{\mathrm{T}},\ a_{\mathrm{T}}) \Big] \quad \mathrm{s.t.} \quad \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \Big[ -\ln \pi^{(\mathrm{T})}(a_{\mathrm{T}} \mid s_{\mathrm{T}}) \Big] \ge \mathcal{H}
$$

Here both the objective and the constraint are convex in the policy, so the constrained maximization can be converted into the corresponding dual problem:

$$
\min_{\alpha^{(\mathrm{T})}} \max_{\pi^{(\mathrm{T})}} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \Big[ \mathcal{R}(s_{\mathrm{T}},\ a_{\mathrm{T}}) - \alpha^{(\mathrm{T})} \ln \pi^{(\mathrm{T})}(a_{\mathrm{T}} \mid s_{\mathrm{T}}) - \alpha^{(\mathrm{T})} \mathcal{H} \Big]
$$

where the optimal policy $\pi_{\star}^{(\mathrm{T})}$ of the inner $\max_{\pi^{(\mathrm{T})}}$ is exactly the (soft) optimal policy under the temperature coefficient $\alpha^{(\mathrm{T})}$:

$$
\begin{aligned}
\pi_{\star}^{(\mathrm{T})} &= \argmax_{\pi^{(\mathrm{T})}} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \Big[ \mathcal{R}(s_{\mathrm{T}},\ a_{\mathrm{T}}) - \alpha^{(\mathrm{T})} \ln \pi^{(\mathrm{T})}(a_{\mathrm{T}} \mid s_{\mathrm{T}}) - \alpha^{(\mathrm{T})} \mathcal{H} \Big] \\[5mm]
&= \argmax_{\pi^{(\mathrm{T})}(\cdot \mid s_{\mathrm{T}})} \mathcal{E}_{a_{\mathrm{T}} \sim \pi^{(\mathrm{T})}(\cdot \mid s_{\mathrm{T}})} \Big[ q_{\star}^{(\mathrm{T})}(s_{\mathrm{T}},\ a_{\mathrm{T}}) - \alpha^{(\mathrm{T})} \ln \pi^{(\mathrm{T})}(a_{\mathrm{T}} \mid s_{\mathrm{T}}) \Big] \propto \exp \left[ \frac{1}{\alpha^{(\mathrm{T})}} q_{\star}^{(\mathrm{T})}(s_{\mathrm{T}},\ a_{\mathrm{T}}) \right]
\end{aligned}
$$
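
To see where the proportionality comes from, the inner objective can be rewritten (for a discrete action space) as a KL divergence plus a term that does not depend on the policy:

$$
\mathcal{E}_{a \sim \pi(\cdot \mid s)} \Big[ q(s,\ a) - \alpha \ln \pi(a \mid s) \Big] = -\alpha\, \mathrm{KL}\!\left( \pi(\cdot \mid s)\ \Big\|\ \frac{\exp[q(s,\ \cdot)/\alpha]}{Z(s)} \right) + \alpha \ln Z(s), \qquad Z(s) = \sum_{a} \exp[q(s,\ a)/\alpha]
$$

The KL term is the only part that depends on $\pi$, and it vanishes exactly when $\pi(\cdot \mid s)$ equals the normalized exponential, which gives the Boltzmann form above.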

Note that $\pi_{\star}^{(\mathrm{T})}$ here is a function of $\alpha^{(\mathrm{T})}$, which then yields the optimal temperature coefficient:

$$
\begin{aligned}
\alpha_{\star}^{(\mathrm{T})} &= \argmin_{\alpha^{(\mathrm{T})}} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \Big[ \mathcal{R}(s_{\mathrm{T}},\ a_{\mathrm{T}}) - \alpha^{(\mathrm{T})} \ln \pi_{\star}^{(\mathrm{T})}(a_{\mathrm{T}} \mid s_{\mathrm{T}}) - \alpha^{(\mathrm{T})} \mathcal{H} \Big] \\[5mm]
&= \argmin_{\alpha^{(\mathrm{T})}} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}} \sim \pi_{\star}^{(\mathrm{T})}(\cdot \mid s_{\mathrm{T}})} \Big[ - \alpha^{(\mathrm{T})} \ln \pi_{\star}^{(\mathrm{T})}(a_{\mathrm{T}} \mid s_{\mathrm{T}}) - \alpha^{(\mathrm{T})} \mathcal{H} \Big]
\end{aligned}
$$

Substituting the optimal coefficient $\alpha_{\star}^{(\mathrm{T})}$ and the corresponding (soft) optimal policy into the second-level maximization gives:

$$
\max_{\pi^{(\mathrm{T} - 1)}} \bigg\{ \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T} - 1}} \mathcal{E}_{a_{\mathrm{T} - 1}} \Big[ \mathcal{R}(s_{\mathrm{T} - 1},\ a_{\mathrm{T} - 1}) + \gamma \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \Big[ \mathcal{R}(s_{\mathrm{T}},\ a_{\mathrm{T}}) - \alpha_{\star}^{(\mathrm{T})} \ln \pi_{\star}^{(\mathrm{T})}(a_{\mathrm{T}} \mid s_{\mathrm{T}}) \Big] \Big] - \gamma \alpha_{\star}^{(\mathrm{T})} \mathcal{H} \bigg\}
$$

The (soft) optimal action-value function satisfies:

$$
q_{\star}^{(t)}(s_{t},\ a_{t}) = \mathcal{R}(s_{t},\ a_{t}) + \gamma \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi_{\star}^{(t + 1)}(\cdot \mid s_{t + 1})} \Big[ q_{\star}^{(t + 1)}(s_{t + 1},\ a_{t + 1}) - \alpha_{\star}^{(t + 1)} \ln \pi_{\star}^{(t + 1)}(a_{t + 1} \mid s_{t + 1}) \Big]
$$

Together with the boundary condition $q_{\star}^{(\mathrm{T})}(s_{\mathrm{T}},\ a_{\mathrm{T}}) = \mathcal{R}(s_{\mathrm{T}},\ a_{\mathrm{T}})$, the second-level maximization can be rewritten as:

$$
\max_{\pi^{(\mathrm{T} - 1)}} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T} - 1}} \mathcal{E}_{a_{\mathrm{T} - 1}} \Big[ q_{\star}^{(\mathrm{T} - 1)}(s_{\mathrm{T} - 1},\ a_{\mathrm{T} - 1}) \Big] \quad \mathrm{s.t.} \quad \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T} - 1}} \mathcal{E}_{a_{\mathrm{T} - 1}} \Big[ -\ln \pi^{(\mathrm{T} - 1)}(a_{\mathrm{T} - 1} \mid s_{\mathrm{T} - 1}) \Big] \ge \mathcal{H}
$$

The same argument then gives:

$$
\begin{gathered}
\pi_{\star}^{(\mathrm{T} - 1)} = \argmax_{\pi^{(\mathrm{T} - 1)}(\cdot \mid s_{\mathrm{T} - 1})} \mathcal{E}_{a_{\mathrm{T} - 1} \sim \pi^{(\mathrm{T} - 1)}(\cdot \mid s_{\mathrm{T} - 1})} \Big[ q_{\star}^{(\mathrm{T} - 1)}(s_{\mathrm{T} - 1},\ a_{\mathrm{T} - 1}) - \alpha^{(\mathrm{T} - 1)} \ln \pi^{(\mathrm{T} - 1)}(a_{\mathrm{T} - 1} \mid s_{\mathrm{T} - 1}) \Big] \\[5mm]
\alpha_{\star}^{(\mathrm{T} - 1)} = \argmin_{\alpha^{(\mathrm{T} - 1)}} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T} - 1}} \mathcal{E}_{a_{\mathrm{T} - 1} \sim \pi_{\star}^{(\mathrm{T} - 1)}(\cdot \mid s_{\mathrm{T} - 1})} \Big[ - \alpha^{(\mathrm{T} - 1)} \ln \pi_{\star}^{(\mathrm{T} - 1)}(a_{\mathrm{T} - 1} \mid s_{\mathrm{T} - 1}) - \alpha^{(\mathrm{T} - 1)} \mathcal{H} \Big]
\end{gathered}
$$

Repeating this procedure yields the optimal temperature coefficient $\alpha_{\star}^{(t)}$ and the corresponding (soft) optimal policy $\pi_{\star}^{(t)}$ at every planning time step. With a parameterized policy and under the convexity assumption, the truncated objective for the temperature coefficient can be optimized by dual gradient descent:

$$
J(\alpha) = \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \Big[ -\alpha \ln \pi_{\theta}(a \mid s) - \alpha \mathcal{H} \Big]
$$
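
A minimal PyTorch sketch of this dual update as it is commonly implemented: parameterize $\alpha = \exp(\log \alpha)$ so that it stays positive and take stochastic gradient steps on $J(\alpha)$ over freshly sampled actions. The learning rate and the `target_entropy` value are illustrative choices, not part of the derivation:

```python
import torch

log_alpha = torch.zeros(1, requires_grad=True)   # alpha = exp(log_alpha) > 0
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)
target_entropy = -1.0                            # the entropy floor H, a design choice

def update_alpha(log_pi_batch):
    """log_pi_batch: ln pi_theta(a|s) for actions freshly sampled from the actor."""
    alpha = log_alpha.exp()
    # J(alpha) = E[ -alpha * ln pi(a|s) - alpha * H ], with ln pi treated as a constant
    loss = -(alpha * (log_pi_batch.detach() + target_entropy)).mean()
    alpha_optimizer.zero_grad()
    loss.backward()
    alpha_optimizer.step()
    return alpha.detach()
```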

