Maximum Entropy RL


Basic Definitions

To keep the policy network's output probabilities from collapsing onto a single action and to balance exploration, the uncertainty of the policy distribution can be measured by its entropy:

$$\mathcal{H}(\pi \mid s) = -\mathcal{E}_{a \sim \pi(\cdot \mid s)} \ln \pi(a \mid s) = -\sum_{a} \pi(a \mid s) \ln \pi(a \mid s)$$
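
As a quick numerical illustration, here is a minimal Python sketch (assuming a small discrete action space) of this entropy: a near-deterministic policy has low entropy, while a uniform policy attains the maximum $\ln |\mathcal{A}|$.

```python
import numpy as np

def policy_entropy(probs) -> float:
    """Entropy H(pi | s) = -sum_a pi(a|s) * ln pi(a|s) of a discrete policy."""
    probs = np.asarray(probs, dtype=float)
    nz = probs > 0                      # treat 0 * ln 0 as 0
    return float(-np.sum(probs[nz] * np.log(probs[nz])))

print(policy_entropy([0.97, 0.01, 0.01, 0.01]))  # almost deterministic -> low entropy
print(policy_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform -> maximal entropy ln(4) ~ 1.386
```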

The optimization objective of the original RL problem is:

$$J(\pi) = \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{\mathrm{T}} \sim p(\cdot \mid s_{\mathrm{T} - 1},\ a_{\mathrm{T} - 1})} \mathcal{E}_{a_{\mathrm{T}} \sim \pi(\cdot \mid s_{\mathrm{T}})} \left[ \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{R}(s_{t},\ a_{t}) \right]$$
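
For later comparison, the discounted return inside this expectation can be estimated from a single sampled trajectory. A minimal sketch (the reward sequence is a hypothetical placeholder):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Single-trajectory estimate of sum_t gamma^t * R(s_t, a_t)."""
    rewards = np.asarray(rewards, dtype=float)
    return float(np.sum(gamma ** np.arange(len(rewards)) * rewards))

print(discounted_return([1.0, 0.5, 2.0]))   # 1.0 + 0.99*0.5 + 0.99**2 * 2.0
```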

Under the maximum entropy RL framework, an entropy regularization term is added to the objective to encourage exploration and reduce the chance of getting stuck in local optima:

$$\begin{aligned} \Omega(\pi) &= \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{\mathrm{T}} \sim p(\cdot \mid s_{\mathrm{T} - 1},\ a_{\mathrm{T} - 1})} \mathcal{E}_{a_{\mathrm{T}} \sim \pi(\cdot \mid s_{\mathrm{T}})} \left[ \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{H}(\pi \mid s_{t}) \right] \\[7mm] &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \Big[ \mathcal{H}(\pi \mid s_{t}) \Big] = \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \Big[ -\mathcal{E}_{a_{t} \sim \pi(\cdot \mid s_{t})} \ln \pi(a_{t} \mid s_{t}) \Big] \\[7mm] &= -\sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \Big[ \ln \pi(a_{t} \mid s_{t}) \Big] = -\mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \left[ \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \ln \pi(a_{t} \mid s_{t}) \right] \end{aligned}$$

The overall optimization objective is:

$$J_{\mathcal{H}}(\pi) = J(\pi) + \alpha \Omega(\pi) = \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \left[ \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \Big( \mathcal{R}(s_{t},\ a_{t}) - \alpha \ln \pi(a_{t} \mid s_{t}) \Big) \right]$$
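
The regularized return inside this expectation differs from the plain one only by the $-\alpha \ln \pi(a_t \mid s_t)$ bonus; setting $\alpha = 0$ recovers $J(\pi)$. A minimal single-trajectory sketch (the `rewards` and `log_probs` data are hypothetical placeholders):

```python
import numpy as np

def entropy_regularized_return(rewards, log_probs, gamma=0.99, alpha=0.2):
    """Single-trajectory estimate of sum_t gamma^t * (R(s_t,a_t) - alpha * ln pi(a_t|s_t))."""
    rewards = np.asarray(rewards, dtype=float)
    log_probs = np.asarray(log_probs, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * (rewards - alpha * log_probs)))

# Hypothetical per-step rewards and log pi(a_t | s_t) collected while acting.
print(entropy_regularized_return(rewards=[1.0, 0.5, 2.0], log_probs=[-1.2, -0.3, -0.9]))
```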

Soft Bellman Expectation Equation

To facilitate the subsequent derivation, define the entropy-regularized action-value function as:

$$q_{\mathcal{H}}^{(t)}(s_{t},\ a_{t}) = \mathcal{E}_{s_{t + 1}} \mathcal{E}_{a_{t + 1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \left[ \sum_{\tau = t}^{\mathrm{T}} \gamma^{\tau - t} \mathcal{R}(s_{\tau},\ a_{\tau}) - \alpha \sum_{\tau = t + 1}^{\mathrm{T}} \gamma^{\tau - t} \ln \pi(a_{\tau} \mid s_{\tau}) \right]$$

The $\ln \pi(a_{t} \mid s_{t})$ term is not included in this definition of the action value: once the action $a_{t}$ is already given in state $s_{t}$, this term no longer serves to encourage higher policy entropy. Similarly, define the entropy-regularized state-value function as:

$$v_{\mathcal{H}}^{(t)}(s_{t}) = \mathcal{E}_{a_{t}} \mathcal{E}_{s_{t + 1}} \mathcal{E}_{a_{t + 1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \left[ \sum_{\tau = t}^{\mathrm{T}} \gamma^{\tau - t} \mathcal{R}(s_{\tau},\ a_{\tau}) - \alpha \sum_{\tau = t}^{\mathrm{T}} \gamma^{\tau - t} \ln \pi(a_{\tau} \mid s_{\tau}) \right]$$

Expanding both definitions yields the (soft) Bellman expectation equation under MaxEnt RL:

$$q_{\mathcal{H}}^{(t)}(s_{t},\ a_{t}) = \mathcal{R}(s_{t},\ a_{t}) + \gamma \mathcal{E}_{s_{t + 1}} \Big[ v_{\mathcal{H}}^{(t + 1)}(s_{t + 1}) \Big] \qquad v_{\mathcal{H}}^{(t)}(s_{t}) = \alpha \mathcal{H}(\pi \mid s_{t}) + \mathcal{E}_{a_{t}} \Big[ q_{\mathcal{H}}^{(t)}(s_{t},\ a_{t}) \Big]$$

In the infinite-horizon setting, this can be rewritten as:

$$q_{\mathcal{H}}(s,\ a) = \mathcal{R}(s,\ a) + \gamma \mathcal{E}_{s'} \Big[ v_{\mathcal{H}}(s') \Big] \qquad v_{\mathcal{H}}(s) = \alpha \mathcal{H}(\pi \mid s) + \mathcal{E}_{a} \Big[ q_{\mathcal{H}}(s,\ a) \Big]$$
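
In tabular form these stationary equations become a single backup. Below is a minimal sketch under assumed array conventions (the names `R`, `P`, `pi` and their shapes are my own, not from the post):

```python
import numpy as np

def soft_policy_evaluation_step(q, R, P, pi, gamma=0.99, alpha=0.2):
    """One soft Bellman expectation backup for a fixed policy pi.

    q  : (S, A)    current soft action values
    R  : (S, A)    rewards R(s, a)
    P  : (S, A, S) transition probabilities p(s' | s, a)
    pi : (S, A)    policy pi(a | s)
    """
    # v_H(s) = alpha * H(pi|s) + E_a[q(s,a)] = E_a[q(s,a) - alpha * ln pi(a|s)]
    v = np.sum(pi * (q - alpha * np.log(pi + 1e-12)), axis=1)   # shape (S,)
    # q_H(s,a) = R(s,a) + gamma * E_{s'}[v_H(s')]
    return R + gamma * P @ v                                    # shape (S, A)
```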

Under this definition, the (soft) Bellman expectation operator $\mathscr{L}_{\pi} : \mathcal{Q} \mapsto \mathcal{Q}$ is:

$$\mathscr{L}_{\pi} \{ q_{\mathcal{H}} \} = \mathcal{R}(s,\ a) + \gamma \mathcal{E}_{s'} \Big[ v_{\mathcal{H}}(s') \Big] = \mathcal{R}(s,\ a) + \gamma \mathcal{E}_{s'} \mathcal{E}_{a'} \Big[ q_{\mathcal{H}}(s',\ a') - \alpha \ln \pi(a' \mid s') \Big]$$

Using an argument similar to the standard case, this operator can be shown to be a contraction mapping, so the (soft) Bellman expectation equation has a unique solution.
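
Because $\mathscr{L}_{\pi}$ is a $\gamma$-contraction, repeatedly applying the backup converges to that unique fixed point, i.e. soft policy evaluation. A usage sketch on a randomly generated MDP, reusing `soft_policy_evaluation_step` from the sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 5, 3
R = rng.normal(size=(S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))   # p(s' | s, a), each row sums to 1
pi = rng.dirichlet(np.ones(A), size=S)       # an arbitrary fixed stochastic policy

q = np.zeros((S, A))
while True:                                  # repeated application of L_pi
    q_next = soft_policy_evaluation_step(q, R, P, pi, gamma=0.9)
    if np.max(np.abs(q_next - q)) < 1e-10:   # sup-norm gap between successive iterates
        break
    q = q_next
# q now approximates the unique fixed point q_H of the soft expectation equation.
```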

Soft Bellman Optimality Equation

At time step $t$ of a finite-horizon problem, dynamic programming gives the relationship between the current optimal policy and the optimal action-value function:

$$\pi_{\star}^{(t)}(\cdot \mid s) = \argmax_{\pi} \Big[ \alpha \mathcal{H}(\pi \mid s) + \mathcal{E}_{a \sim \pi(\cdot \mid s)} q_{\star}^{(t)}(s,\ a) \Big] = \argmax_{\pi} \mathcal{E}_{a \sim \pi(\cdot \mid s)} \Big[ q_{\star}^{(t)}(s,\ a) - \alpha \ln \pi(a \mid s) \Big]$$

Consider the optimization problem

$$\begin{gathered} \max_{p} \sum_{x} p(x) \Big[ \phi(x) - \alpha \ln p(x) \Big] \\[5mm] \mathrm{s.t.} \quad \sum_{x} p(x) = 1 \quad 0 \le p(x) \le 1 \end{gathered} \quad \Rightarrow \quad \begin{gathered} \min_{p} \sum_{x} p(x) \Big[ \alpha \ln p(x) - \phi(x) \Big] \\[5mm] \mathrm{s.t.} \quad \sum_{x} p(x) - 1 = 0 \quad -p(x) \le 0 \quad p(x) - 1 \le 0 \end{gathered}$$

The objective and the inequality constraint functions are all convex, and the equality constraint is affine, so we construct the Lagrangian:

$$\mathcal{L}(p,\ \lambda,\ \mu,\ \nu) = \sum_{x} p(x) \Big[ \alpha \ln p(x) - \phi(x) \Big] + \lambda \left[ \sum_{x} p(x) - 1 \right] - \sum_{x} \mu(x) p(x) + \sum_{x} \nu(x) \Big[ p(x) - 1 \Big]$$

The optimal distribution must satisfy:

$$\frac{\partial \mathcal{L}}{\partial p(x)} = \alpha \Big[ \ln p(x) + 1 \Big] - \phi(x) + \lambda - \mu(x) + \nu(x) = 0$$

Solving for $p(x)$ gives:

$$p_{\star}(x) = \exp \left[ \frac{1}{\alpha} \Big( \phi(x) - \lambda + \mu(x) - \nu(x) \Big) - 1 \right] \propto \exp \left[ \frac{1}{\alpha} \phi(x) \right]$$

Under the optimal distribution, the objective attains its maximum:

$$J_{\star} = \sum_{x} p_{\star}(x) \Big[ \phi(x) - \alpha \ln p_{\star}(x) \Big] = \sum_{x} p_{\star}(x) \Big[ \phi(x) - \phi(x) + \alpha \ln Z \Big] = \alpha \ln Z = \underset{x}{\operatorname{softmax}_{\alpha}} \phi(x)$$

where the normalization factor is $Z = \sum_{x} \exp \left[ \dfrac{1}{\alpha} \phi(x) \right]$. Therefore, the current optimal policy is:

$$\pi_{\star}^{(t)}(a \mid s) = \frac{1}{Z} \exp \left[ \frac{1}{\alpha} q_{\star}^{(t)}(s,\ a) \right] \quad Z = \sum_{a} \exp \left[ \frac{1}{\alpha} q_{\star}^{(t)}(s,\ a) \right]$$
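
As a numerical sanity check on this derivation, the sketch below (arbitrary $\phi$ and $\alpha$, unrelated to any particular MDP) verifies that the Boltzmann distribution attains the optimum and that the optimal value equals $\alpha \ln Z$:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, phi = 0.5, rng.normal(size=6)

def objective(p):
    """sum_x p(x) * (phi(x) - alpha * ln p(x))"""
    return np.sum(p * (phi - alpha * np.log(p)))

p_star = np.exp(phi / alpha) / np.sum(np.exp(phi / alpha))    # Boltzmann distribution
softmax_alpha = alpha * np.log(np.sum(np.exp(phi / alpha)))   # alpha * ln Z

print(objective(p_star), softmax_alpha)       # both equal the optimal value
for _ in range(5):                            # random feasible distributions score no higher
    p = rng.dirichlet(np.ones(6))
    assert objective(p) <= objective(p_star) + 1e-9
```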

This yields the relationship between the optimal state-value function and the optimal action-value function, i.e., the (soft) Bellman optimality equation:

$$v_{\star}^{(t)}(s) = \alpha \mathcal{H}(\pi_{\star}^{(t)} \mid s) + \mathcal{E}_{a \sim \pi_{\star}^{(t)}(\cdot \mid s)} q_{\star}^{(t)}(s,\ a) = \underset{a}{\operatorname{softmax}_{\alpha}} q_{\star}^{(t)}(s,\ a)$$

Under this definition, the (soft) Bellman optimality operator $\mathscr{L} : \mathcal{Q} \mapsto \mathcal{Q}$ is:

$$\mathscr{L} \{ q_{\mathcal{H}} \} = \mathcal{R}(s,\ a) + \gamma \mathcal{E}_{s' \sim p(\cdot \mid s,\ a)} \left[ \underset{b}{\operatorname{softmax}_{\alpha}} q_{\mathcal{H}}(s',\ b) \right] = \mathcal{R}(s,\ a) + \gamma \mathcal{E}_{s' \sim p(\cdot \mid s,\ a)} \left[ \alpha \ln \sum_{b} \exp \left[ \frac{1}{\alpha} q_{\mathcal{H}}(s',\ b) \right] \right]$$
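
In the tabular setting this operator is just the reward plus a discounted expected log-sum-exp over next-state action values. A minimal sketch (same hypothetical `R`, `P` conventions as earlier, using SciPy's `logsumexp` for numerical stability):

```python
import numpy as np
from scipy.special import logsumexp

def soft_bellman_optimality_step(q, R, P, gamma=0.99, alpha=0.2):
    """One application of the soft Bellman optimality operator L on a tabular q.

    q : (S, A), R : (S, A), P : (S, A, S) with P[s, a, s'] = p(s' | s, a).
    """
    # softmax_alpha over actions: alpha * ln sum_b exp(q(s', b) / alpha)
    v = alpha * logsumexp(q / alpha, axis=1)   # shape (S,)
    return R + gamma * P @ v                   # shape (S, A)
```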

To prove that $\mathscr{L}$ is a contraction mapping, first observe that:

$$\exp \left[ \frac{1}{\alpha} q_{1}(s,\ a) - \frac{1}{\alpha} q_{2}(s,\ a) \right] \le \exp \left[ \frac{1}{\alpha} \max_{s,\ a} \Big| q_{1}(s,\ a) - q_{2}(s,\ a) \Big| \right] = \exp \left[ \frac{1}{\alpha} \Big\| q_{1} - q_{2} \Big\|_{\infty} \right]$$

Therefore:

$$\exp \left[ \frac{1}{\alpha} q_{1}(s,\ a) \right] \le \exp \left[ \frac{1}{\alpha} q_{2}(s,\ a) + \frac{1}{\alpha} \Big\| q_{1} - q_{2} \Big\|_{\infty} \right]$$

It follows that:

$$\begin{aligned} \underset{a}{\operatorname{softmax}_{\alpha}} q_{1}(s,\ a) &= \alpha \ln \sum_{a} \exp \left[ \frac{1}{\alpha} q_{1}(s,\ a) \right] \le \alpha \ln \sum_{a} \exp \left[ \frac{1}{\alpha} q_{2}(s,\ a) + \frac{1}{\alpha} \Big\| q_{1} - q_{2} \Big\|_{\infty} \right] \\[7mm] &= \alpha \ln \exp \left[ \frac{1}{\alpha} \Big\| q_{1} - q_{2} \Big\|_{\infty} \right] + \alpha \ln \sum_{a} \exp \left[ \frac{1}{\alpha} q_{2}(s,\ a) \right] = \Big\| q_{1} - q_{2} \Big\|_{\infty} + \underset{a}{\operatorname{softmax}_{\alpha}} q_{2}(s,\ a) \end{aligned}$$

A symmetric argument gives:

$$\underset{a}{\operatorname{softmax}_{\alpha}} q_{1}(s,\ a) \ge -\Big\| q_{1} - q_{2} \Big\|_{\infty} + \underset{a}{\operatorname{softmax}_{\alpha}} q_{2}(s,\ a)$$

Therefore:

$$\begin{aligned} \Big| \mathscr{L} \{ q_{1} \}(s,\ a) - \mathscr{L} \{ q_{2} \}(s,\ a) \Big| &= \left| \gamma \sum_{s'} p(s' \mid s,\ a) \left[ \underset{b}{\operatorname{softmax}_{\alpha}} q_{1}(s',\ b) - \underset{b}{\operatorname{softmax}_{\alpha}} q_{2}(s',\ b) \right] \right| \\[7mm] &\le \gamma \sum_{s'} p(s' \mid s,\ a) \left| \underset{b}{\operatorname{softmax}_{\alpha}} q_{1}(s',\ b) - \underset{b}{\operatorname{softmax}_{\alpha}} q_{2}(s',\ b) \right| \\[7mm] &\le \gamma \sum_{s'} p(s' \mid s,\ a) \Big\| q_{1} - q_{2} \Big\|_{\infty} = \gamma \Big\| q_{1} - q_{2} \Big\|_{\infty} \end{aligned}$$
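
This bound can also be checked empirically before taking the supremum over $(s,\ a)$. A self-contained sketch (random MDP, random $q_1$, $q_2$; the operator is re-implemented inline for brevity):

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(2)
S, A, gamma, alpha = 5, 3, 0.9, 0.2
R = rng.normal(size=(S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))          # p(s' | s, a)

def L(q):
    """Soft Bellman optimality operator on a tabular q of shape (S, A)."""
    return R + gamma * P @ (alpha * logsumexp(q / alpha, axis=1))

for _ in range(100):
    q1, q2 = rng.normal(size=(S, A)), rng.normal(size=(S, A))
    # elementwise |L{q1} - L{q2}| never exceeds gamma * ||q1 - q2||_inf
    assert np.max(np.abs(L(q1) - L(q2))) <= gamma * np.max(np.abs(q1 - q2)) + 1e-9
```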

Taking the supremum over $(s,\ a)$ shows that $\mathscr{L}$ is a contraction mapping, $\| \mathscr{L} \{ q_{1} \} - \mathscr{L} \{ q_{2} \} \|_{\infty} \le \gamma \| q_{1} - q_{2} \|_{\infty}$, so the (soft) Bellman optimality equation has a unique solution $q_{\star}$, namely the optimal action-value function of the infinite-horizon problem. The infinite-horizon optimal policy then satisfies:

$$\pi_{\star}(a \mid s) = \frac{1}{Z} \exp \left[ \frac{1}{\alpha} q_{\star}(s,\ a) \right] = \exp \left[ \frac{1}{\alpha} q_{\star}(s,\ a) - \frac{1}{\alpha} v_{\star}(s) \right] \propto \exp \left[ \frac{1}{\alpha} q_{\star}(s,\ a) \right]$$
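
Once $q_{\star}$ has been computed, e.g. as the fixed point of repeated optimality backups, the optimal policy is an action-wise softmax with temperature $\alpha$. A minimal sketch (the helper name is hypothetical):

```python
import numpy as np
from scipy.special import logsumexp

def soft_optimal_policy(q_star, alpha=0.2):
    """pi_*(a|s) = exp((q_*(s,a) - v_*(s)) / alpha), v_*(s) = alpha * logsumexp(q_*(s,.)/alpha)."""
    v_star = alpha * logsumexp(q_star / alpha, axis=1, keepdims=True)   # shape (S, 1)
    return np.exp((q_star - v_star) / alpha)                            # each row sums to 1

pi_star = soft_optimal_policy(np.array([[1.0, 2.0, 0.5],
                                        [0.0, 0.0, 3.0]]))
print(pi_star, pi_star.sum(axis=1))   # rows are valid action distributions
```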

