MADDPG

The MA-A2C algorithm is restricted to discrete control. Following a similar idea, DDPG can be extended to the multi-agent setting, where the objective to be optimized for agent $i$ is:

$$
J_{i}(\mu) = \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ \mu(s_{0}))} \cdots \mathcal{E}_{s_{\mathrm{T}} \sim p(\cdot \mid s_{\mathrm{T} - 1},\ \mu(s_{\mathrm{T} - 1}))} \left[ \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{R}(s_{t},\ \mu(s_{t})) \right]
$$

Here each agent has its own deterministic policy, and together they form a mapping $\mathcal{S} \mapsto \mathcal{A}$ from states to joint actions:

$$
a_{t} = \mu(s_{t}) = \{ a_{t}^{1},\ a_{t}^{2},\ \cdots,\ a_{t}^{n} \} = \{ \mu_{1}(s_{t}),\ \mu_{2}(s_{t}),\ \cdots,\ \mu_{n}(s_{t}) \}
$$
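As a minimal illustration (not part of the original derivation), the per-agent deterministic policies can be parameterized as small networks whose outputs together form the joint action; the PyTorch module below and the dimensions `obs_dim`/`act_dim` are assumptions made only for this sketch.

```python
import torch
import torch.nn as nn

class DeterministicActor(nn.Module):
    """One agent's deterministic policy mu_i(. | theta_i)."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # bounded continuous action
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

# joint action a_t = {mu_1(s_t), ..., mu_n(s_t)}
n_agents, obs_dim, act_dim = 3, 8, 2
actors = [DeterministicActor(obs_dim, act_dim) for _ in range(n_agents)]
state = torch.randn(1, obs_dim)                 # toy state s_t
joint_action = [mu(state) for mu in actors]     # one action tensor per agent
```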

With parameterized policy networks $\mu_{i}(s \mid \theta_{i})$ and value networks $q_{i}(s,\ a^{1},\ a^{2},\ \cdots,\ a^{n} \mid w_{i})$, the corresponding deterministic policy gradient is:

$$
\begin{aligned}
\nabla_{\theta_{i}} J_{i}(\theta_{i}) &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{s_{1}} \cdots \mathcal{E}_{s_{t}} \nabla_{\theta_{i}} q_{i}^{(t)} \Big[ s_{t},\ \mu_{1}(s_{t} \mid \theta_{1}),\ \cdots,\ \mu_{i}(s_{t} \mid \theta_{i}),\ \cdots,\ \mu_{n}(s_{t} \mid \theta_{n}) \Big] \\[7mm]
&\approx \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{s_{1}} \cdots \mathcal{E}_{s_{t}} \nabla_{\theta_{i}} q_{i} \Big[ s_{t},\ \mu_{1}(s_{t} \mid \theta_{1}),\ \cdots,\ \mu_{i}(s_{t} \mid \theta_{i}),\ \cdots,\ \mu_{n}(s_{t} \mid \theta_{n})\ \Big|\ w_{i} \Big] \\[7mm]
&= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{s_{1}} \cdots \mathcal{E}_{s_{t}} \Big[ \nabla_{\theta_{i}} \mu_{i}(s_{t} \mid \theta_{i}) \nabla_{a_{t}^{i}} q_{i}(s_{t},\ a_{t}^{1},\ \cdots,\ a_{t}^{i},\ \cdots,\ a_{t}^{n} \mid w_{i}) \bigg|_{\forall\ j\ :\ a_{t}^{j} = \mu_{j}(s_{t} \mid \theta_{j})} \Big]
\end{aligned}
$$
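In code this chain rule does not have to be written out by hand: evaluating a centralized critic on the joint action while keeping only agent $i$'s action differentiable lets autograd produce $\nabla_{\theta_{i}} \mu_{i} \cdot \nabla_{a^{i}} q_{i}$. The sketch below reuses the actor modules from the previous snippet; `CentralCritic` and `actor_loss` are illustrative names, not a reference implementation.

```python
import torch
import torch.nn as nn

class CentralCritic(nn.Module):
    """Centralized value network q_i(s, a^1, ..., a^n | w_i)."""
    def __init__(self, state_dim: int, joint_act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + joint_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, joint_action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, joint_action], dim=-1))

def actor_loss(i, actors, critic_i, state):
    """Surrogate loss whose gradient w.r.t. theta_i is the deterministic policy gradient."""
    actions = []
    for j, mu_j in enumerate(actors):
        a_j = mu_j(state)
        # other agents' actions are treated as constants when differentiating w.r.t. theta_i
        actions.append(a_j if j == i else a_j.detach())
    q = critic_i(state, torch.cat(actions, dim=-1))
    return -q.mean()  # gradient ascent on q_i == gradient descent on -q_i
```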

Likewise, to balance exploration, the behavior policy $\pi_{i}(\cdot \mid s)$ used to sample trajectories adds random noise $\xi_{i}$ on top of $\mu_{i}(s \mid \theta_{i})$, and the gradient is then estimated along trajectories generated by this behavior policy:

$$
\nabla_{\theta_{i}} J_{i}(\theta_{i}) \approx \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{t} \sim p(\cdot \mid s_{t - 1},\ a_{t - 1})} \nabla_{\theta_{i}} q_{i} \Big[ s_{t},\ \mu(s_{t} \mid \theta)\ \Big|\ w_{i} \Big]
$$
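A small sketch of such a behavior policy, assuming Gaussian noise and actions bounded in $[-1, 1]$ (both are illustrative choices rather than something fixed by the derivation):

```python
import torch

def behavior_action(mu_i, obs, noise_std: float = 0.1,
                    low: float = -1.0, high: float = 1.0) -> torch.Tensor:
    """pi_i(. | s): the deterministic action mu_i(s | theta_i) plus exploration noise xi_i."""
    with torch.no_grad():
        a = mu_i(obs)                              # deterministic part
        a = a + noise_std * torch.randn_like(a)    # random noise xi_i
    return a.clamp(low, high)                      # keep the action in its valid range
```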

In this off-policy form, an experience replay mechanism can be introduced, and on a randomly sampled experience tuple $(s_{t},\ a_{t},\ r_{t+1},\ s_{t+1})$ the deterministic policy gradient is approximated as:

$$
\nabla_{\theta_{i}} J_{i}(\theta_{i}) \approx \nabla_{\theta_{i}} q_{i} \Big[ s_{t},\ \mu_{1}(s_{t} \mid \theta_{1}),\ \cdots,\ \mu_{i}(s_{t} \mid \theta_{i}),\ \cdots,\ \mu_{n}(s_{t} \mid \theta_{n})\ \Big|\ w_{i} \Big]
$$
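The replay mechanism itself can be as simple as the sketch below, which stores joint transition tuples $(s_{t},\ a_{t},\ r_{t+1},\ s_{t+1})$ (with one reward per agent) and samples uniform minibatches; the container layout is an assumption made for illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores joint transitions (s_t, a_t, r_{t+1}, s_{t+1}) for centralized training."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, joint_action, rewards, next_state):
        # `rewards` holds one scalar r^i_{t+1} per agent
        self.buffer.append((state, joint_action, rewards, next_state))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self) -> int:
        return len(self.buffer)
```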

At the same time, the squared TD error is used as the loss function to update the value network:

$$
\ell_{i}(w_{i}) = \frac{1}{2} \Big[ \delta_{t}^{i} \Big]^{2} = \frac{1}{2} \Big[ r_{t + 1}^{i} + \gamma q_{i} \Big[ s_{t + 1},\ \mu(s_{t + 1} \mid \theta)\ \Big|\ w_{i} \Big] - q_{i}(s_{t},\ a_{t} \mid w_{i}) \Big]^{2}
\ \Rightarrow\ 
\nabla_{w_{i}} \ell_{i}(w_{i}) = -\delta_{t}^{i} \nabla_{w_{i}} q_{i}(s_{t},\ a_{t} \mid w_{i})
$$
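A sketch of this critic update for agent $i$ on a collated minibatch of tensors: the TD target is computed under `torch.no_grad()` so it is treated as a constant, and minimizing the squared TD error then reproduces $-\delta_{t}^{i} \nabla_{w_{i}} q_{i}$. For brevity the target networks used in standard MADDPG are omitted, which is a simplifying assumption.

```python
import torch
import torch.nn.functional as F

def critic_loss(i, critic_i, actors, batch, gamma: float = 0.99):
    """0.5 * (delta_t^i)^2 for agent i, averaged over a sampled minibatch."""
    state, joint_action, rewards, next_state = batch        # tensors of shape [B, ...]
    with torch.no_grad():                                    # the TD target is a constant
        next_joint = torch.cat([mu(next_state) for mu in actors], dim=-1)
        target = rewards[:, i:i + 1] + gamma * critic_i(next_state, next_joint)
    q = critic_i(state, joint_action)                        # q_i(s_t, a_t | w_i)
    return 0.5 * F.mse_loss(q, target)                       # squared TD error
```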

Under local (partial) observation, MADDPG is implemented with the CTDE scheme (centralized training, decentralized execution), alternately updating the value parameters and the policy parameters online:

$$
\begin{gathered}
w_{1} \leftarrow w_{1} + \alpha \delta_{t}^{1} \nabla_{w_{1}} q_{1}(s_{t},\ a_{t} \mid w_{1}) \\[5mm]
w_{2} \leftarrow w_{2} + \alpha \delta_{t}^{2} \nabla_{w_{2}} q_{2}(s_{t},\ a_{t} \mid w_{2}) \\[5mm]
\vdots \\[5mm]
w_{n} \leftarrow w_{n} + \alpha \delta_{t}^{n} \nabla_{w_{n}} q_{n}(s_{t},\ a_{t} \mid w_{n})
\end{gathered}
$$

$$
\begin{gathered}
\theta_{1} \leftarrow \theta_{1} + \beta \nabla_{\theta_{1}} q_{1} \Big[ s_{t},\ \mu_{1}(o_{t}^{1} \mid \theta_{1}),\ \mu_{2}(o_{t}^{2} \mid \theta_{2}),\ \cdots,\ \mu_{n}(o_{t}^{n} \mid \theta_{n})\ \Big|\ w_{1} \Big] \\[5mm]
\theta_{2} \leftarrow \theta_{2} + \beta \nabla_{\theta_{2}} q_{2} \Big[ s_{t},\ \mu_{1}(o_{t}^{1} \mid \theta_{1}),\ \mu_{2}(o_{t}^{2} \mid \theta_{2}),\ \cdots,\ \mu_{n}(o_{t}^{n} \mid \theta_{n})\ \Big|\ w_{2} \Big] \\[5mm]
\vdots \\[5mm]
\theta_{n} \leftarrow \theta_{n} + \beta \nabla_{\theta_{n}} q_{n} \Big[ s_{t},\ \mu_{1}(o_{t}^{1} \mid \theta_{1}),\ \mu_{2}(o_{t}^{2} \mid \theta_{2}),\ \cdots,\ \mu_{n}(o_{t}^{n} \mid \theta_{n})\ \Big|\ w_{n} \Big]
\end{gathered}
$$
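Combining the pieces above, one alternating CTDE training step might look like the sketch below, reusing the illustrative `critic_loss` and `actor_loss` from earlier; optimizer objects stand in for the plain gradient steps with rates $\alpha$ and $\beta$, and the batch is assumed to be already collated into tensors. With partial observability, each $\mu_{j}$ would take its own observation $o_{t}^{j}$ instead of $s_{t}$.

```python
def train_step(actors, critics, actor_opts, critic_opts, batch, gamma: float = 0.99):
    """One alternating CTDE update: first all value networks, then all policy networks."""
    state, joint_action, rewards, next_state = batch
    n = len(actors)

    # value updates: w_i <- w_i + alpha * delta_t^i * grad_{w_i} q_i
    for i in range(n):
        loss_w = critic_loss(i, critics[i], actors, batch, gamma)
        critic_opts[i].zero_grad()
        loss_w.backward()
        critic_opts[i].step()

    # policy updates: theta_i <- theta_i + beta * grad_{theta_i} q_i
    for i in range(n):
        loss_theta = actor_loss(i, actors, critics[i], state)
        actor_opts[i].zero_grad()
        loss_theta.backward()
        actor_opts[i].step()
```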

After deployment, each agent makes decisions independently from its local observation:

$$
a_{t}^{i} = \mu_{i}(o_{t}^{i} \mid \theta_{i})
$$
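A minimal sketch of this execution phase: only the per-agent actors and local observations are needed, and the centralized critics are discarded after training.

```python
import torch

@torch.no_grad()
def decentralized_act(actors, local_obs):
    """a_t^i = mu_i(o_t^i | theta_i): each agent acts from its own observation only."""
    return [mu_i(o_i) for mu_i, o_i in zip(actors, local_obs)]
```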

The improvements from TD3, namely Clipped Double Q, action-selection noise, and a reduced policy-update frequency, can also be incorporated into MADDPG to mitigate the overestimation problem.
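For instance, the Clipped Double Q part could replace the TD target in the critic update with the minimum of two critics evaluated at noise-smoothed target actions, as in the sketch below; the twin critics, target actors, and noise scales are illustrative assumptions, and delayed policy updates would simply skip the actor step on most iterations.

```python
import torch

def clipped_double_q_target(i, critic_a, critic_b, target_actors, rewards, next_state,
                            gamma: float = 0.99, noise_std: float = 0.2,
                            noise_clip: float = 0.5):
    """TD3-style target for agent i: smoothed target actions + the smaller of two critics."""
    with torch.no_grad():
        next_actions = []
        for mu_j in target_actors:
            a_j = mu_j(next_state)
            noise = (noise_std * torch.randn_like(a_j)).clamp(-noise_clip, noise_clip)
            next_actions.append((a_j + noise).clamp(-1.0, 1.0))   # target-action noise
        next_joint = torch.cat(next_actions, dim=-1)
        q_min = torch.min(critic_a(next_state, next_joint),       # Clipped Double Q:
                          critic_b(next_state, next_joint))       # keep the smaller estimate
        return rewards[:, i:i + 1] + gamma * q_min
```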

