MADDPG

The MA-A2C algorithm is restricted to discrete control. Following a similar idea, DDPG can be extended to the multi-agent setting, where the objective to be optimized for agent $i$ is:

$$
J_{i}(\mu) = \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ \mu(s_{0}))} \cdots \mathcal{E}_{s_{\mathrm{T}} \sim p(\cdot \mid s_{\mathrm{T} - 1},\ \mu(s_{\mathrm{T} - 1}))} \left[ \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{R}(s_{t},\ \mu(s_{t})) \right]
$$

Here each agent has its own deterministic policy, and together they form a mapping $\mathcal{S} \mapsto \mathcal{A}$ from states to joint actions:

$$
a_{t} = \mu(s_{t}) = \{ a_{t}^{1},\ a_{t}^{2},\ \cdots,\ a_{t}^{n} \} = \{ \mu_{1}(s_{t}),\ \mu_{2}(s_{t}),\ \cdots,\ \mu_{n}(s_{t}) \}
$$
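As a minimal illustration (not part of the original derivation), the per-agent deterministic policies can be parameterized as small networks whose outputs together form the joint action; the PyTorch module below and the dimensions `obs_dim`/`act_dim` are assumptions made only for this sketch.

```python
import torch
import torch.nn as nn

class DeterministicActor(nn.Module):
    """One agent's deterministic policy mu_i(. | theta_i)."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # bounded continuous action
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

# joint action a_t = {mu_1(s_t), ..., mu_n(s_t)}
n_agents, obs_dim, act_dim = 3, 8, 2
actors = [DeterministicActor(obs_dim, act_dim) for _ in range(n_agents)]
state = torch.randn(1, obs_dim)                 # toy state s_t
joint_action = [mu(state) for mu in actors]     # one action tensor per agent
```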

With parameterized policy networks $\mu_{i}(s \mid \theta_{i})$ and value networks $q_{i}(s,\ a^{1},\ a^{2},\ \cdots,\ a^{n} \mid w_{i})$, the corresponding deterministic policy gradient is:

$$
\begin{aligned}
\nabla_{\theta_{i}} J_{i}(\theta_{i}) &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{s_{1}} \cdots \mathcal{E}_{s_{t}} \nabla_{\theta_{i}} q_{i}^{(t)} \Big[ s_{t},\ \mu_{1}(s_{t} \mid \theta_{1}),\ \cdots,\ \mu_{i}(s_{t} \mid \theta_{i}),\ \cdots,\ \mu_{n}(s_{t} \mid \theta_{n}) \Big] \\[7mm]
&\approx \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{s_{1}} \cdots \mathcal{E}_{s_{t}} \nabla_{\theta_{i}} q_{i} \Big[ s_{t},\ \mu_{1}(s_{t} \mid \theta_{1}),\ \cdots,\ \mu_{i}(s_{t} \mid \theta_{i}),\ \cdots,\ \mu_{n}(s_{t} \mid \theta_{n})\ \Big|\ w_{i} \Big] \\[7mm]
&= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{s_{1}} \cdots \mathcal{E}_{s_{t}} \Big[ \nabla_{\theta_{i}} \mu_{i}(s_{t} \mid \theta_{i}) \nabla_{a_{t}^{i}} q_{i}(s_{t},\ a_{t}^{1},\ \cdots,\ a_{t}^{i},\ \cdots,\ a_{t}^{n} \mid w_{i}) \bigg|_{\forall\ j\ :\ a_{t}^{j} = \mu_{j}(s_{t} \mid \theta_{j})} \Big]
\end{aligned}
$$
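In code this chain rule does not have to be written out by hand: evaluating a centralized critic on the joint action while keeping only agent $i$'s action differentiable lets autograd produce $\nabla_{\theta_{i}} \mu_{i} \cdot \nabla_{a^{i}} q_{i}$. The sketch below reuses the actor modules from the previous snippet; `CentralCritic` and `actor_loss` are illustrative names, not a reference implementation.

```python
import torch
import torch.nn as nn

class CentralCritic(nn.Module):
    """Centralized value network q_i(s, a^1, ..., a^n | w_i)."""
    def __init__(self, state_dim: int, joint_act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + joint_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, joint_action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, joint_action], dim=-1))

def actor_loss(i, actors, critic_i, state):
    """Surrogate loss whose gradient w.r.t. theta_i is the deterministic policy gradient."""
    actions = []
    for j, mu_j in enumerate(actors):
        a_j = mu_j(state)
        # other agents' actions are treated as constants when differentiating w.r.t. theta_i
        actions.append(a_j if j == i else a_j.detach())
    q = critic_i(state, torch.cat(actions, dim=-1))
    return -q.mean()  # gradient ascent on q_i == gradient descent on -q_i
```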

Likewise, to balance exploration, the behavior policy $\pi_{i}(\cdot \mid s)$ used to sample trajectories adds random noise $\xi_{i}$ on top of $\mu_{i}(s \mid \theta_{i})$, and the gradient is then estimated along trajectories generated by this behavior policy:

$$
\nabla_{\theta_{i}} J_{i}(\theta_{i}) \approx \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{t} \sim p(\cdot \mid s_{t - 1},\ a_{t - 1})} \nabla_{\theta_{i}} q_{i} \Big[ s_{t},\ \mu(s_{t} \mid \theta)\ \Big|\ w_{i} \Big]
$$
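A small sketch of such a behavior policy, assuming Gaussian noise and actions bounded in $[-1, 1]$ (both are illustrative choices rather than something fixed by the derivation):

```python
import torch

def behavior_action(mu_i, obs, noise_std: float = 0.1,
                    low: float = -1.0, high: float = 1.0) -> torch.Tensor:
    """pi_i(. | s): the deterministic action mu_i(s | theta_i) plus exploration noise xi_i."""
    with torch.no_grad():
        a = mu_i(obs)                              # deterministic part
        a = a + noise_std * torch.randn_like(a)    # random noise xi_i
    return a.clamp(low, high)                      # keep the action in its valid range
```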

In this off-policy form, an experience replay mechanism can be introduced, and on a randomly sampled experience tuple $(s_{t},\ a_{t},\ r_{t+1},\ s_{t+1})$ the deterministic policy gradient is approximated as:

$$
\nabla_{\theta_{i}} J_{i}(\theta_{i}) \approx \nabla_{\theta_{i}} q_{i} \Big[ s_{t},\ \mu_{1}(s_{t} \mid \theta_{1}),\ \cdots,\ \mu_{i}(s_{t} \mid \theta_{i}),\ \cdots,\ \mu_{n}(s_{t} \mid \theta_{n})\ \Big|\ w_{i} \Big]
$$
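The replay mechanism itself can be as simple as the sketch below, which stores joint transition tuples $(s_{t},\ a_{t},\ r_{t+1},\ s_{t+1})$ (with one reward per agent) and samples uniform minibatches; the container layout is an assumption made for illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores joint transitions (s_t, a_t, r_{t+1}, s_{t+1}) for centralized training."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, joint_action, rewards, next_state):
        # `rewards` holds one scalar r^i_{t+1} per agent
        self.buffer.append((state, joint_action, rewards, next_state))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self) -> int:
        return len(self.buffer)
```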

At the same time, the squared TD error is used as the loss function to update the value network:

$$
\ell_{i}(w_{i}) = \frac{1}{2} \Big[ \delta_{t}^{i} \Big]^{2} = \frac{1}{2} \Big[ r_{t + 1}^{i} + \gamma q_{i} \Big[ s_{t + 1},\ \mu(s_{t + 1} \mid \theta)\ \Big|\ w_{i} \Big] - q_{i}(s_{t},\ a_{t} \mid w_{i}) \Big]^{2}
\ \Rightarrow\ 
\nabla_{w_{i}} \ell_{i}(w_{i}) = -\delta_{t}^{i} \nabla_{w_{i}} q_{i}(s_{t},\ a_{t} \mid w_{i})
$$
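A sketch of this critic update for agent $i$ on a collated minibatch of tensors: the TD target is computed under `torch.no_grad()` so it is treated as a constant, and minimizing the squared TD error then reproduces $-\delta_{t}^{i} \nabla_{w_{i}} q_{i}$. For brevity the target networks used in standard MADDPG are omitted, which is a simplifying assumption.

```python
import torch
import torch.nn.functional as F

def critic_loss(i, critic_i, actors, batch, gamma: float = 0.99):
    """0.5 * (delta_t^i)^2 for agent i, averaged over a sampled minibatch."""
    state, joint_action, rewards, next_state = batch        # tensors of shape [B, ...]
    with torch.no_grad():                                    # the TD target is a constant
        next_joint = torch.cat([mu(next_state) for mu in actors], dim=-1)
        target = rewards[:, i:i + 1] + gamma * critic_i(next_state, next_joint)
    q = critic_i(state, joint_action)                        # q_i(s_t, a_t | w_i)
    return 0.5 * F.mse_loss(q, target)                       # squared TD error
```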

Under local (partial) observation, MADDPG is implemented with the CTDE scheme (centralized training, decentralized execution), alternately updating the value parameters and the policy parameters online:

$$
\begin{gathered}
w_{1} \leftarrow w_{1} + \alpha \delta_{t}^{1} \nabla_{w_{1}} q_{1}(s_{t},\ a_{t} \mid w_{1}) \\[5mm]
w_{2} \leftarrow w_{2} + \alpha \delta_{t}^{2} \nabla_{w_{2}} q_{2}(s_{t},\ a_{t} \mid w_{2}) \\[5mm]
\vdots \\[5mm]
w_{n} \leftarrow w_{n} + \alpha \delta_{t}^{n} \nabla_{w_{n}} q_{n}(s_{t},\ a_{t} \mid w_{n})
\end{gathered}
$$

$$
\begin{gathered}
\theta_{1} \leftarrow \theta_{1} + \beta \nabla_{\theta_{1}} q_{1} \Big[ s_{t},\ \mu_{1}(o_{t}^{1} \mid \theta_{1}),\ \mu_{2}(o_{t}^{2} \mid \theta_{2}),\ \cdots,\ \mu_{n}(o_{t}^{n} \mid \theta_{n})\ \Big|\ w_{1} \Big] \\[5mm]
\theta_{2} \leftarrow \theta_{2} + \beta \nabla_{\theta_{2}} q_{2} \Big[ s_{t},\ \mu_{1}(o_{t}^{1} \mid \theta_{1}),\ \mu_{2}(o_{t}^{2} \mid \theta_{2}),\ \cdots,\ \mu_{n}(o_{t}^{n} \mid \theta_{n})\ \Big|\ w_{2} \Big] \\[5mm]
\vdots \\[5mm]
\theta_{n} \leftarrow \theta_{n} + \beta \nabla_{\theta_{n}} q_{n} \Big[ s_{t},\ \mu_{1}(o_{t}^{1} \mid \theta_{1}),\ \mu_{2}(o_{t}^{2} \mid \theta_{2}),\ \cdots,\ \mu_{n}(o_{t}^{n} \mid \theta_{n})\ \Big|\ w_{n} \Big]
\end{gathered}
$$
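Combining the pieces above, one alternating CTDE training step might look like the sketch below, reusing the illustrative `critic_loss` and `actor_loss` from earlier; optimizer objects stand in for the plain gradient steps with rates $\alpha$ and $\beta$, and the batch is assumed to be already collated into tensors. With partial observability, each $\mu_{j}$ would take its own observation $o_{t}^{j}$ instead of $s_{t}$.

```python
def train_step(actors, critics, actor_opts, critic_opts, batch, gamma: float = 0.99):
    """One alternating CTDE update: first all value networks, then all policy networks."""
    state, joint_action, rewards, next_state = batch
    n = len(actors)

    # value updates: w_i <- w_i + alpha * delta_t^i * grad_{w_i} q_i
    for i in range(n):
        loss_w = critic_loss(i, critics[i], actors, batch, gamma)
        critic_opts[i].zero_grad()
        loss_w.backward()
        critic_opts[i].step()

    # policy updates: theta_i <- theta_i + beta * grad_{theta_i} q_i
    for i in range(n):
        loss_theta = actor_loss(i, actors, critics[i], state)
        actor_opts[i].zero_grad()
        loss_theta.backward()
        actor_opts[i].step()
```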

After deployment, each agent makes decisions independently from its local observation:

$$
a_{t}^{i} = \mu_{i}(o_{t}^{i} \mid \theta_{i})
$$
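A minimal sketch of this execution phase: only the per-agent actors and local observations are needed, and the centralized critics are discarded after training.

```python
import torch

@torch.no_grad()
def decentralized_act(actors, local_obs):
    """a_t^i = mu_i(o_t^i | theta_i): each agent acts from its own observation only."""
    return [mu_i(o_i) for mu_i, o_i in zip(actors, local_obs)]
```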

The improvements from TD3, namely Clipped Double Q, action-selection noise, and a reduced policy-update frequency, can also be incorporated into MADDPG to mitigate the overestimation problem.
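For instance, the Clipped Double Q part could replace the TD target in the critic update with the minimum of two critics evaluated at noise-smoothed target actions, as in the sketch below; the twin critics, target actors, and noise scales are illustrative assumptions, and delayed policy updates would simply skip the actor step on most iterations.

```python
import torch

def clipped_double_q_target(i, critic_a, critic_b, target_actors, rewards, next_state,
                            gamma: float = 0.99, noise_std: float = 0.2,
                            noise_clip: float = 0.5):
    """TD3-style target for agent i: smoothed target actions + the smaller of two critics."""
    with torch.no_grad():
        next_actions = []
        for mu_j in target_actors:
            a_j = mu_j(next_state)
            noise = (noise_std * torch.randn_like(a_j)).clamp(-noise_clip, noise_clip)
            next_actions.append((a_j + noise).clamp(-1.0, 1.0))   # target-action noise
        next_joint = torch.cat(next_actions, dim=-1)
        q_min = torch.min(critic_a(next_state, next_joint),       # Clipped Double Q:
                          critic_b(next_state, next_joint))       # keep the smaller estimate
        return rewards[:, i:i + 1] + gamma * q_min
```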

