DCG

Value Factorization

DCG uses a coordination graph $\langle \mathcal{V},\ \mathcal{E} \rangle$ to factorize the joint action value into independent utilities and pairwise payoffs

$$q_{\theta \phi \psi}^{\text{DCG}}(a \mid h) = \frac{1}{|\mathcal{V}|} \sum_{i = 1}^{n} f_{\theta}^{v}(a^{i} \mid h^{i}) + \frac{1}{|\mathcal{E}|} \sum_{\{i,\ j\} \in \mathcal{E}} f_{\phi}^{e}(a^{i},\ a^{j} \mid h^{i},\ h^{j})$$

where the local observation history $h_{t}^{i} = (o_{\le t}^{i},\ a_{< t}^{i})$ is encoded through a common recurrent neural network $h_{\psi}$.
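As a minimal sketch (PyTorch, with illustrative tensor shapes and names that are not taken from the paper's code), the factorized joint value for a chosen joint action is the mean of the selected utilities plus the mean of the selected payoffs:

```python
import torch

def dcg_joint_q(f_v, f_e, edges, actions):
    """Sketch of q^DCG(a | h) for one time step (names are illustrative).

    f_v:     (n, A)       per-agent utilities  f_theta^v(. | h^i)
    f_e:     (|E|, A, A)  pairwise payoff tables f_phi^e(., . | h^i, h^j), one per edge
    edges:   list of (i, j) agent-index pairs defining the coordination graph
    actions: (n,) long tensor with the chosen action of each agent
    """
    n, n_edges = f_v.shape[0], len(edges)
    # 1/|V| * sum_i f^v(a^i | h^i)
    util = f_v[torch.arange(n), actions].sum() / n
    # 1/|E| * sum_{i,j in E} f^e(a^i, a^j | h^i, h^j)
    pay = sum(f_e[e, actions[i], actions[j]] for e, (i, j) in enumerate(edges)) / n_edges
    return util + pay
```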

The parameters $\theta$, $\phi$ and $\psi$ are shared across agents, and all utility and payoff functions operate over the common action space $\cup_{i = 1}^{n} \mathcal{A}^{i}$, in which invalid entries (actions unavailable to a given agent) are set to $-\infty$. The payoffs can be further expressed as a low-rank approximation

$$f_{\phi}^{e}(a^{i},\ a^{j} \mid h^{i},\ h^{j}) = \sum_{k = 1}^{K} \hat{f}_{\hat{\phi}}^{k}(a^{i} \mid h^{i},\ h^{j})\, \bar{f}_{\bar{\phi}}^{k}(a^{j} \mid h^{i},\ h^{j})$$
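A hedged sketch of this rank-$K$ approximation, assuming a hypothetical `LowRankPayoff` module whose two linear heads each emit $K$ vectors of length $|\mathcal{A}|$; their outer products are summed into the full payoff table:

```python
import torch
import torch.nn as nn

class LowRankPayoff(nn.Module):
    """Rank-K approximation of the pairwise payoff table (illustrative sketch)."""

    def __init__(self, hidden_dim, n_actions, K=4):
        super().__init__()
        self.K, self.A = K, n_actions
        # Heads map the concatenated histories (h^i, h^j) to K x |A| factors.
        self.left = nn.Linear(2 * hidden_dim, K * n_actions)   # \hat{f}^k(a^i | h^i, h^j)
        self.right = nn.Linear(2 * hidden_dim, K * n_actions)  # \bar{f}^k(a^j | h^i, h^j)

    def forward(self, h_i, h_j):
        h = torch.cat([h_i, h_j], dim=-1)                # (batch, 2 * hidden)
        left = self.left(h).view(-1, self.K, self.A)     # (batch, K, A)
        right = self.right(h).view(-1, self.K, self.A)   # (batch, K, A)
        # f^e(a^i, a^j) = sum_k left[k, a^i] * right[k, a^j]  -> full (A x A) payoff table
        return torch.einsum('bka,bkc->bac', left, right)  # (batch, A, A)
```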

To be invariant to permutations of the agents' indices, DCG rewrites the payoff term in a symmetric form

$$q_{\theta \phi \psi}^{\text{DCG}}(a \mid h) = \frac{1}{|\mathcal{V}|} \sum_{i = 1}^{n} f_{\theta}^{v}(a^{i} \mid h^{i}) + \frac{1}{2|\mathcal{E}|} \sum_{\{i,\ j\} \in \mathcal{E}} \Big( f_{\phi}^{e}(a^{i},\ a^{j} \mid h^{i},\ h^{j}) + f_{\phi}^{e}(a^{j},\ a^{i} \mid h^{j},\ h^{i}) \Big)$$
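In code, this symmetrization amounts to evaluating the payoff network in both edge directions and averaging; a minimal sketch reusing the hypothetical `LowRankPayoff`-style module above:

```python
def symmetric_payoff(payoff_net, h_i, h_j):
    """Average the payoff over both edge orientations (sketch; payoff_net as above)."""
    p_ij = payoff_net(h_i, h_j)                    # (batch, A, A), axes = (a^i, a^j)
    p_ji = payoff_net(h_j, h_i).transpose(-2, -1)  # transpose so axes are also (a^i, a^j)
    # The factor 1/2 corresponds to the 1/(2|E|) normalization in the symmetric form.
    return 0.5 * (p_ij + p_ji)
```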

In addition, for some tasks the global state $s$ is available during centralized training and can be added as a privileged bias

$$q_{\theta \phi \psi \varphi}^{\text{DCG-S}}(a \mid s,\ h) = q_{\theta \phi \psi}^{\text{DCG}}(a \mid h) + v_{\varphi}(s)$$
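A short sketch of this bias term, assuming a small hypothetical MLP for $v_{\varphi}$; since the bias does not depend on the actions, greedy action selection is unaffected:

```python
import torch.nn as nn

class StateBias(nn.Module):
    """v_phi(s): privileged state-value bias, usable only during centralized training."""

    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)  # one scalar bias per sample

# q_dcg_s = q_dcg + StateBias(...)(s); the bias is action-independent.
```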

Action Selection

The joint action-value function with the above factorization can be trained with Double Q-learning

$$\mathcal{L}(w) = \mathbb{E}_{(h_{t},\ a_{t},\ r_{t},\ h_{t + 1}) \sim \mathcal{D}} \Big[ r_{t} + \gamma\, q_{w^{-}}(\hat{a}_{t + 1} \mid h_{t + 1}) - q_{w}(a_{t} \mid h_{t}) \Big]^{2} \quad \text{s.t.} \quad \hat{a}_{t + 1} = \operatorname*{arg\,max}_{a}\, q_{w}(a \mid h_{t + 1})$$
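A hedged sketch of one training step under these assumptions, where `q_online`/`q_target` compute the factorized joint value under the online and target parameters and `greedy_action` is the maximization routine described below (all names and batch keys are illustrative):

```python
import torch

def dcg_td_loss(q_online, q_target, greedy_action, batch, gamma=0.99):
    """Double Q-learning loss for the factorized joint value (illustrative sketch).

    q_online(h, a), q_target(h, a): joint values under online / target parameters.
    greedy_action(q_fn, h): argmax_a q_fn(h, a), computed by message passing.
    batch: dict of tensors sampled from the replay buffer D.
    """
    h, a, r, h_next, done = (batch[k] for k in ("h", "a", "r", "h_next", "done"))
    # Action chosen by the online network, evaluated by the target network.
    a_next = greedy_action(q_online, h_next)
    with torch.no_grad():
        # (1 - done) is the usual terminal-state mask, left implicit in the loss above.
        target = r + gamma * (1.0 - done) * q_target(h_next, a_next)
    td_error = target - q_online(h, a)
    return td_error.pow(2).mean()
```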

The greedy action $\hat{a}_{t + 1} = \operatorname*{arg\,max}_{a}\, q_{w}(a \mid h_{t + 1})$ is selected via message passing (belief propagation) on the coordination graph.
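A minimal sketch of this maximization as synchronous max-plus message passing over the edges for a fixed number of iterations; practical refinements such as message normalization are omitted, and names and shapes are illustrative:

```python
import torch

def greedy_action(f_v, f_e, edges, n_iters=8):
    """Approximate argmax_a q^DCG(a | h) with max-plus message passing (sketch).

    f_v:   (n, A)       per-agent utilities
    f_e:   (|E|, A, A)  symmetric pairwise payoffs, axes ordered (a^i, a^j)
    edges: list of (i, j) agent-index pairs
    """
    n, A = f_v.shape
    n_edges = len(edges)
    msg_ij = torch.zeros(n_edges, A)  # message from i to j (a function of a^j)
    msg_ji = torch.zeros(n_edges, A)  # message from j to i (a function of a^i)

    util = f_v / n        # absorb the 1/|V| normalization
    pay = f_e / n_edges   # absorb the 1/|E| normalization

    for _ in range(n_iters):
        # Sum of incoming messages at every agent.
        incoming = torch.zeros(n, A)
        for e, (i, j) in enumerate(edges):
            incoming[j] += msg_ij[e]
            incoming[i] += msg_ji[e]
        new_ij, new_ji = torch.zeros_like(msg_ij), torch.zeros_like(msg_ji)
        for e, (i, j) in enumerate(edges):
            # Message i -> j: max over a^i of (utility + other incoming messages + payoff).
            base_i = util[i] + incoming[i] - msg_ji[e]             # exclude j's own message
            new_ij[e] = (base_i.unsqueeze(1) + pay[e]).max(dim=0).values
            base_j = util[j] + incoming[j] - msg_ij[e]
            new_ji[e] = (base_j.unsqueeze(1) + pay[e].t()).max(dim=0).values
        msg_ij, msg_ji = new_ij, new_ji

    # Each agent picks the action maximizing its utility plus all incoming messages.
    incoming = torch.zeros(n, A)
    for e, (i, j) in enumerate(edges):
        incoming[j] += msg_ij[e]
        incoming[i] += msg_ji[e]
    return (util + incoming).argmax(dim=-1)
```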

