DCG

Value Factorization

DCG uses a coordination graph $\langle \mathcal{V},\ \mathcal{E} \rangle$ to factorize the joint action value into independent utilities and pairwise payoffs

$$q_{\theta \phi \psi}^{\text{DCG}}(a \mid h) = \frac{1}{|\mathcal{V}|} \sum_{i = 1}^{n} f_{\theta}^{v}(a^{i} \mid h^{i}) + \frac{1}{|\mathcal{E}|} \sum_{\{i,\ j\} \in \mathcal{E}} f_{\phi}^{e}(a^{i},\ a^{j} \mid h^{i},\ h^{j})$$

where the local observation history $h_{t}^{i} = (o_{\le t}^{i},\ a_{< t}^{i})$ is encoded through a common recurrent neural network $h_{\psi}$.
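As a minimal sketch (PyTorch, with illustrative tensor shapes and names that are not taken from the paper's code), the factorized joint value for a chosen joint action is the mean of the selected utilities plus the mean of the selected payoffs:

```python
import torch

def dcg_joint_q(f_v, f_e, edges, actions):
    """Sketch of q^DCG(a | h) for one time step (names are illustrative).

    f_v:     (n, A)       per-agent utilities  f_theta^v(. | h^i)
    f_e:     (|E|, A, A)  pairwise payoff tables f_phi^e(., . | h^i, h^j), one per edge
    edges:   list of (i, j) agent-index pairs defining the coordination graph
    actions: (n,) long tensor with the chosen action of each agent
    """
    n, n_edges = f_v.shape[0], len(edges)
    # 1/|V| * sum_i f^v(a^i | h^i)
    util = f_v[torch.arange(n), actions].sum() / n
    # 1/|E| * sum_{i,j in E} f^e(a^i, a^j | h^i, h^j)
    pay = sum(f_e[e, actions[i], actions[j]] for e, (i, j) in enumerate(edges)) / n_edges
    return util + pay
```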

The parameters $\theta$, $\phi$ and $\psi$ are shared across agents, and all utility and payoff functions operate over the common action space $\cup_{i = 1}^{n} \mathcal{A}^{i}$, in which invalid entries (actions unavailable to a given agent) are set to $-\infty$. The payoffs can be further expressed as a low-rank approximation

$$f_{\phi}^{e}(a^{i},\ a^{j} \mid h^{i},\ h^{j}) = \sum_{k = 1}^{K} \hat{f}_{\hat{\phi}}^{k}(a^{i} \mid h^{i},\ h^{j})\, \bar{f}_{\bar{\phi}}^{k}(a^{j} \mid h^{i},\ h^{j})$$
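A hedged sketch of this rank-$K$ approximation, assuming a hypothetical `LowRankPayoff` module whose two linear heads each emit $K$ vectors of length $|\mathcal{A}|$; their outer products are summed into the full payoff table:

```python
import torch
import torch.nn as nn

class LowRankPayoff(nn.Module):
    """Rank-K approximation of the pairwise payoff table (illustrative sketch)."""

    def __init__(self, hidden_dim, n_actions, K=4):
        super().__init__()
        self.K, self.A = K, n_actions
        # Heads map the concatenated histories (h^i, h^j) to K x |A| factors.
        self.left = nn.Linear(2 * hidden_dim, K * n_actions)   # \hat{f}^k(a^i | h^i, h^j)
        self.right = nn.Linear(2 * hidden_dim, K * n_actions)  # \bar{f}^k(a^j | h^i, h^j)

    def forward(self, h_i, h_j):
        h = torch.cat([h_i, h_j], dim=-1)                # (batch, 2 * hidden)
        left = self.left(h).view(-1, self.K, self.A)     # (batch, K, A)
        right = self.right(h).view(-1, self.K, self.A)   # (batch, K, A)
        # f^e(a^i, a^j) = sum_k left[k, a^i] * right[k, a^j]  -> full (A x A) payoff table
        return torch.einsum('bka,bkc->bac', left, right)  # (batch, A, A)
```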

To be invariant to permutations of the agents' indices, DCG rewrites the payoff term in a symmetric form

$$q_{\theta \phi \psi}^{\text{DCG}}(a \mid h) = \frac{1}{|\mathcal{V}|} \sum_{i = 1}^{n} f_{\theta}^{v}(a^{i} \mid h^{i}) + \frac{1}{2|\mathcal{E}|} \sum_{\{i,\ j\} \in \mathcal{E}} \Big( f_{\phi}^{e}(a^{i},\ a^{j} \mid h^{i},\ h^{j}) + f_{\phi}^{e}(a^{j},\ a^{i} \mid h^{j},\ h^{i}) \Big)$$
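In code, this symmetrization amounts to evaluating the payoff network in both edge directions and averaging; a minimal sketch reusing the hypothetical `LowRankPayoff`-style module above:

```python
def symmetric_payoff(payoff_net, h_i, h_j):
    """Average the payoff over both edge orientations (sketch; payoff_net as above)."""
    p_ij = payoff_net(h_i, h_j)                    # (batch, A, A), axes = (a^i, a^j)
    p_ji = payoff_net(h_j, h_i).transpose(-2, -1)  # transpose so axes are also (a^i, a^j)
    # The factor 1/2 corresponds to the 1/(2|E|) normalization in the symmetric form.
    return 0.5 * (p_ij + p_ji)
```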

In addition, for some tasks the global state $s$ is available during centralized training and can be added as a privileged bias

$$q_{\theta \phi \psi \varphi}^{\text{DCG-S}}(a \mid s,\ h) = q_{\theta \phi \psi}^{\text{DCG}}(a \mid h) + v_{\varphi}(s)$$
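A short sketch of this bias term, assuming a small hypothetical MLP for $v_{\varphi}$; since the bias does not depend on the actions, greedy action selection is unaffected:

```python
import torch.nn as nn

class StateBias(nn.Module):
    """v_phi(s): privileged state-value bias, usable only during centralized training."""

    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)  # one scalar bias per sample

# q_dcg_s = q_dcg + StateBias(...)(s); the bias is action-independent.
```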

Action Selection

The joint action-value function with the above factorization can be trained with Double Q-learning

$$\mathcal{L}(w) = \mathbb{E}_{(h_{t},\ a_{t},\ r_{t},\ h_{t + 1}) \sim \mathcal{D}} \Big[ r_{t} + \gamma\, q_{w^{-}}(\hat{a}_{t + 1} \mid h_{t + 1}) - q_{w}(a_{t} \mid h_{t}) \Big]^{2} \quad \text{s.t.} \quad \hat{a}_{t + 1} = \operatorname*{arg\,max}_{a}\, q_{w}(a \mid h_{t + 1})$$
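A hedged sketch of one training step under these assumptions, where `q_online`/`q_target` compute the factorized joint value under the online and target parameters and `greedy_action` is the maximization routine described below (all names and batch keys are illustrative):

```python
import torch

def dcg_td_loss(q_online, q_target, greedy_action, batch, gamma=0.99):
    """Double Q-learning loss for the factorized joint value (illustrative sketch).

    q_online(h, a), q_target(h, a): joint values under online / target parameters.
    greedy_action(q_fn, h): argmax_a q_fn(h, a), computed by message passing.
    batch: dict of tensors sampled from the replay buffer D.
    """
    h, a, r, h_next, done = (batch[k] for k in ("h", "a", "r", "h_next", "done"))
    # Action chosen by the online network, evaluated by the target network.
    a_next = greedy_action(q_online, h_next)
    with torch.no_grad():
        # (1 - done) is the usual terminal-state mask, left implicit in the loss above.
        target = r + gamma * (1.0 - done) * q_target(h_next, a_next)
    td_error = target - q_online(h, a)
    return td_error.pow(2).mean()
```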

The greedy action $\hat{a}_{t + 1} = \operatorname*{arg\,max}_{a}\, q_{w}(a \mid h_{t + 1})$ is selected via message passing (belief propagation) on the coordination graph.
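A minimal sketch of this maximization as synchronous max-plus message passing over the edges for a fixed number of iterations; practical refinements such as message normalization are omitted, and names and shapes are illustrative:

```python
import torch

def greedy_action(f_v, f_e, edges, n_iters=8):
    """Approximate argmax_a q^DCG(a | h) with max-plus message passing (sketch).

    f_v:   (n, A)       per-agent utilities
    f_e:   (|E|, A, A)  symmetric pairwise payoffs, axes ordered (a^i, a^j)
    edges: list of (i, j) agent-index pairs
    """
    n, A = f_v.shape
    n_edges = len(edges)
    msg_ij = torch.zeros(n_edges, A)  # message from i to j (a function of a^j)
    msg_ji = torch.zeros(n_edges, A)  # message from j to i (a function of a^i)

    util = f_v / n        # absorb the 1/|V| normalization
    pay = f_e / n_edges   # absorb the 1/|E| normalization

    for _ in range(n_iters):
        # Sum of incoming messages at every agent.
        incoming = torch.zeros(n, A)
        for e, (i, j) in enumerate(edges):
            incoming[j] += msg_ij[e]
            incoming[i] += msg_ji[e]
        new_ij, new_ji = torch.zeros_like(msg_ij), torch.zeros_like(msg_ji)
        for e, (i, j) in enumerate(edges):
            # Message i -> j: max over a^i of (utility + other incoming messages + payoff).
            base_i = util[i] + incoming[i] - msg_ji[e]             # exclude j's own message
            new_ij[e] = (base_i.unsqueeze(1) + pay[e]).max(dim=0).values
            base_j = util[j] + incoming[j] - msg_ij[e]
            new_ji[e] = (base_j.unsqueeze(1) + pay[e].t()).max(dim=0).values
        msg_ij, msg_ji = new_ij, new_ji

    # Each agent picks the action maximizing its utility plus all incoming messages.
    incoming = torch.zeros(n, A)
    for e, (i, j) in enumerate(edges):
        incoming[j] += msg_ij[e]
        incoming[i] += msg_ji[e]
    return (util + incoming).argmax(dim=-1)
```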

