DCG
Value Factorization
DCG uses a coordination graph $\langle V, E \rangle$ to factorize the joint action value into independent utilities and pairwise payoffs
$$q^{\text{DCG}}_{\theta\phi\psi}(a \mid h) = \frac{1}{|V|} \sum_{i=1}^{n} f^v_\theta(a_i \mid h_i) + \frac{1}{|E|} \sum_{\{i,j\} \in E} f^e_\phi(a_i, a_j \mid h_i, h_j)$$
where the local observation history $h^i_t = (o^i_{\le t}, a^i_{<t})$ is encoded through a common recurrent neural network $h_\psi$.
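As a concrete illustration, the sketch below assembles the factorized joint value from per-agent utilities and per-edge payoff matrices. All names, shapes, and the example graph (`f_v`, `f_e`, `edges`) are hypothetical placeholders; in DCG these quantities are produced by the utility and payoff heads on top of the shared recurrent encoder $h_\psi$.

```python
import torch

n_agents, n_actions = 3, 5
edges = [(0, 1), (1, 2)]                      # coordination graph E (example topology)

# In DCG these come from the utility / payoff heads over the encoded histories;
# here they are random placeholders with the right shapes.
f_v = torch.randn(n_agents, n_actions)                # f_theta^v(. | h_i), one row per agent
f_e = torch.randn(len(edges), n_actions, n_actions)   # f_phi^e(., . | h_i, h_j), one matrix per edge

def q_dcg(joint_action):
    """q^DCG(a | h): mean utility over vertices plus mean payoff over edges."""
    utility = sum(f_v[i, joint_action[i]] for i in range(n_agents)) / n_agents
    payoff = sum(f_e[k, joint_action[i], joint_action[j]]
                 for k, (i, j) in enumerate(edges)) / len(edges)
    return utility + payoff

print(q_dcg([0, 2, 4]))
```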
The parameters $\theta$, $\phi$, and $\psi$ are shared among agents, and all utility and payoff functions share the same action space $\bigcup_{i=1}^{n} A_i$, in which invalid entries are set to $-\infty$. The payoffs can be further expressed as a low-rank approximation
$$f^e_\phi(a_i, a_j \mid h_i, h_j) = \sum_{k=1}^{K} \hat{f}^k_{\hat{\phi}}(a_i \mid h_i, h_j) \, \bar{f}^k_{\bar{\phi}}(a_j \mid h_i, h_j)$$
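A minimal sketch of this rank-$K$ approximation, assuming the two factor heads are linear layers over the concatenated hidden states (the layer types and sizes are illustrative, not the paper's exact architecture): instead of predicting a full $|A| \times |A|$ table per edge, each head outputs $K$ action vectors and the payoff matrix is their sum of outer products.

```python
import torch
import torch.nn as nn

n_actions, hidden_dim, K = 5, 64, 4
# illustrative factor heads: map the pair encoding (h_i, h_j) to K action vectors each
head_hat = nn.Linear(2 * hidden_dim, K * n_actions)   # \hat f^k(a_i | h_i, h_j)
head_bar = nn.Linear(2 * hidden_dim, K * n_actions)   # \bar f^k(a_j | h_i, h_j)

h_i, h_j = torch.randn(hidden_dim), torch.randn(hidden_dim)
pair = torch.cat([h_i, h_j])

f_hat = head_hat(pair).view(K, n_actions)             # (K, |A_i|)
f_bar = head_bar(pair).view(K, n_actions)             # (K, |A_j|)
payoff = torch.einsum('ka,kb->ab', f_hat, f_bar)      # sum of K rank-1 terms, (|A_i|, |A_j|)
print(payoff.shape)
```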
To make the factorization invariant to permutations of the agents' indices, DCG rewrites the payoffs in a symmetric form
$$q^{\text{DCG}}_{\theta\phi\psi}(a \mid h) = \frac{1}{|V|} \sum_{i=1}^{n} f^v_\theta(a_i \mid h_i) + \frac{1}{2|E|} \sum_{\{i,j\} \in E} \Big( f^e_\phi(a_i, a_j \mid h_i, h_j) + f^e_\phi(a_j, a_i \mid h_j, h_i) \Big)$$
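The symmetrization amounts to evaluating the payoff head in both edge orientations and averaging (the $\frac{1}{2|E|}$ factor), so that relabelling agents $i \leftrightarrow j$ leaves the joint value unchanged. A small sketch, assuming a hypothetical `payoff_fn(h_a, h_b)` that returns the matrix $f^e_\phi(\cdot, \cdot \mid h_a, h_b)$:

```python
import torch

def symmetric_payoff(payoff_fn, h_i, h_j):
    """Average the payoff over both edge orientations, aligned on the (a_i, a_j) axes."""
    forward = payoff_fn(h_i, h_j)        # f_phi^e(a_i, a_j | h_i, h_j)
    backward = payoff_fn(h_j, h_i).T     # f_phi^e(a_j, a_i | h_j, h_i), transposed to (a_i, a_j)
    return 0.5 * (forward + backward)

# toy payoff head: a bilinear map from the concatenated hidden states to an |A| x |A| table
hidden_dim, n_actions = 64, 5
W = torch.randn(n_actions, n_actions, 2 * hidden_dim)
toy_payoff = lambda ha, hb: torch.einsum('abd,d->ab', W, torch.cat([ha, hb]))
print(symmetric_payoff(toy_payoff, torch.randn(hidden_dim), torch.randn(hidden_dim)).shape)
```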
In addition, for some tasks the global state $s$ is available during centralized training and can be introduced as a privileged bias
$$q^{\text{DCG-S}}_{\theta\phi\psi\varphi}(a \mid s, h) = q^{\text{DCG}}_{\theta\phi\psi}(a \mid h) + v_\varphi(s)$$
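A sketch of this privileged bias, assuming $v_\varphi$ is a small MLP over the global state (the network shape is illustrative). Because the bias does not depend on the joint action, it only shifts the value estimate during centralized training and leaves greedy action selection unchanged.

```python
import torch
import torch.nn as nn

state_dim = 20
# v_phi(s): state-value head used only during centralized training (illustrative MLP)
v_phi = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def q_dcg_s(q_dcg_value, state):
    """q^DCG-S(a | s, h) = q^DCG(a | h) + v_phi(s); the bias is constant in a."""
    return q_dcg_value + v_phi(state).squeeze(-1)

print(q_dcg_s(torch.tensor(0.7), torch.randn(state_dim)))
```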
Action Selection
The joint action-value function with the above factorization can be trained through Double Q-Learning
$$\mathcal{L}(w) = \mathbb{E}_{(h_t,\, a_t,\, r_t,\, h_{t+1}) \sim \mathcal{D}}\Big[ \big( r_t + \gamma\, q_{w^-}(\hat{a}_{t+1} \mid h_{t+1}) - q_w(a_t \mid h_t) \big)^2 \Big] \quad \text{s.t.} \quad \hat{a}_{t+1} = \operatorname*{arg\,max}_{a}\, q_w(a \mid h_{t+1})$$
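A sketch of this loss: the greedy joint action is chosen with the online network and evaluated with the target network, following Double Q-Learning. The callable interfaces (`q_value`, `q_target_value`, `greedy`) are hypothetical wrappers around the factorized model, not the paper's API.

```python
import torch

def dcg_td_loss(q_value, q_target_value, greedy, batch, gamma=0.99):
    """Double Q-Learning loss for the factorized joint value.

    q_value(h, a)        -> q_w(a | h)      (online network)
    q_target_value(h, a) -> q_{w^-}(a | h)  (target network)
    greedy(h)            -> argmax_a q_w(a | h), e.g. the message-passing sketch below
    (all three callables are assumed interfaces)
    """
    h_t, a_t, r_t, h_tp1 = batch['h'], batch['a'], batch['r'], batch['h_next']

    q_taken = q_value(h_t, a_t)                              # q_w(a_t | h_t)
    with torch.no_grad():
        a_hat = greedy(h_tp1)                                # chosen by the online network
        target = r_t + gamma * q_target_value(h_tp1, a_hat)  # evaluated by the target network
    return ((target - q_taken) ** 2).mean()
```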
The greedy action $\hat{a}_{t+1} = \operatorname*{arg\,max}_{a}\, q_w(a \mid h_{t+1})$ is selected through a message-passing (belief propagation) algorithm on the coordination graph.
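Below is a sketch of max-plus message passing over the coordination graph for the (approximate) greedy argmax, reusing the `f_v` / `f_e` / `edges` layout from the first sketch. The number of iterations and the mean-normalization of messages are implementation assumptions.

```python
import torch

def greedy_actions(f_v, f_e, edges, n_iters=8):
    """Approximate argmax_a of q^DCG(a | h) via max-plus message passing."""
    n_agents, n_actions = f_v.shape
    util = f_v / n_agents                          # 1/|V| * f_theta^v
    pay = {}                                       # directed payoff tables
    for k, (i, j) in enumerate(edges):
        pay[(i, j)] = f_e[k] / len(edges)          # indexed [a_i, a_j]
        pay[(j, i)] = f_e[k].T / len(edges)        # indexed [a_j, a_i]
    msgs = {key: torch.zeros(n_actions) for key in pay}

    for _ in range(n_iters):
        new_msgs = {}
        for (i, j) in pay:
            # sender i's utility plus messages arriving at i from neighbours other than j
            incoming = util[i] + sum(m for (k, t), m in msgs.items() if t == i and k != j)
            # mu_{i->j}(a_j) = max_{a_i} [ incoming(a_i) + payoff_{ij}(a_i, a_j) ]
            msg = (incoming.unsqueeze(1) + pay[(i, j)]).max(dim=0).values
            new_msgs[(i, j)] = msg - msg.mean()    # normalize to keep messages bounded
        msgs = new_msgs

    # each agent maximizes its utility plus all incoming messages
    return [int((util[i] + sum(m for (k, t), m in msgs.items() if t == i)).argmax())
            for i in range(n_agents)]

# toy usage with the same layout as the first sketch
n_agents, n_actions = 3, 5
edges = [(0, 1), (1, 2)]
f_v = torch.randn(n_agents, n_actions)
f_e = torch.randn(len(edges), n_actions, n_actions)
print(greedy_actions(f_v, f_e, edges))
```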