DGN
General Structure
DGN views the multi-agent environment as a dynamic coordination graph that varies over time, where adjacency among agents is determined by a specific metric such as distance, and neighbouring agents can communicate with each other
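As a concrete, non-authoritative illustration, the neighbourhood $N(i)$ could be built by taking each agent's $k$ nearest agents under Euclidean distance; the metric, the value of $k$, and the helper below are assumptions for this sketch, not the paper's code.

```python
import numpy as np

def build_neighbourhoods(positions, k=3):
    """Illustrative sketch: N(i) = indices of the k closest agents to agent i.
    Euclidean distance as the adjacency metric and k as a fixed neighbourhood
    size are assumptions; DGN only requires *some* metric such as distance."""
    n = positions.shape[0]
    # pairwise distances, shape (n, n)
    dists = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # exclude the agent itself
    return [list(np.argsort(dists[i])[:k]) for i in range(n)]
```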
Agent $i$ fuses its local information with that of its neighbours $N(i)$ through multi-head attention in a convolutional layer
$$h_{i} \leftarrow \sigma \left\{ \operatorname{concat}_{m} \left[ \sum_{j \in N^{+}(i)} \alpha_{ij}^{m} W_{V}^{m} h_{j} \right] \right\} \quad \text{s.t.} \quad \alpha_{ij}^{m} = \operatorname{softmax}_{j \in N^{+}(i)} \Big( \tau\, (W_{Q}^{m} h_{i})^{\top} (W_{K}^{m} h_{j}) \Big), \quad N^{+}(i) = N(i) \cup \{ i \}$$
where $h_{i}$ is initially encoded from the individual observation $o_{i}$ and then processed by multiple convolutional layers
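A minimal PyTorch sketch of one such convolutional layer follows; the head count, the scaling choice $\tau = 1/\sqrt{d_k}$, and ReLU as $\sigma$ are assumptions, and the class name is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionConvLayer(nn.Module):
    """Sketch of one DGN-style convolutional layer: each agent attends over N+(i) = N(i) ∪ {i}."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads, self.d_k = heads, dim // heads
        self.W_Q = nn.Linear(dim, dim, bias=False)
        self.W_K = nn.Linear(dim, dim, bias=False)
        self.W_V = nn.Linear(dim, dim, bias=False)

    def forward(self, h, adj):
        # h:   (n_agents, dim) feature vectors
        # adj: (n_agents, n_agents) 0/1 mask including self-loops, i.e. N+(i)
        n = h.size(0)
        q = self.W_Q(h).view(n, self.heads, self.d_k)
        k = self.W_K(h).view(n, self.heads, self.d_k)
        v = self.W_V(h).view(n, self.heads, self.d_k)
        # per-head scaled dot-product scores, shape (heads, n, n); tau = 1/sqrt(d_k) is assumed
        scores = torch.einsum('imd,jmd->mij', q, k) / self.d_k ** 0.5
        scores = scores.masked_fill(adj.unsqueeze(0) == 0, float('-inf'))
        alpha = F.softmax(scores, dim=-1)             # attention over j ∈ N+(i)
        out = torch.einsum('mij,jmd->imd', alpha, v)  # weighted sum of W_V h_j
        return F.relu(out.reshape(n, -1)), alpha      # concatenate heads; sigma = ReLU (assumed)
```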
The Q network of agent $i$ takes the feature vectors of all preceding layers as input and outputs the individual action values
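Reflecting these dense connections, a hypothetical Q head could simply concatenate the encoder output with every convolutional layer's output before a final linear layer; the layer sizes and names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DGNQHead(nn.Module):
    """Sketch of a Q head over the concatenated features of all preceding layers for one agent."""
    def __init__(self, dim, n_layers, n_actions):
        super().__init__()
        self.q = nn.Linear(dim * (n_layers + 1), n_actions)

    def forward(self, layer_features):
        # layer_features: list of (n_agents, dim) tensors from the encoder and each conv layer
        return self.q(torch.cat(layer_features, dim=-1))  # (n_agents, n_actions)
```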
Learning Objective
The network parameters $\theta$ are shared by all agents and trained end-to-end with the target Q-learning algorithm
$$\mathcal{L}(\theta) = \mathbb{E}_{(o_{1:N},\, a_{1:N},\, r_{1:N},\, o_{1:N}') \in \mathcal{D}} \Bigg\{ \frac{1}{N} \sum_{i = 1}^{N} \bigg[ \underbrace{r_{i} + \gamma \max_{a} Q_{\theta^{-}}(o_{i}',\, o_{j \in N(i)}',\, a)}_{y_{i}} - Q_{\theta}(o_{i},\, o_{j \in N(i)},\, a_{i}) \bigg]^{2} \Bigg\}$$
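A sketch of this TD objective with a target network $\theta^{-}$ might look as follows; `q_net`, `target_net`, and the batch layout are assumed names for illustration, not DGN's actual interface.

```python
import torch

def dgn_td_loss(q_net, target_net, batch, gamma=0.99):
    """Sketch of the per-agent TD loss with a target network (theta^-).
    Each network is assumed to map (observations of agent i and its neighbours,
    adjacency) -> (n_agents, n_actions) Q-values."""
    obs, actions, rewards, next_obs, adj, next_adj = batch
    with torch.no_grad():
        # y_i = r_i + gamma * max_a Q_{theta^-}(o'_i, o'_{j in N(i)}, a)
        y = rewards + gamma * target_net(next_obs, next_adj).max(dim=-1).values
    # Q_theta(o_i, o_{j in N(i)}, a_i) for the actions actually taken
    q = q_net(obs, adj).gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    return ((y - q) ** 2).mean()  # average over the N agents (and the batch)
```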
In addition, to make the coordination relationship more stable and consistent over time, DGN introduces a temporal relation regularization between the attention weight distributions $\alpha_{i}^{m,\kappa}$ of a high-level layer $\kappa$ at the current step and the next step
$$\mathcal{L}(\theta) = \mathbb{E}_{(o_{1:N},\, a_{1:N},\, r_{1:N},\, o_{1:N}') \in \mathcal{D}} \left\{ \frac{1}{N} \sum_{i = 1}^{N} \left( \Big[ y_{i} - Q_{\theta}(o_{i},\, o_{j \in N(i)},\, a_{i}) \Big]^{2} + \lambda\, \frac{1}{M} \sum_{m = 1}^{M} D_{\text{KL}} \Big( \alpha_{i}^{m,\kappa}\ \Big\|\ \tilde{\alpha}_{i}^{m,\kappa} \Big) \right) \right\}$$
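The extra KL term could be computed from the layer-$\kappa$ attention weights of the current and next step as in the sketch below; the `(heads, n_agents, n_agents)` tensor layout is an assumption carried over from the layer sketch above.

```python
import torch

def temporal_relation_loss(alpha_kappa, next_alpha_kappa, eps=1e-8):
    """Sketch of the KL regulariser between attention distributions of layer kappa
    at the current step (alpha_i^{m,kappa}) and the next step (tilde alpha_i^{m,kappa}),
    averaged over the M heads and N agents. Tensors: (heads, n_agents, n_agents)."""
    p = alpha_kappa.clamp_min(eps)       # current-step attention over j in N+(i)
    q = next_alpha_kappa.clamp_min(eps)  # next-step attention from the target pass
    kl = (p * (p.log() - q.log())).sum(dim=-1)  # D_KL per head and per agent
    return kl.mean()                            # mean over heads M and agents N
```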