LAGMA

State Embedding

LAGMA constructs a discretized low-dimensional state embedding through a VQ-VAE (encoder $f_{\phi}$, decoder $f_{\psi}$, codebook $\boldsymbol{e}$):

$$
\begin{gathered}
\mathcal{L}_{\text{VQ}}(\phi,\ \psi,\ \boldsymbol{e}) = \Big\| f_{\psi}(x_{q}) - s \Big\|_{2}^{2} + \lambda_{\text{vq}} \Big\| \operatorname{sg}[x] - x_{q} \Big\|_{2}^{2} + \lambda_{\text{commit}} \Big\| x - \operatorname{sg}[x_{q}] \Big\|_{2}^{2} \\[5mm]
x = f_{\phi}(s) \quad [x]_{q} = \argmin_{e \in \boldsymbol{e}} \| x - e \|_{2} \quad f_{\psi}(x_{q}) \triangleq f_{\psi} \Big( \operatorname{sg}[x_{q}] + x - \operatorname{sg}[x] \Big)
\end{gathered}
$$
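As a concrete reference, here is a minimal PyTorch sketch of this objective with the straight-through estimator. The encoder/decoder architectures, codebook size, and loss weights are illustrative assumptions, not LAGMA's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQVAE(nn.Module):
    # Hypothetical architecture; only the loss structure follows the equation above.
    def __init__(self, state_dim, latent_dim, n_codes, lambda_vq=1.0, lambda_commit=0.25):
        super().__init__()
        self.f_phi = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        self.f_psi = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, state_dim))
        self.codebook = nn.Parameter(torch.randn(n_codes, latent_dim))  # e
        self.lambda_vq, self.lambda_commit = lambda_vq, lambda_commit

    def quantize(self, x):
        # [x]_q = argmin_{e in codebook} ||x - e||_2
        dist = torch.cdist(x, self.codebook)       # (B, n_codes)
        idx = dist.argmin(dim=-1)                  # (B,)
        return self.codebook[idx], idx

    def vq_loss(self, s):
        x = self.f_phi(s)
        x_q, _ = self.quantize(x)
        # straight-through estimator: f_psi(sg[x_q] + x - sg[x])
        s_hat = self.f_psi(x_q.detach() + x - x.detach())
        recon = F.mse_loss(s_hat, s)               # ||f_psi(x_q) - s||^2 (mean-squared version)
        vq = F.mse_loss(x_q, x.detach())           # ||sg[x] - x_q||^2 (updates the codebook)
        commit = F.mse_loss(x, x_q.detach())       # ||x - sg[x_q]||^2 (updates the encoder)
        return recon + self.lambda_vq * vq + self.lambda_commit * commit
```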

However, the projected embedding space of feasible states $\chi$ is narrow compared to the whole embedding space: most randomly initialized quantized vectors lie far from $\chi$, so only a few codebook entries are ever selected and trained throughout an episode.

(Figure: codebook placement relative to $\chi$ under raw VQ-VAE training, training with $\mathcal{L}_{\text{cvr}}^{\text{all}}$, and training with $\mathcal{L}_{\text{cvr}}$.)

To resolve this issue, LAGMA introduces an additional coverage loss that minimizes the overall distance between $x$ and $\boldsymbol{e}$:

$$
\mathcal{L}_{\text{cvr}}^{\text{all}}(\boldsymbol{e}) = \frac{1}{n_{c}} \sum_{j = 1}^{n_{c}} \Big\| \operatorname{sg}[x] - e_{j} \Big\|_{2}^{2}
$$

However, such a coverage loss may push all quantized vectors toward the center of $\chi$, so LAGMA instead adopts a variant of $\mathcal{L}_{\text{cvr}}^{\text{all}}$:

$$
\mathcal{L}_{\text{cvr}}(\boldsymbol{e}) = \frac{1}{|\mathcal{J}(t)|} \sum_{j \in \mathcal{J}(t)} \Big\| \operatorname{sg}[x = f_{\phi}(s_{t})] - e_{j} \Big\|_{2}^{2}
$$

where the timestep-dependent index set $\mathcal{J}(t)$ distributes the quantized vectors uniformly across timesteps.
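A small sketch of both coverage terms is shown below. The way $\mathcal{J}(t)$ assigns an equal, contiguous share of the $n_c$ codes to each timestep of an episode of length $T$ is an assumption for illustration, not the paper's exact indexing scheme.

```python
import torch
import torch.nn.functional as F

def cvr_all_loss(x, codebook):
    # L_cvr^all: pull every codebook entry toward the stop-gradient encoding sg[x]
    # x: (latent_dim,) latent encoding of a single state; codebook: (n_codes, latent_dim)
    return F.mse_loss(codebook, x.detach().expand_as(codebook))

def timestep_index_set(t, T, n_codes):
    # Assumed J(t): an equal, contiguous block of codes per timestep bucket
    codes_per_step = max(n_codes // T, 1)
    start = (t * codes_per_step) % n_codes
    return torch.arange(start, min(start + codes_per_step, n_codes))

def cvr_loss(x_t, codebook, t, T):
    # L_cvr: only the codes indexed by J(t) are pulled toward sg[x_t]
    idx = timestep_index_set(t, T, codebook.shape[0])
    return F.mse_loss(codebook[idx], x_t.detach().expand(len(idx), -1))
```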

Intrinsic Reward

With the codebook, LAGMA records the cumulative return of $s$ into the buffer of the corresponding embedding $e = [f_{\phi}(s)]_{q}$.

The returns stored in this FIFO buffer are then used for a count-based moving-average estimate of the state value function:

$$
V(s) \approx C_{q}(s) = \frac{1}{m} \sum_{i = 1}^{m} R_{i}\big(e = [f_{\phi}(s)]_{q}\big)
$$
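A minimal sketch of this estimate, assuming one fixed-length FIFO buffer of returns per codebook index (the buffer size $m$ is an assumed hyperparameter):

```python
from collections import defaultdict, deque

class CodebookReturnBuffer:
    """Per-code FIFO buffers of cumulative returns, yielding C_q(s)."""

    def __init__(self, m=32):
        self.buffers = defaultdict(lambda: deque(maxlen=m))  # code index -> recent returns

    def add(self, code_idx, cumulative_return):
        # store the return R_i observed from a state s with e = [f_phi(s)]_q
        self.buffers[code_idx].append(cumulative_return)

    def value(self, code_idx):
        # V(s) ~= C_q(s): moving average of the stored returns for this code
        buf = self.buffers[code_idx]
        return sum(buf) / len(buf) if buf else 0.0
```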

LAGMA also stores goal-reaching trajectories $\tau = \{ s_{t:T} \}$ in the buffer of $e = [f_{\phi}(s_{t})]_{q}$, with $C_{q}(s_{t})$ as the priority.
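Below is a sketch of such a per-code trajectory buffer $\mathcal{D}_{\tau}(e)$; the fixed capacity, the priority-based eviction, and the greedy sampling rule are assumptions made for illustration.

```python
import heapq

class TrajectoryBuffer:
    """Stores goal-reaching trajectories under each code, prioritized by C_q(s_t)."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.store = {}      # code index -> min-heap of (priority, counter, trajectory)
        self._counter = 0    # tie-breaker so heapq never compares trajectories directly

    def add(self, code_idx, priority, trajectory):
        heap = self.store.setdefault(code_idx, [])
        heapq.heappush(heap, (priority, self._counter, trajectory))
        self._counter += 1
        if len(heap) > self.capacity:
            heapq.heappop(heap)              # evict the lowest-priority trajectory

    def sample(self, code_idx):
        # tau* ~ D_tau(e): here simply return the highest-priority trajectory, if any
        heap = self.store.get(code_idx, [])
        return max(heap)[2] if heap else None
```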

These goal-reaching trajectories serve as references so that only the desired transitions in $\tau = \{ s_{t:T} \}$ are incentivized:

$$
r^{I} = \gamma \max\left\{ \Big[ C_{q}(s') - \max_{a'} Q_{\theta^{-}}^{\text{tot}}(s',\ a') \Big],\ 0 \right\} \qquad \text{s.t.} \quad x_{q}' \in \tau^{*} \wedge x_{q} \ne x_{q}' \wedge \tau^{*} \sim \mathcal{D}_{\tau}(e) \wedge e = [f_{\phi}(s_{t})]_{q}
$$
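In code, the bonus might look like the sketch below; the caller is assumed to supply $C_{q}(s')$, the greedy target value $\max_{a'} Q_{\theta^{-}}^{\text{tot}}(s', a')$, and the set of quantized codes along the sampled reference trajectory $\tau^{*}$.

```python
def intrinsic_reward(code, next_code, reference_codes, c_q_next, q_tot_next_max, gamma):
    # r^I is only granted for desired transitions: the next quantized code must lie
    # on the sampled goal-reaching trajectory tau* and differ from the current code.
    if next_code not in reference_codes or next_code == code:
        return 0.0
    # gamma * max{ C_q(s') - max_a' Q_tot_target(s', a'), 0 }
    return gamma * max(c_q_next - q_tot_next_max, 0.0)
```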

The latent goal-guided intrinsic reward $r^{I}$ is then incorporated into MARL training; the overall objective is

$$
\mathcal{L}(\theta) = \Big[ r + r^{I} + \gamma \max_{a'} Q_{\theta^{-}}^{\text{tot}}(s',\ a') - Q_{\theta}^{\text{tot}}(s,\ a) \Big]^{2}
$$
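A sketch of this TD objective with the intrinsic bonus added to the environment reward; a frozen target network for $Q_{\theta^{-}}^{\text{tot}}$ and a terminal mask are assumed, as in standard value-factorization training.

```python
import torch

def td_loss(q_tot, q_tot_target_next_max, reward, r_intrinsic, gamma, terminated):
    # target built from the frozen network theta^-; (1 - terminated) masks terminal states
    target = reward + r_intrinsic + gamma * (1.0 - terminated) * q_tot_target_next_max
    return ((target.detach() - q_tot) ** 2).mean()
```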

For the desired transitions, this loss converges to the optimal Bellman loss as $C_{q}(s')$ converges to $V^{\star}(s')$.

