TRAMA
Trajectory Classification
Similar to LAGMA, TRAMA adopts a VQ-VAE with an additional coverage loss to construct a quantized latent space over states:
$$\mathcal{L}_{VQ}^{tot}(\phi, e) = \mathcal{L}_{VQ}(\phi, e) + \lambda_{cvr}\mathcal{L}_{cvr}(e) = \mathcal{L}_{VQ}(\phi, e) + \frac{\lambda_{cvr}}{|\mathcal{J}(t, k)|}\sum_{j \in \mathcal{J}(t, k)} \left\| \mathrm{sg}\!\left[f_\phi^e(s_t^k)\right] - e_j \right\|_2^2$$
where $\mathcal{J}(t, k)$, in contrast to $\mathcal{J}(t)$ in LAGMA, additionally distributes the codebook uniformly among trajectory classes.
(Figure: codebook indices partitioned over timesteps only, $\mathcal{J}(t)$, in LAGMA vs. over timesteps and trajectory classes, $\mathcal{J}(t, k)$, in TRAMA)
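A minimal PyTorch sketch of how the coverage term could be computed is given below. The helper `class_time_indices`, the tensor shapes, and the specific way the codebook is split among classes and timesteps are illustrative assumptions, not the authors' implementation.

```python
import torch

def coverage_loss(z_e, codebook, idx_set):
    """Coverage term of L_VQ^tot: pull the codebook vectors indexed by
    J(t, k) toward the (stop-gradient) encoder output of the current state.

    z_e      : (D,)   encoder output f_phi^e(s_t^k)
    codebook : (K, D) codebook embeddings e_1..e_K
    idx_set  : LongTensor holding the indices in J(t, k)
    """
    target = z_e.detach()                   # sg[f_phi^e(s_t^k)]
    diffs = codebook[idx_set] - target      # (|J(t,k)|, D)
    return diffs.pow(2).sum(dim=-1).mean()  # (1/|J(t,k)|) * sum_j ||.||_2^2

def class_time_indices(t, k, T, n_classes, n_codes):
    """Hypothetical uniform partition: split the codebook first among the
    trajectory classes, then among timesteps within each class block."""
    per_class = n_codes // n_classes
    per_step = max(per_class // (T + 1), 1)
    start = k * per_class + t * per_step
    return torch.arange(start, start + per_step).clamp(max=(k + 1) * per_class - 1)

# usage sketch
codebook = torch.randn(512, 64, requires_grad=True)
z_e = torch.randn(64)
J = class_time_indices(t=10, k=2, T=60, n_classes=4, n_codes=512)
l_cvr = coverage_loss(z_e, codebook, J)
```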
TRAMA performs k-means clustering on trajectories sampled from the replay buffer, based on their trajectory embeddings:
$$\bar{e}(\tau = \{s_{0:T}\}) = \sum_{s \in \tau} e(s) = \sum_{s \in \tau} \left[x = f_\phi^e(s)\right]_q$$
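A NumPy sketch of this embedding, assuming nearest-neighbour quantization onto a fixed codebook; the stand-in linear encoder is a placeholder for $f_\phi^e$.

```python
import numpy as np

def quantize(x, codebook):
    """[x]_q: nearest-neighbour quantization of the encoder output x
    onto the codebook (rows are the embedding vectors e_j)."""
    j = np.argmin(np.linalg.norm(codebook - x, axis=1))
    return codebook[j]

def trajectory_embedding(states, encoder, codebook):
    """e_bar(tau) = sum_{s in tau} e(s), with e(s) the quantized code of s."""
    return sum(quantize(encoder(s), codebook) for s in states)

# usage sketch with a stand-in linear encoder
rng = np.random.default_rng(0)
codebook = rng.standard_normal((512, 64))
W = rng.standard_normal((10, 64))                    # placeholder for f_phi^e
encoder = lambda s: s @ W
tau = [rng.standard_normal(10) for _ in range(50)]   # one trajectory s_{0:T}
e_bar = trajectory_embedding(tau, encoder, codebook)
```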
To label the remaining trajectories without class labels, TRAMA trains a trajectory classifier $f_\psi(\cdot \mid \bar{e})$ on the $M$ clustered trajectories:
$$\mathcal{L}(\psi) = -\frac{1}{M}\sum_{m=1}^{M} \mathbb{1}\!\left(\bar{k}_m = \hat{k}_m\right)\log f_\psi\!\left(\hat{k}_m \mid \bar{e}_m\right)$$
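A sketch of this masked classification loss in PyTorch; the MLP architecture of the classifier is an assumption.

```python
import torch
import torch.nn as nn

class TrajectoryClassifier(nn.Module):
    """f_psi(. | e_bar): predicts a trajectory class from the trajectory
    embedding (the MLP architecture here is an assumption)."""
    def __init__(self, emb_dim, n_classes, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))

    def forward(self, e_bar):
        return self.net(e_bar).log_softmax(dim=-1)

def classifier_loss(log_probs, cluster_labels):
    """L(psi): negative log-likelihood over the M clustered trajectories,
    masked by the indicator 1(k_bar_m == k_hat_m) as in the objective above.

    log_probs      : (M, K) log f_psi(. | e_bar_m)
    cluster_labels : (M,)   k-means labels k_bar_m
    """
    k_hat = log_probs.argmax(dim=-1)                    # predicted class k_hat_m
    mask = (k_hat == cluster_labels).float()            # 1(k_bar_m == k_hat_m)
    nll = -log_probs.gather(1, k_hat.unsqueeze(1)).squeeze(1)
    return (mask * nll).mean()                          # (1/M) * sum_m ...
```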
To keep class labels consistent across training, the k-means centroids are initialized with the centroids obtained from the previous clustering round.
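A small sketch of this warm start using scikit-learn's `KMeans`; passing the previous centroids as `init` is one straightforward way to realize it.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_trajectories(embeddings, n_classes, prev_centroids=None):
    """k-means over trajectory embeddings e_bar. Re-using the previous
    centroids as the initialization keeps the meaning of each class
    label k stable from one clustering round to the next."""
    if prev_centroids is None:
        km = KMeans(n_clusters=n_classes, n_init=10, random_state=0)
    else:
        km = KMeans(n_clusters=n_classes, init=prev_centroids, n_init=1)
    labels = km.fit_predict(embeddings)
    return labels, km.cluster_centers_

# usage sketch: second round warm-started with the first round's centroids
embs = np.random.randn(200, 64)
labels0, cents0 = cluster_trajectories(embs, n_classes=4)
labels1, cents1 = cluster_trajectories(embs + 0.01, n_classes=4, prev_centroids=cents0)
```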
Multi-Task Policy Learning
With the labeled trajectories, TRAMA trains the agents to predict trajectory classes solely from their local observations:
$$\mathcal{L}(\zeta) = -\frac{1}{B}\sum_{b=1}^{B}\left[\sum_{t=0}^{T}\sum_{i=1}^{n}\mathbb{1}\!\left(\bar{k} = \hat{k}_t^i\right)\log \pi_\zeta\!\left(\hat{k}_t^i \mid o_t^i, h_{g,t-1}^i\right)\right]_b$$
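A sketch of a recurrent per-agent class predictor and the masked loss; the GRU architecture and tensor layout are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ClassPredictor(nn.Module):
    """pi_zeta(. | o_t^i, h_{g,t-1}^i): GRU-based predictor applied to one
    agent's local observation stream (architecture is an assumption)."""
    def __init__(self, obs_dim, n_classes, hidden=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, obs_seq, h0=None):
        out, h = self.gru(obs_seq, h0)            # out: (B, T, hidden)
        return self.head(out).log_softmax(-1), h  # (B, T, K) class log-probs

def class_prediction_loss(log_probs, traj_class):
    """L(zeta): masked NLL over batch, timesteps, and agents.
    log_probs  : (B, T, n_agents, K) per-agent class log-probs
    traj_class : (B,)                episode-level label k_bar
    """
    k_hat = log_probs.argmax(-1)                             # k_hat_t^i
    k_bar = traj_class.view(-1, 1, 1).expand_as(k_hat)       # broadcast k_bar
    mask = (k_hat == k_bar).float()                          # 1(k_bar == k_hat_t^i)
    nll = -log_probs.gather(-1, k_hat.unsqueeze(-1)).squeeze(-1)
    return (mask * nll).sum(dim=(1, 2)).mean()               # (1/B) * sum_b [...]_b
```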
The multi-task policy is conditioned on trajectory class embeddings generated by a class representation model $f_{\theta_g}(\cdot \mid \hat{k}_t^i)$.
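The simplest possible sketch of such a representation model is a learned embedding table; this deterministic version is an assumption, whereas the objective below samples $g \sim f_{\theta_g}(\cdot \mid \hat{k})$.

```python
import torch.nn as nn

class ClassRepresentation(nn.Module):
    """f_{theta_g}(. | k_hat): maps a predicted trajectory class index to
    the embedding g that conditions the multi-task policy / value network."""
    def __init__(self, n_classes, g_dim=32):
        super().__init__()
        self.emb = nn.Embedding(n_classes, g_dim)

    def forward(self, k_hat):          # k_hat: LongTensor of class indices
        return self.emb(k_hat)         # g: (..., g_dim)
```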
The class representation model $f_{\theta_g}(\hat{k}_t^i)$ and the value network $Q_\theta^{tot}(s, a)$ are trained jointly through the following objective:
$$\mathcal{L}(\theta) = \mathbb{E}_{o, a, r, o' \sim \mathcal{D}}\,\mathbb{E}_{\hat{k} \sim \pi_\zeta(\cdot \mid o),\, \hat{k}' \sim \pi_\zeta(\cdot \mid o')}\,\mathbb{E}_{g \sim f_{\theta_g}(\cdot \mid \hat{k}),\, g' \sim f_{\theta_g}(\cdot \mid \hat{k}')}\left[r + \max_{a'} Q_{\theta^-}^{tot}(o', a' \mid g') - Q_\theta^{tot}(o, a \mid g)\right]^2$$
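A heavily simplified, single-network sketch of this joint TD objective. The `q_net(o, g)` signature, the discount factor, and the deterministic `class_repr` are assumptions; in TRAMA the value is the mixed $Q^{tot}$ over all agents, which is omitted here for brevity.

```python
import torch

def joint_td_loss(q_net, target_q_net, class_repr, batch, gamma=0.99):
    """L(theta): squared TD error of the class-conditioned value network.
    The bootstrap target uses the target network Q_{theta^-} and the
    next-step class embedding g'. Gradients flow into both Q_theta and
    the class representation f_{theta_g}, training them jointly.

    batch keys: o, a, r, o_next, k_hat, k_hat_next
    (k_hat, k_hat_next are class labels sampled from pi_zeta).
    """
    g = class_repr(batch["k_hat"])             # g  from f_{theta_g}(. | k_hat)
    g_next = class_repr(batch["k_hat_next"])   # g' for the bootstrap target

    q = q_net(batch["o"], g)                                 # (B, n_actions)
    q_taken = q.gather(1, batch["a"].unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_q_net(batch["o_next"], g_next).max(dim=1).values
        target = batch["r"] + gamma * q_next   # gamma is not shown in the
                                               # objective above; assumed here
    return (target - q_taken).pow(2).mean()
```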