TRAMA

Trajectory Classification

Similar to LAGMA, TRAMA adopts a VQ-VAE with an additional coverage loss to construct a quantized latent space for states:

$$
\mathcal{L}_{\text{VQ}}^{\text{tot}}(\phi,\ \boldsymbol{e}) = \mathcal{L}_{\text{VQ}}(\phi,\ \boldsymbol{e}) + \lambda_{\text{cvr}} \mathcal{L}_{\text{cvr}}(\boldsymbol{e}) = \mathcal{L}_{\text{VQ}}(\phi,\ \boldsymbol{e}) + \lambda_{\text{cvr}} \frac{1}{|\mathcal{J}(t,\ k)|} \sum_{j \in \mathcal{J}(t,\ k)} \Big\| \operatorname{sg} \left[ f_{\phi}^{e}(s_{t}^{k}) \right] - e_{j} \Big\|_{2}^{2}
$$

where $\mathcal{J}(t,\ k)$ additionally distributes the codebook uniformly among trajectory classes, compared to $\mathcal{J}(t)$ in LAGMA.

(Figure: codebook partition $\mathcal{J}(t)$ in LAGMA vs. $\mathcal{J}(t,\ k)$ in TRAMA)
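As a rough PyTorch sketch (function and tensor names are my own placeholders, not TRAMA's code), the class-partitioned coverage term could look like:

```python
import torch

def coverage_loss(z_e, codebook, idx_subset):
    """Coverage term L_cvr(e): pull the codebook vectors reserved for the
    partition J(t, k) toward the (stop-gradient) encoder output f_phi^e(s_t^k).

    z_e        : (D,)   encoder output for state s_t^k
    codebook   : (K, D) full codebook e
    idx_subset : (J,)   long tensor of code indices in J(t, k)
    """
    e_j = codebook[idx_subset]                 # (J, D) codes reserved for this (t, k)
    diff = e_j - z_e.detach()                  # stop-gradient on the encoder side
    return diff.pow(2).sum(dim=-1).mean()      # (1/|J|) * sum_j ||sg[z_e] - e_j||_2^2

# Total objective (lambda_cvr is a hyperparameter):
#   L_vq_tot = L_vq + lambda_cvr * coverage_loss(...)
```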

TRAMA performs k-means clustering on trajectories sampled from the replay buffer, based on the trajectory embeddings:

$$
\bar{e}(\tau = \{ s_{0:\mathrm{T}} \}) = \sum_{s \in \tau} e(s) = \sum_{s \in \tau} \left[ x = f_{\phi}^{e}(s) \right]_{q}
$$
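A minimal sketch of accumulating this trajectory embedding from quantized state codes (the `encoder` and `codebook` names are assumptions):

```python
import torch

def trajectory_embedding(states, encoder, codebook):
    """bar_e(tau) = sum_{s in tau} e(s): sum of the quantized (nearest-code)
    embeddings of every state in the trajectory.

    states   : (T, obs_dim) stacked states s_0..s_T
    encoder  : maps states to latents x = f_phi^e(s), shape (T, D)
    codebook : (K, D) VQ-VAE codebook
    """
    x = encoder(states)                        # (T, D) continuous latents
    dists = torch.cdist(x, codebook)           # (T, K) distances to every code
    nearest = dists.argmin(dim=-1)             # (T,)   index of the nearest code
    e_s = codebook[nearest]                    # (T, D) quantized embeddings [x]_q
    return e_s.sum(dim=0)                      # (D,)   trajectory embedding bar_e
```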

To label the remaining trajectories without a class label, TRAMA trains a trajectory classifier $f_{\psi}(\cdot \mid \bar{e})$ on the $M$ clustered trajectories:

$$
\mathcal{L}(\psi) = -\frac{1}{M} \sum_{m = 1}^{M} \boldsymbol{1}(\bar{k}_{m} = \hat{k}_{m}) \log f_{\psi}(\hat{k}_{m} \mid \bar{e}_{m})
$$
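Reading the indicator as selecting the cluster label, the objective is essentially a cross-entropy over the $M$ clustered trajectories; a minimal sketch (classifier architecture and names are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical classifier f_psi: trajectory embedding -> class logits.
class TrajectoryClassifier(nn.Module):
    def __init__(self, embed_dim, n_classes, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, e_bar):
        return self.net(e_bar)                  # unnormalized logits over classes

def classifier_loss(f_psi, e_bar, k_bar):
    """L(psi): cross-entropy with the k-means cluster label as the target.

    e_bar : (M, D) trajectory embeddings of the M clustered trajectories
    k_bar : (M,)   cluster labels from k-means
    """
    logits = f_psi(e_bar)
    return F.cross_entropy(logits, k_bar)       # -(1/M) sum_m log f_psi(k_bar_m | e_bar_m)
```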

To keep class labels consistent across training, the k-means centroids are initialized with the centroids obtained in the previous clustering round.
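A sketch of warm-starting k-means from the previous centroids with scikit-learn (variable names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_trajectories(e_bar, n_classes, prev_centroids=None):
    """Cluster trajectory embeddings; reuse the previous centroids as the
    initialization so class k keeps (roughly) the same meaning over training.

    e_bar          : (M, D) trajectory embeddings
    prev_centroids : (n_classes, D) centroids from the previous round, or None
    """
    init = prev_centroids if prev_centroids is not None else "k-means++"
    n_init = 1 if prev_centroids is not None else 10
    km = KMeans(n_clusters=n_classes, init=init, n_init=n_init).fit(e_bar)
    return km.labels_, km.cluster_centers_
```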

Multi-Task Policy Learning

With the labeled trajectories, TRAMA trains the agents to predict the trajectory class solely from their local observations:

$$
\mathcal{L}(\zeta) = -\frac{1}{B} \sum_{b = 1}^{B} \left[ \sum_{t = 0}^{\mathrm{T}} \sum_{i = 1}^{n} \boldsymbol{1}(\bar{k} = \hat{k}_{t}^{i}) \log \pi_{\zeta}(\hat{k}_{t}^{i} \mid o_{t}^{i},\ h_{g,\ t - 1}^{i}) \right]_{b}
$$
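A sketch of a per-agent class-prediction head $\pi_{\zeta}$: a GRU cell over local observations whose per-step output is scored against the trajectory's class label (module and tensor names are assumptions, not TRAMA's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical per-agent predictor pi_zeta: (o_t^i, h_{g,t-1}^i) -> class logits.
class ClassPredictor(nn.Module):
    def __init__(self, obs_dim, n_classes, hidden=64):
        super().__init__()
        self.rnn = nn.GRUCell(obs_dim, hidden)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, obs, h_prev):
        h = self.rnn(obs, h_prev)               # h_{g,t}^i from (o_t^i, h_{g,t-1}^i)
        return self.head(h), h

def prediction_loss(predictor, obs_seq, k_bar, h0):
    """Inner term of L(zeta) for one episode: cross-entropy between each
    agent's per-step class prediction and the episode's trajectory class.
    The full loss averages this quantity over the B sampled episodes.

    obs_seq : (T, n_agents, obs_dim) local observations of one episode
    k_bar   : scalar long tensor, trajectory class label of this episode
    h0      : (n_agents, hidden) initial recurrent states
    """
    T, n_agents, _ = obs_seq.shape
    h, loss = h0, 0.0
    for t in range(T):
        logits, h = predictor(obs_seq[t], h)    # (n_agents, n_classes)
        target = k_bar.expand(n_agents)
        loss = loss + F.cross_entropy(logits, target, reduction="sum")
    return loss                                 # summed over timesteps and agents
```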

The multi-task policy is conditioned on trajectory class embeddings generated by a class representation model $f_{\theta}^{g}(\cdot \mid \hat{k}_{t}^{i})$.

The class representation model $f_{\theta}^{g}(\hat{k}_{t}^{i})$ and the value network $Q_{\theta}^{\text{tot}}(s,\ a)$ are trained jointly through the following objective:

$$
\mathcal{L}(\theta) = \mathbb{E}_{o,\ a,\ r,\ o' \sim \mathcal{D}}\ \mathbb{E}_{\hat{k} \sim \pi_{\zeta}(\cdot \mid o),\ \hat{k}' \sim \pi_{\zeta}(\cdot \mid o')}\ \mathbb{E}_{g \sim f_{\theta}^{g}(\cdot \mid \hat{k}),\ g' \sim f_{\theta}^{g}(\cdot \mid \hat{k}')} \Big[ r + \max_{a'} Q_{\theta^{-}}^{\text{tot}}(o',\ a' \mid g') - Q_{\theta}^{\text{tot}}(o,\ a \mid g) \Big]^{2}
$$
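A single-stream sketch of this objective; the real $Q_{\theta}^{\text{tot}}$ is a mixed multi-agent value, and `q_net`, `class_embed`, and the discount factor (left implicit in the formula above) are my assumptions:

```python
import torch

def td_loss(q_net, q_target_net, class_embed, batch, gamma=0.99):
    """L(theta): squared TD error with both the online and target value
    conditioned on a class embedding g = f_theta^g(k_hat).

    batch: dict with 'obs', 'act', 'rew', 'next_obs', plus class labels
           'k_hat' and 'k_hat_next' already sampled from pi_zeta.
    """
    o, a, r, o2 = batch["obs"], batch["act"], batch["rew"], batch["next_obs"]
    g = class_embed(batch["k_hat"])             # g  ~ f_theta^g(. | k_hat)
    g2 = class_embed(batch["k_hat_next"])       # g' ~ f_theta^g(. | k_hat')

    q = q_net(o, g).gather(-1, a.unsqueeze(-1)).squeeze(-1)   # Q_theta(o, a | g)
    with torch.no_grad():
        q_next = q_target_net(o2, g2).max(dim=-1).values      # max_a' Q_theta^-(o', a' | g')
        target = r + gamma * q_next
    return (target - q).pow(2).mean()
```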

