TRAMA
Trajectory Classification
Similar to LAGMA, TRAMA adopts a VQ-VAE with an additional coverage loss to construct a quantized latent space over states:
$$\mathcal{L}_{VQ}^{tot}(\phi, e) = \mathcal{L}_{VQ}(\phi, e) + \lambda_{cvr}\mathcal{L}_{cvr}(e) = \mathcal{L}_{VQ}(\phi, e) + \frac{\lambda_{cvr}}{|\mathcal{J}(t, k)|}\sum_{j \in \mathcal{J}(t, k)} \left\| \mathrm{sg}\!\left[f_\phi^e(s_t^k)\right] - e_j \right\|_2^2$$
where $\mathcal{J}(t, k)$, in contrast to $\mathcal{J}(t)$ in LAGMA, additionally distributes the codebook uniformly among trajectory classes.
(Figure: codebook indices partitioned over timesteps only, $\mathcal{J}(t)$, in LAGMA vs. over timesteps and trajectory classes, $\mathcal{J}(t, k)$, in TRAMA)
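A minimal PyTorch sketch of how the coverage term could be computed is given below. The helper `class_time_indices`, the tensor shapes, and the specific way the codebook is split among classes and timesteps are illustrative assumptions, not the authors' implementation.

```python
import torch

def coverage_loss(z_e, codebook, idx_set):
    """Coverage term of L_VQ^tot: pull the codebook vectors indexed by
    J(t, k) toward the (stop-gradient) encoder output of the current state.

    z_e      : (D,)   encoder output f_phi^e(s_t^k)
    codebook : (K, D) codebook embeddings e_1..e_K
    idx_set  : LongTensor holding the indices in J(t, k)
    """
    target = z_e.detach()                   # sg[f_phi^e(s_t^k)]
    diffs = codebook[idx_set] - target      # (|J(t,k)|, D)
    return diffs.pow(2).sum(dim=-1).mean()  # (1/|J(t,k)|) * sum_j ||.||_2^2

def class_time_indices(t, k, T, n_classes, n_codes):
    """Hypothetical uniform partition: split the codebook first among the
    trajectory classes, then among timesteps within each class block."""
    per_class = n_codes // n_classes
    per_step = max(per_class // (T + 1), 1)
    start = k * per_class + t * per_step
    return torch.arange(start, start + per_step).clamp(max=(k + 1) * per_class - 1)

# usage sketch
codebook = torch.randn(512, 64, requires_grad=True)
z_e = torch.randn(64)
J = class_time_indices(t=10, k=2, T=60, n_classes=4, n_codes=512)
l_cvr = coverage_loss(z_e, codebook, J)
```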
TRAMA performs k-means clustering on trajectories sampled from the replay buffer, based on their trajectory embeddings:
$$\bar{e}(\tau = \{s_{0:T}\}) = \sum_{s \in \tau} e(s) = \sum_{s \in \tau} \left[x = f_\phi^e(s)\right]_q$$
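A NumPy sketch of this embedding, assuming nearest-neighbour quantization onto a fixed codebook; the stand-in linear encoder is a placeholder for $f_\phi^e$.

```python
import numpy as np

def quantize(x, codebook):
    """[x]_q: nearest-neighbour quantization of the encoder output x
    onto the codebook (rows are the embedding vectors e_j)."""
    j = np.argmin(np.linalg.norm(codebook - x, axis=1))
    return codebook[j]

def trajectory_embedding(states, encoder, codebook):
    """e_bar(tau) = sum_{s in tau} e(s), with e(s) the quantized code of s."""
    return sum(quantize(encoder(s), codebook) for s in states)

# usage sketch with a stand-in linear encoder
rng = np.random.default_rng(0)
codebook = rng.standard_normal((512, 64))
W = rng.standard_normal((10, 64))                    # placeholder for f_phi^e
encoder = lambda s: s @ W
tau = [rng.standard_normal(10) for _ in range(50)]   # one trajectory s_{0:T}
e_bar = trajectory_embedding(tau, encoder, codebook)
```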
To label the remaining trajectories without class labels, TRAMA trains a trajectory classifier $f_\psi(\cdot \mid \bar{e})$ on the $M$ clustered trajectories:
$$\mathcal{L}(\psi) = -\frac{1}{M}\sum_{m=1}^{M} \mathbb{1}\!\left(\bar{k}_m = \hat{k}_m\right)\log f_\psi\!\left(\hat{k}_m \mid \bar{e}_m\right)$$
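A sketch of this masked classification loss in PyTorch; the MLP architecture of the classifier is an assumption.

```python
import torch
import torch.nn as nn

class TrajectoryClassifier(nn.Module):
    """f_psi(. | e_bar): predicts a trajectory class from the trajectory
    embedding (the MLP architecture here is an assumption)."""
    def __init__(self, emb_dim, n_classes, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))

    def forward(self, e_bar):
        return self.net(e_bar).log_softmax(dim=-1)

def classifier_loss(log_probs, cluster_labels):
    """L(psi): negative log-likelihood over the M clustered trajectories,
    masked by the indicator 1(k_bar_m == k_hat_m) as in the objective above.

    log_probs      : (M, K) log f_psi(. | e_bar_m)
    cluster_labels : (M,)   k-means labels k_bar_m
    """
    k_hat = log_probs.argmax(dim=-1)                    # predicted class k_hat_m
    mask = (k_hat == cluster_labels).float()            # 1(k_bar_m == k_hat_m)
    nll = -log_probs.gather(1, k_hat.unsqueeze(1)).squeeze(1)
    return (mask * nll).mean()                          # (1/M) * sum_m ...
```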
To keep class labels consistent across training, the k-means centroids are initialized with the centroids obtained from the previous clustering round.
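A small sketch of this warm start using scikit-learn's `KMeans`; passing the previous centroids as `init` is one straightforward way to realize it.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_trajectories(embeddings, n_classes, prev_centroids=None):
    """k-means over trajectory embeddings e_bar. Re-using the previous
    centroids as the initialization keeps the meaning of each class
    label k stable from one clustering round to the next."""
    if prev_centroids is None:
        km = KMeans(n_clusters=n_classes, n_init=10, random_state=0)
    else:
        km = KMeans(n_clusters=n_classes, init=prev_centroids, n_init=1)
    labels = km.fit_predict(embeddings)
    return labels, km.cluster_centers_

# usage sketch: second round warm-started with the first round's centroids
embs = np.random.randn(200, 64)
labels0, cents0 = cluster_trajectories(embs, n_classes=4)
labels1, cents1 = cluster_trajectories(embs + 0.01, n_classes=4, prev_centroids=cents0)
```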
Multi-Task Policy Learning
With the labeled trajectories, TRAMA trains the agents to predict trajectory classes solely from their local observations:
$$\mathcal{L}(\zeta) = -\frac{1}{B}\sum_{b=1}^{B}\left[\sum_{t=0}^{T}\sum_{i=1}^{n}\mathbb{1}\!\left(\bar{k} = \hat{k}_t^i\right)\log \pi_\zeta\!\left(\hat{k}_t^i \mid o_t^i, h_{g,t-1}^i\right)\right]_b$$
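A sketch of a recurrent per-agent class predictor and the masked loss; the GRU architecture and tensor layout are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ClassPredictor(nn.Module):
    """pi_zeta(. | o_t^i, h_{g,t-1}^i): GRU-based predictor applied to one
    agent's local observation stream (architecture is an assumption)."""
    def __init__(self, obs_dim, n_classes, hidden=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, obs_seq, h0=None):
        out, h = self.gru(obs_seq, h0)            # out: (B, T, hidden)
        return self.head(out).log_softmax(-1), h  # (B, T, K) class log-probs

def class_prediction_loss(log_probs, traj_class):
    """L(zeta): masked NLL over batch, timesteps, and agents.
    log_probs  : (B, T, n_agents, K) per-agent class log-probs
    traj_class : (B,)                episode-level label k_bar
    """
    k_hat = log_probs.argmax(-1)                             # k_hat_t^i
    k_bar = traj_class.view(-1, 1, 1).expand_as(k_hat)       # broadcast k_bar
    mask = (k_hat == k_bar).float()                          # 1(k_bar == k_hat_t^i)
    nll = -log_probs.gather(-1, k_hat.unsqueeze(-1)).squeeze(-1)
    return (mask * nll).sum(dim=(1, 2)).mean()               # (1/B) * sum_b [...]_b
```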
The multi-task policy is conditioned on trajectory class embeddings generated by a class representation model $f_{\theta_g}(\cdot \mid \hat{k}_t^i)$.
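The simplest possible sketch of such a representation model is a learned embedding table; this deterministic version is an assumption, whereas the objective below samples $g \sim f_{\theta_g}(\cdot \mid \hat{k})$.

```python
import torch.nn as nn

class ClassRepresentation(nn.Module):
    """f_{theta_g}(. | k_hat): maps a predicted trajectory class index to
    the embedding g that conditions the multi-task policy / value network."""
    def __init__(self, n_classes, g_dim=32):
        super().__init__()
        self.emb = nn.Embedding(n_classes, g_dim)

    def forward(self, k_hat):          # k_hat: LongTensor of class indices
        return self.emb(k_hat)         # g: (..., g_dim)
```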
The class representation model $f_{\theta_g}(\hat{k}_t^i)$ and the value network $Q_\theta^{tot}(s, a)$ are trained jointly through the following objective:
$$\mathcal{L}(\theta) = \mathbb{E}_{o, a, r, o' \sim \mathcal{D}}\,\mathbb{E}_{\hat{k} \sim \pi_\zeta(\cdot \mid o),\, \hat{k}' \sim \pi_\zeta(\cdot \mid o')}\,\mathbb{E}_{g \sim f_{\theta_g}(\cdot \mid \hat{k}),\, g' \sim f_{\theta_g}(\cdot \mid \hat{k}')}\left[r + \max_{a'} Q_{\theta^-}^{tot}(o', a' \mid g') - Q_\theta^{tot}(o, a \mid g)\right]^2$$
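A heavily simplified, single-network sketch of this joint TD objective. The `q_net(o, g)` signature, the discount factor, and the deterministic `class_repr` are assumptions; in TRAMA the value is the mixed $Q^{tot}$ over all agents, which is omitted here for brevity.

```python
import torch

def joint_td_loss(q_net, target_q_net, class_repr, batch, gamma=0.99):
    """L(theta): squared TD error of the class-conditioned value network.
    The bootstrap target uses the target network Q_{theta^-} and the
    next-step class embedding g'. Gradients flow into both Q_theta and
    the class representation f_{theta_g}, training them jointly.

    batch keys: o, a, r, o_next, k_hat, k_hat_next
    (k_hat, k_hat_next are class labels sampled from pi_zeta).
    """
    g = class_repr(batch["k_hat"])             # g  from f_{theta_g}(. | k_hat)
    g_next = class_repr(batch["k_hat_next"])   # g' for the bootstrap target

    q = q_net(batch["o"], g)                                 # (B, n_actions)
    q_taken = q.gather(1, batch["a"].unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_q_net(batch["o_next"], g_next).max(dim=1).values
        target = batch["r"] + gamma * q_next   # gamma is not shown in the
                                               # objective above; assumed here
    return (target - q_taken).pow(2).mean()
```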