GenRL
World Model Learning
GenRL adopts a variant of the RSSM architecture, using a categorical stochastic latent state $s_t$ as the world model's representation. Its components are:
| Component | Type | Definition |
| --- | --- | --- |
| Encoder | Inference | $s_t \sim q_\phi(s_t \mid x_t)$ |
| Decoder | Generation | $x_t \sim p_\phi(x_t \mid s_t)$ |
| Sequence model | Generation | $h_t = f_\phi(h_{t-1}, s_{t-1}, a_{t-1})$ |
| Dynamics | Generation | $s_t \sim p_\phi(s_t \mid h_t)$ |
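A minimal PyTorch sketch of these four components, assuming vector observations and DreamerV2-style categorical latents (32 groups of 32 classes); all sizes and architectures here are illustrative assumptions rather than GenRL's actual configuration:

```python
import torch
import torch.nn as nn
import torch.distributions as td

OBS_DIM, ACT_DIM, HID_DIM = 64, 6, 256
GROUPS, CLASSES = 32, 32              # categorical latent: 32 groups of 32 classes
STOCH_DIM = GROUPS * CLASSES          # flattened one-hot latent s_t

class WorldModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder q_phi(s_t | x_t): observation -> categorical logits
        self.encoder = nn.Sequential(nn.Linear(OBS_DIM, 256), nn.ELU(),
                                     nn.Linear(256, STOCH_DIM))
        # Decoder p_phi(x_t | s_t): latent -> observation reconstruction
        self.decoder = nn.Sequential(nn.Linear(STOCH_DIM, 256), nn.ELU(),
                                     nn.Linear(256, OBS_DIM))
        # Sequence model h_t = f_phi(h_{t-1}, s_{t-1}, a_{t-1})
        self.sequence = nn.GRUCell(STOCH_DIM + ACT_DIM, HID_DIM)
        # Dynamics prior p_phi(s_t | h_t): deterministic state -> categorical logits
        self.dynamics = nn.Sequential(nn.Linear(HID_DIM, 256), nn.ELU(),
                                      nn.Linear(256, STOCH_DIM))

    def posterior(self, x):
        logits = self.encoder(x).view(-1, GROUPS, CLASSES)
        return td.Independent(td.OneHotCategoricalStraightThrough(logits=logits), 1)

    def prior(self, h):
        logits = self.dynamics(h).view(-1, GROUPS, CLASSES)
        return td.Independent(td.OneHotCategoricalStraightThrough(logits=logits), 1)

    def step(self, h, s, a):
        # h_t = f_phi(h_{t-1}, s_{t-1}, a_{t-1})
        return self.sequence(torch.cat([s, a], dim=-1), h)
```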
The world model is trained to maximize the ELBO of the log-likelihood of sampled trajectories $p(x_{1:T} \mid a_{1:T})$:
$$\max_\phi \; \mathbb{E}_{s_t \sim q_\phi(\cdot \mid x_t)} \left[ \sum_{t=0}^{T} \ln p_\phi(x_t \mid s_t) - D_{\mathrm{KL}}\big(q_\phi(\cdot \mid x_t) \,\|\, p_\phi(s_t \mid h_t)\big) \right] \quad \text{s.t.} \quad h_t = f_\phi(h_{t-1}, s_{t-1}, a_{t-1})$$
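Continuing the sketch above, the negative ELBO can be computed with a rollout over a batch of trajectories; the Gaussian reconstruction term is replaced by a mean-squared error (equivalent up to constants for a unit-variance Gaussian), and common additions such as KL balancing or free bits are omitted:

```python
import torch.nn.functional as F

def world_model_loss(model, xs, acts):
    """xs: (B, T, OBS_DIM) observations, acts: (B, T, ACT_DIM) actions."""
    B, T = xs.shape[:2]
    h = torch.zeros(B, HID_DIM)
    s = torch.zeros(B, STOCH_DIM)
    a = torch.zeros(B, ACT_DIM)
    recon, kl = 0.0, 0.0
    for t in range(T):
        h = model.step(h, s, a)                # h_t = f_phi(h_{t-1}, s_{t-1}, a_{t-1})
        post = model.posterior(xs[:, t])       # q_phi(s_t | x_t)
        prior = model.prior(h)                 # p_phi(s_t | h_t)
        s = post.rsample().flatten(1)          # straight-through sample of s_t
        recon = recon + F.mse_loss(model.decoder(s), xs[:, t])   # -ln p_phi(x_t | s_t) up to constants
        kl = kl + td.kl_divergence(post, prior).mean()           # D_KL(q_phi || p_phi)
        a = acts[:, t]
    return recon + kl                          # negative ELBO, to be minimized
```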
To leverage the knowledge of a pretrained VLM, GenRL connects the VLM's representation space to the world model's latent space. A connector $p_\psi(s_t \mid s_{t-1}, e)$ learns to predict the latent states $s_{t:t+k}$ from the VLM embedding of the observations $x_{t:t+k}$:
$$\mathcal{L}_{\text{conn}} = \sum_{\tau=t}^{t+k} D_{\mathrm{KL}}\big(p_\psi(s_\tau \mid s_{\tau-1}, e) \,\|\, \operatorname{sg}(q_\phi(s_\tau \mid x_\tau))\big) \quad \text{s.t.} \quad e = e^{(v)} = f_{\text{VLM}}^{(v)}(x_{t:t+k})$$
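A sketch of $\mathcal{L}_{\text{conn}}$, continuing the code above and assuming a hypothetical `connector` callable that maps the previous latent and the frozen VLM embedding $e$ to a categorical latent distribution of the same form as the world-model posterior. Whether the previous latent fed to the connector is its own prediction or the posterior sample is an implementation choice; this sketch feeds the prediction forward:

```python
def connector_loss(connector, model, xs, e):
    """xs: (B, k+1, OBS_DIM) observations x_{t:t+k}; e: (B, E) frozen VLM embedding."""
    B = xs.shape[0]
    s_prev = torch.zeros(B, GROUPS, CLASSES)
    loss = 0.0
    for tau in range(xs.shape[1]):
        post = model.posterior(xs[:, tau])                 # q_phi(s_tau | x_tau)
        target = td.Independent(                           # sg(.): detach so only the connector trains
            td.OneHotCategorical(logits=post.base_dist.logits.detach()), 1)
        pred = connector(s_prev.flatten(1), e)             # p_psi(s_tau | s_{tau-1}, e)
        loss = loss + td.kl_divergence(pred, target).mean()
        s_prev = pred.rsample()                            # feed the prediction forward
    return loss
```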
In addition, an aligner $f_\psi(e^{(l)})$ aligns the language modality with the vision modality, bridging the multimodality gap caused by contrastive pretraining:
$$\mathcal{L}_{\text{align}} = \big\| e^{(v)} - f_\psi(e^{(l)}) \big\|_2^2$$
As paired vision-language data is typically unavailable in embodied domains, the aligner can instead be trained in a language-free manner:
$$\mathcal{L}_{\text{align}} = \big\| e^{(v)} - f_\psi(e^{(l)}) \big\|_2^2 \approx \big\| e^{(v)} - f_\psi(e^{(v)} + \epsilon) \big\|_2^2$$
which assumes that language embeddings can be treated as corrupted versions of their vision counterparts.
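A sketch of this language-free training, reusing the imports above: $f_\psi$ is trained to denoise corrupted vision embeddings and is applied to language embeddings at inference time. The embedding size, aligner architecture, and noise scale are assumptions:

```python
EMB_DIM = 512                                   # assumed VLM embedding size

aligner = nn.Sequential(nn.Linear(EMB_DIM, EMB_DIM), nn.ELU(),
                        nn.Linear(EMB_DIM, EMB_DIM))

def aligner_loss(e_vision, noise_scale=0.1):
    """Language-free L_align: reconstruct e^(v) from a corrupted copy e^(v) + eps."""
    eps = noise_scale * torch.randn_like(e_vision)
    return F.mse_loss(aligner(e_vision + eps), e_vision)
```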
Multi-Task Behavior Learning
GenRL uses a trajectory-matching reward for behavior learning on latent states, conditioned on user-prompted tasks:
$$\min_\theta \; \mathbb{E}_{s_t \sim p_\phi(\cdot \mid h_t),\, a_t \sim \pi_\theta(\cdot \mid s_t)} \left[ \sum_{t=0}^{T} \gamma^t \operatorname{distance}\big(p_\phi(\cdot \mid h_t) \,\|\, p_\psi(\cdot \mid s_{t-1}, e_{\text{task}})\big) \right] \quad \text{s.t.} \quad h_t = f_\phi(h_{t-1}, s_{t-1}, a_{t-1})$$
where $e_{\text{task}}$ is the VLM embedding of the task prompt, and the distance between distributions can be the KL divergence or the cosine distance.
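A sketch of one possible choice of the distance, the cosine distance between the class-probability vectors of the dynamics prior $p_\phi(\cdot \mid h_t)$ and the connector's task-conditioned prediction $p_\psi(\cdot \mid s_{t-1}, e_{\text{task}})$; comparing mean probabilities rather than samples is an assumption:

```python
def matching_reward(prior_dist, target_dist):
    """Per-step reward: negative distance between p_phi(. | h_t) and p_psi(. | s_{t-1}, e_task)."""
    p = prior_dist.base_dist.probs.flatten(1)       # (B, GROUPS * CLASSES)
    q = target_dist.base_dist.probs.flatten(1)
    # cosine similarity = 1 - cosine distance, so maximizing this reward minimizes the objective
    return F.cosine_similarity(p, q, dim=-1)
```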
In addition, the initial state of the target trajectory suggested by the VLM from the task prompt may differ from that of the trajectories imagined by the policy and world model, causing temporal misalignment in the reward. GenRL addresses this issue with the following steps (a code sketch follows the list):
- compares the similarity between the initial $b$ states of the target trajectory and a sliding window of $b$ states in the imagined trajectory
- finds the timestep $t_a$ with the highest similarity and treats it as the aligned initial timestep
- for timesteps before $t_a$, computes the matching reward against the initial state of the target trajectory
- for timesteps after $t_a$, computes the matching reward against the target state at the corresponding (shifted) timestep
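A self-contained sketch of this alignment procedure on per-timestep latent feature vectors (e.g. flattened mean latents of the two trajectories); the window size $b$, the use of cosine similarity, and the clamping at the end of the target trajectory are assumptions:

```python
import torch
import torch.nn.functional as F

def aligned_rewards(imagined, target, b=8):
    """imagined: (T_i, D) imagined latents; target: (T_t, D) connector-predicted latents."""
    T_i, T_t = imagined.shape[0], target.shape[0]
    head = target[:b]                                        # initial b states of the target trajectory
    # 1) similarity of each length-b window of the imagined trajectory to the target's head
    sims = torch.stack([F.cosine_similarity(imagined[t:t + b], head, dim=-1).mean()
                        for t in range(T_i - b + 1)])
    t_a = int(sims.argmax())                                 # 2) aligned initial timestep
    rewards = torch.empty(T_i)
    for t in range(T_i):
        if t < t_a:
            ref = target[0]                                  # 3) before t_a: match the target's initial state
        else:
            ref = target[min(t - t_a, T_t - 1)]              # 4) from t_a on: match the corresponding target state
        rewards[t] = F.cosine_similarity(imagined[t], ref, dim=0)
    return rewards
```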