Lang4Sim2Real

Data Collection

Two datasets, spanning the source domain (simulation) and the target domain (reality), are available for few-shot visual IL:

| Dataset | Domain | Scale | Collection Cost |
| --- | --- | --- | --- |
| $\mathcal{D}^{\mathrm{s}}$ | Source (Simulation) | Multiple Tasks + Large Scale | Cheap |
| $\mathcal{D}_{\mathrm{target}}^{\mathrm{t}}$ | Target (Reality) | Single Task + Small Scale | Expensive |

Both datasets consist of expert trajectories $\tau = \{ s_{t},\ o_{t},\ l_{t},\ a_{t} \mid l_{\mathrm{task}} \}_{t=0}^{T}$, each of which is made up of (a minimal data layout is sketched after the list):

  1. robot proprioceptive state $s_{t}$
  2. RGB image observation $o_{t}$
  3. language description $l_{t}$ of the observation
  4. robot action $a_{t}$
  5. language task instruction $l_{\mathrm{task}}$
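
A minimal sketch of how one such trajectory could be laid out in code (field names are illustrative, not taken from the paper's codebase):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Step:
    state: np.ndarray   # robot proprioceptive state s_t
    obs: np.ndarray     # RGB image observation o_t, e.g. shape (H, W, 3)
    lang: str           # language description l_t of the observation
    action: np.ndarray  # robot action a_t

@dataclass
class Trajectory:
    steps: list[Step]      # {s_t, o_t, l_t, a_t} for t = 0 .. T
    task_instruction: str  # language task instruction l_task
```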

The language descriptions $l_{t}$ of the image observations can be generated automatically through either (a template-based example is sketched after the list):

  1. online annotation: a scripted policy plus description templates
  2. offline annotation: raw trajectories labeled by off-the-shelf VLMs
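
For the online route, a hypothetical template-based annotator might look like this (stage names and templates are invented for illustration):

```python
# Map the scripted policy's current stage to a language description of the frame.
STAGE_TEMPLATES = {
    "reach": "the gripper is moving towards the {obj}",
    "grasp": "the gripper is closing around the {obj}",
    "lift":  "the {obj} has been lifted off the table",
    "place": "the {obj} is being placed into the {container}",
}

def describe(stage: str, obj: str, container: str = "") -> str:
    # str.format ignores unused keyword arguments, so every template works here.
    return STAGE_TEMPLATES[stage].format(obj=obj, container=container)

print(describe("grasp", obj="red block"))
# -> "the gripper is closing around the red block"
```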

Image-Language Pretraining

The language descriptions can further be used to learn a domain-invariant visual representation.

Instead of using pretrained VLMs, Lang4Sim2Real leverages language as supervision for the image encoder $f_{\mathrm{cnn}}$.

Regression

The language embedding $f_{\mathrm{lang}}(l_{t})$ can be used to shape the image embedding space through regression:

$$\min_{f_{\mathrm{cnn}},\ g}\ \mathbb{E}_{(o_{t},\ l_{t}) \sim \mathcal{D}^{\mathrm{s}} \cup \mathcal{D}_{\mathrm{target}}^{\mathrm{t}}} \Big\| g(f_{\mathrm{cnn}}(o_{t})) - f_{\mathrm{lang}}(l_{t}) \Big\|_{2}^{2}$$

where the temporary adapter $g : \mathbb{R}^{d_{\mathrm{cnn}}} \mapsto \mathbb{R}^{d_{\mathrm{lang}}}$ is trainable. This objective effectively encourages the image encoder to mirror the language-model embedding space and to extract task-relevant, semantic aspects of the image.
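
A minimal PyTorch-style sketch of this regression objective, assuming a frozen language encoder `f_lang` and feature dimensions `d_cnn`/`d_lang` chosen purely for illustration:

```python
import torch
import torch.nn as nn

d_cnn, d_lang = 512, 768      # assumed feature dimensions
g = nn.Linear(d_cnn, d_lang)  # trainable adapter from image space to language space

def regression_loss(f_cnn, f_lang, images, descriptions):
    img_emb = f_cnn(images)              # (B, d_cnn)
    with torch.no_grad():                # language embeddings act as fixed targets
        lang_emb = f_lang(descriptions)  # (B, d_lang)
    # squared L2 distance between adapted image embedding and language embedding
    return ((g(img_emb) - lang_emb) ** 2).sum(dim=-1).mean()
```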

Contrastive Learning

The other variant adopts the contrastive learning paradigm, using the distance between language descriptions as the similarity target:

$$\min_{f_{\mathrm{cnn}}}\ \mathbb{E}_{(o_{s},\ l_{s}) \sim \mathcal{D}^{\mathrm{s}} \cup \mathcal{D}_{\mathrm{target}}^{\mathrm{t}}}\ \mathbb{E}_{(o_{t},\ l_{t}) \sim \mathcal{D}^{\mathrm{s}} \cup \mathcal{D}_{\mathrm{target}}^{\mathrm{t}}} \left[ \frac{f_{\mathrm{cnn}}(o_{s}) \cdot f_{\mathrm{cnn}}(o_{t})}{\| f_{\mathrm{cnn}}(o_{s}) \| \cdot \| f_{\mathrm{cnn}}(o_{t}) \|} - d(l_{s},\ l_{t}) \right]^{2}$$

where $d(\cdot,\ \cdot)$ is the BLEURT distance, normalized into $[0,\ 1]$ across all possible $(l_{s},\ l_{t})$ pairs in the dataset.
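
A sketch of this contrastive variant, assuming the normalized language similarities `lang_sim` are precomputed per pair (e.g. from BLEURT scores):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f_cnn, images_a, images_b, lang_sim):
    # lang_sim: (B,) target values in [0, 1] derived from the description pairs
    z_a = F.normalize(f_cnn(images_a), dim=-1)  # (B, d_cnn), unit-norm image embeddings
    z_b = F.normalize(f_cnn(images_b), dim=-1)
    cos = (z_a * z_b).sum(dim=-1)               # cosine similarity per pair
    return ((cos - lang_sim) ** 2).mean()       # regress cosine sim onto language target
```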

Multi-Task / Domain BC

Based on the learned domain-invariant representation, the policy is trained to maximize the log-likelihood of expert actions:

$$\max_{\theta}\ \mathbb{E}_{(s_{t},\ o_{t},\ a_{t} \mid l_{\mathrm{task}}) \sim \mathcal{D}^{\mathrm{s}} \cup \mathcal{D}_{\mathrm{target}}^{\mathrm{t}}} \Big[ \log \pi_{\theta}\big(a_{t} \mid s_{t},\ f_{\mathrm{cnn}}(o_{t}),\ l_{\mathrm{task}}\big) \Big]$$

In the policy network, the image encoder $f_{\mathrm{cnn}}$ (a ResNet-18) is mostly frozen and combined with the task-instruction embedding through FiLM. The action is then predicted from the image embedding and the robot state, as in the sketch below.
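
A rough sketch of such a policy network, with dimensions and the FiLM placement (here applied to the encoder's output features rather than inside the ResNet blocks) chosen as simplifying assumptions:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation conditioned on the task-instruction embedding."""
    def __init__(self, cond_dim, feat_dim):
        super().__init__()
        self.scale = nn.Linear(cond_dim, feat_dim)
        self.shift = nn.Linear(cond_dim, feat_dim)

    def forward(self, feat, cond):
        return self.scale(cond) * feat + self.shift(cond)

class BCPolicy(nn.Module):
    def __init__(self, f_cnn, d_cnn=512, d_task=768, d_state=10, d_action=7):
        super().__init__()
        self.f_cnn = f_cnn
        for p in self.f_cnn.parameters():  # keep the pretrained image encoder frozen
            p.requires_grad = False
        self.film = FiLM(d_task, d_cnn)
        self.head = nn.Sequential(
            nn.Linear(d_cnn + d_state, 256), nn.ReLU(), nn.Linear(256, d_action)
        )

    def forward(self, obs, state, task_emb):
        feat = self.film(self.f_cnn(obs), task_emb)          # condition image features on l_task
        return self.head(torch.cat([feat, state], dim=-1))   # predict action from features + s_t
```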

