Lang4Sim2Real

Data Collection

Two datasets, spanning the source domain (simulation) and the target domain (reality), are available for few-shot visual IL:

| Dataset | Domain | Scale | Collection Cost |
| --- | --- | --- | --- |
| $\mathcal{D}^{\mathrm{s}}$ | Source (Simulation) | Multiple Tasks + Large Scale | Cheap |
| $\mathcal{D}_{\mathrm{target}}^{\mathrm{t}}$ | Target (Reality) | Single Task + Small Scale | Expensive |

Both datasets consist of expert trajectories $\tau = \{ s_{t},\ o_{t},\ l_{t},\ a_{t} \mid l_{\mathrm{task}} \}_{t=0}^{T}$, each of which is made up of (a minimal data layout is sketched after the list):

  1. robot proprioceptive state $s_{t}$
  2. RGB image observation $o_{t}$
  3. language description $l_{t}$ of the observation
  4. robot action $a_{t}$
  5. language task instruction $l_{\mathrm{task}}$
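
A minimal sketch of how one such trajectory could be laid out in code (field names are illustrative, not taken from the paper's codebase):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Step:
    state: np.ndarray   # robot proprioceptive state s_t
    obs: np.ndarray     # RGB image observation o_t, e.g. shape (H, W, 3)
    lang: str           # language description l_t of the observation
    action: np.ndarray  # robot action a_t

@dataclass
class Trajectory:
    steps: list[Step]      # {s_t, o_t, l_t, a_t} for t = 0 .. T
    task_instruction: str  # language task instruction l_task
```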

The language descriptions $l_{t}$ of the image observations can be generated automatically through either (a template-based example is sketched after the list):

  1. online annotation: a scripted policy plus description templates
  2. offline annotation: raw trajectories labeled by off-the-shelf VLMs
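
For the online route, a hypothetical template-based annotator might look like this (stage names and templates are invented for illustration):

```python
# Map the scripted policy's current stage to a language description of the frame.
STAGE_TEMPLATES = {
    "reach": "the gripper is moving towards the {obj}",
    "grasp": "the gripper is closing around the {obj}",
    "lift":  "the {obj} has been lifted off the table",
    "place": "the {obj} is being placed into the {container}",
}

def describe(stage: str, obj: str, container: str = "") -> str:
    # str.format ignores unused keyword arguments, so every template works here.
    return STAGE_TEMPLATES[stage].format(obj=obj, container=container)

print(describe("grasp", obj="red block"))
# -> "the gripper is closing around the red block"
```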

Image-Language Pretraining

The language descriptions can further be used to learn a domain-invariant visual representation.

Instead of using pretrained VLMs, Lang4Sim2Real leverages language as supervision for the image encoder $f_{\mathrm{cnn}}$.

Regression

The language embedding $f_{\mathrm{lang}}(l_{t})$ can be used to shape the image embedding space through regression:

$$\min_{f_{\mathrm{cnn}},\ g}\ \mathbb{E}_{(o_{t},\ l_{t}) \sim \mathcal{D}^{\mathrm{s}} \cup \mathcal{D}_{\mathrm{target}}^{\mathrm{t}}} \Big\| g(f_{\mathrm{cnn}}(o_{t})) - f_{\mathrm{lang}}(l_{t}) \Big\|_{2}^{2}$$

where the temporary adapter $g : \mathbb{R}^{d_{\mathrm{cnn}}} \mapsto \mathbb{R}^{d_{\mathrm{lang}}}$ is trainable. This objective effectively encourages the image encoder to mirror the language-model embedding space and to extract task-relevant, semantic aspects of the image.
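
A minimal PyTorch-style sketch of this regression objective, assuming a frozen language encoder `f_lang` and feature dimensions `d_cnn`/`d_lang` chosen purely for illustration:

```python
import torch
import torch.nn as nn

d_cnn, d_lang = 512, 768      # assumed feature dimensions
g = nn.Linear(d_cnn, d_lang)  # trainable adapter from image space to language space

def regression_loss(f_cnn, f_lang, images, descriptions):
    img_emb = f_cnn(images)              # (B, d_cnn)
    with torch.no_grad():                # language embeddings act as fixed targets
        lang_emb = f_lang(descriptions)  # (B, d_lang)
    # squared L2 distance between adapted image embedding and language embedding
    return ((g(img_emb) - lang_emb) ** 2).sum(dim=-1).mean()
```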

Contrastive Learning

The other variant adopts the contrastive learning paradigm, using the distance between language descriptions as the similarity target:

$$\min_{f_{\mathrm{cnn}}}\ \mathbb{E}_{(o_{s},\ l_{s}) \sim \mathcal{D}^{\mathrm{s}} \cup \mathcal{D}_{\mathrm{target}}^{\mathrm{t}}}\ \mathbb{E}_{(o_{t},\ l_{t}) \sim \mathcal{D}^{\mathrm{s}} \cup \mathcal{D}_{\mathrm{target}}^{\mathrm{t}}} \left[ \frac{f_{\mathrm{cnn}}(o_{s}) \cdot f_{\mathrm{cnn}}(o_{t})}{\| f_{\mathrm{cnn}}(o_{s}) \| \cdot \| f_{\mathrm{cnn}}(o_{t}) \|} - d(l_{s},\ l_{t}) \right]^{2}$$

where $d(\cdot,\ \cdot)$ is the BLEURT distance, normalized into $[0,\ 1]$ across all possible $(l_{s},\ l_{t})$ pairs in the dataset.
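
A sketch of this contrastive variant, assuming the normalized language similarities `lang_sim` are precomputed per pair (e.g. from BLEURT scores):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f_cnn, images_a, images_b, lang_sim):
    # lang_sim: (B,) target values in [0, 1] derived from the description pairs
    z_a = F.normalize(f_cnn(images_a), dim=-1)  # (B, d_cnn), unit-norm image embeddings
    z_b = F.normalize(f_cnn(images_b), dim=-1)
    cos = (z_a * z_b).sum(dim=-1)               # cosine similarity per pair
    return ((cos - lang_sim) ** 2).mean()       # regress cosine sim onto language target
```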

Multi-Task / Domain BC

Based on the learned domain-invariant representation, the policy is trained to maximize the log-likelihood of expert actions:

$$\max_{\theta}\ \mathbb{E}_{(s_{t},\ o_{t},\ a_{t} \mid l_{\mathrm{task}}) \sim \mathcal{D}^{\mathrm{s}} \cup \mathcal{D}_{\mathrm{target}}^{\mathrm{t}}} \Big[ \log \pi_{\theta}\big(a_{t} \mid s_{t},\ f_{\mathrm{cnn}}(o_{t}),\ l_{\mathrm{task}}\big) \Big]$$

In the policy network, the image encoder $f_{\mathrm{cnn}}$ (a ResNet-18) is mostly frozen and combined with the task-instruction embedding through FiLM. The action is then predicted from the image embedding and the robot state, as in the sketch below.
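
A rough sketch of such a policy network, with dimensions and the FiLM placement (here applied to the encoder's output features rather than inside the ResNet blocks) chosen as simplifying assumptions:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation conditioned on the task-instruction embedding."""
    def __init__(self, cond_dim, feat_dim):
        super().__init__()
        self.scale = nn.Linear(cond_dim, feat_dim)
        self.shift = nn.Linear(cond_dim, feat_dim)

    def forward(self, feat, cond):
        return self.scale(cond) * feat + self.shift(cond)

class BCPolicy(nn.Module):
    def __init__(self, f_cnn, d_cnn=512, d_task=768, d_state=10, d_action=7):
        super().__init__()
        self.f_cnn = f_cnn
        for p in self.f_cnn.parameters():  # keep the pretrained image encoder frozen
            p.requires_grad = False
        self.film = FiLM(d_task, d_cnn)
        self.head = nn.Sequential(
            nn.Linear(d_cnn + d_state, 256), nn.ReLU(), nn.Linear(256, d_action)
        )

    def forward(self, obs, state, task_emb):
        feat = self.film(self.f_cnn(obs), task_emb)          # condition image features on l_task
        return self.head(torch.cat([feat, state], dim=-1))   # predict action from features + s_t
```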

