Lang4Sim2Real
Data Collection
Two datasets, one from the source domain (simulation) and one from the target domain (reality), are available for few-shot visual IL
| Dataset | Domain | Scale | Collection Cost |
|---|---|---|---|
| Source | Simulation | Multiple tasks, large scale | Cheap |
| Target | Reality | Single task, small scale | Expensive |
Both datasets take the form of expert trajectories, each made up of (a minimal data layout is sketched after this list)
- robot proprioceptive state
- image (RGB) observation
- language description of observation
- robot action
- language task instruction
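A minimal sketch of how one trajectory step could be stored; field names are illustrative, not taken from the paper:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrajectoryStep:
    """One timestep of an expert trajectory (illustrative field names)."""
    proprio_state: np.ndarray   # robot proprioceptive state, e.g. joint positions
    rgb_obs: np.ndarray         # RGB image observation, shape (H, W, 3)
    obs_description: str        # language description of the current observation
    action: np.ndarray          # expert action taken at this step
    task_instruction: str       # language instruction for the whole task
```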
The language descriptions of image observations can be labeled automatically in two ways (a template-based sketch follows this list)
- online annotation: scripted-policy + description templates
- offline annotation: raw trajectories + off-the-shelf VLMs
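A minimal sketch of the template-based online annotation, assuming the scripted policy exposes its current stage; the stage names and templates below are hypothetical:

```python
# Hypothetical stage-to-template mapping for a pick-and-place scripted policy.
DESCRIPTION_TEMPLATES = {
    "reach": "the gripper is moving toward the {obj}",
    "grasp": "the gripper is closing around the {obj}",
    "lift":  "the robot is lifting the {obj}",
    "place": "the robot is placing the {obj} on the {target}",
}

def annotate_step(stage: str, obj: str, target: str) -> str:
    """Return a language description for the current scripted-policy stage."""
    return DESCRIPTION_TEMPLATES[stage].format(obj=obj, target=target)

# Example: annotate_step("lift", obj="red block", target="plate")
#   -> "the robot is lifting the red block"
```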
Image-Language Pretraining
The language descriptions can then be used to learn a domain-invariant visual representation.

Instead of relying on pretrained VLMs, Lang4Sim2Real uses language as a supervision signal for the image encoder, in two variants.
Regression
The language embedding can be used to shape the image embedding space through regression: the image embedding is passed through a temporary, trainable adaptor and regressed onto the frozen language-model embedding of the corresponding description. This objective effectively encourages the image encoder to reflect the LLM embedding space and to extract task-relevant, semantic aspects of the image.
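A plausible form of this objective, assuming a mean-squared-error regression loss, with $f_\theta$ the image encoder, $h_\phi$ the temporary adaptor, and $g(\ell)$ the frozen language embedding of description $\ell$:

$$\mathcal{L}_{\text{reg}}(\theta, \phi) \;=\; \mathbb{E}_{(I,\ell)} \left[ \big\| h_\phi\!\left(f_\theta(I)\right) - g(\ell) \big\|_2^2 \right]$$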
Contrastive Learning
The other variant adopts a contrastive learning paradigm, using the distance between language descriptions as the target similarity: for any two observations, the similarity of their image embeddings is supervised by the BLEURT distance of their descriptions, normalized into $[0, 1]$ across all possible pairs in the dataset.
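A sketch of a soft contrastive objective consistent with this description, assuming cosine similarity between image embeddings $z_i = f_\theta(I_i)$ and a normalized BLEURT-based target $s_{ij} \in [0,1]$ for the description pair $(\ell_i, \ell_j)$; the exact loss form in the paper may differ:

$$\mathcal{L}_{\text{con}}(\theta) \;=\; \mathbb{E}_{(i,j)} \left[ \big( \cos\!\left(z_i, z_j\right) - s_{ij} \big)^2 \right]$$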
Multi-Task / Domain BC
Based on the learned domain-invariant representation, the policy is trained with behavior cloning to maximize the log-likelihood of the expert actions.
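In symbols, a standard multi-task / multi-domain BC objective of this kind (notation assumed here: $\pi_\psi$ policy, $f_\theta$ frozen image encoder, $s$ robot state, $c$ task instruction):

$$\max_{\psi} \;\; \mathbb{E}_{(I, s, c, a) \,\sim\, \mathcal{D}_{\text{sim}} \cup \mathcal{D}_{\text{real}}} \left[ \log \pi_\psi\!\left(a \mid f_\theta(I), s, c\right) \right]$$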

In the policy network, the image encoder (ResNet-18) is kept largely frozen and is conditioned on the task-instruction embedding through FiLM. The output action is predicted from the resulting image embedding together with the robot proprioceptive state; a structural sketch follows.
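A minimal PyTorch sketch of this architecture under the assumptions above: a frozen ResNet-18 backbone, FiLM conditioning applied (here, only at the final feature map) on the instruction embedding, and an MLP action head over image features and robot state; dimensions are illustrative:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FiLM(nn.Module):
    """Feature-wise linear modulation of a feature map by a conditioning vector."""
    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        self.gamma = nn.Linear(cond_dim, num_channels)
        self.beta = nn.Linear(cond_dim, num_channels)

    def forward(self, feat, cond):
        g = self.gamma(cond).unsqueeze(-1).unsqueeze(-1)
        b = self.beta(cond).unsqueeze(-1).unsqueeze(-1)
        return g * feat + b

class LangConditionedPolicy(nn.Module):
    def __init__(self, instr_dim=768, state_dim=8, action_dim=7):
        super().__init__()
        backbone = resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])  # conv feature map
        for p in self.encoder.parameters():           # keep pretrained encoder frozen
            p.requires_grad = False
        self.film = FiLM(instr_dim, 512)              # condition on instruction embedding
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Sequential(                    # predict action from image + state
            nn.Linear(512 + state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, rgb, instr_emb, state):
        feat = self.encoder(rgb)                      # (B, 512, H', W')
        feat = self.film(feat, instr_emb)
        z = self.pool(feat).flatten(1)                # (B, 512)
        return self.head(torch.cat([z, state], dim=-1))
```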
