LanGWM

Observation Representation Learning
The proposed language-grounded representation learning has the following sub-modules:
- object instance masking
- randomly select an object instance, then mask its smallest rectangular bounding box
- add additional random margins to the bounding box
- mask up to 3 objects and stop early if the masked region reaches 75% of the image
- use spatial jitter, Gaussian blur, color jitter and grayscale data augmentations
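The masking steps above can be sketched as follows. The instance masks, margin size, and function names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def mask_object_instances(image, instance_masks, rng,
                          max_objects=3, area_cap=0.75, max_margin=8):
    """Mask up to `max_objects` randomly chosen instances with padded
    bounding boxes, stopping early once the masked area reaches `area_cap`.

    `instance_masks` is a hypothetical list of boolean (H, W) arrays, one
    per object instance; the margin range `max_margin` is an assumption.
    """
    h, w = image.shape[:2]
    masked = image.copy()
    region = np.zeros((h, w), dtype=bool)
    order = rng.permutation(len(instance_masks))
    for idx in order[:max_objects]:
        ys, xs = np.nonzero(instance_masks[idx])
        if len(ys) == 0:
            continue
        # smallest rectangular bounding box plus a random margin per side
        y0 = max(ys.min() - rng.integers(0, max_margin + 1), 0)
        y1 = min(ys.max() + rng.integers(0, max_margin + 1), h - 1)
        x0 = max(xs.min() - rng.integers(0, max_margin + 1), 0)
        x1 = min(xs.max() + rng.integers(0, max_margin + 1), w - 1)
        masked[y0:y1 + 1, x0:x1 + 1] = 0
        region[y0:y1 + 1, x0:x1 + 1] = True
        if region.mean() >= area_cap:  # stop early at 75% of the image
            break
    return masked, region
```

The spatial jitter, blur, and color augmentations would be applied on top of this, e.g. via standard torchvision transforms.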
- language description generation
- use language templates to generate the description of the masked object, for example
  - "If you look {distance} in the {direction}, you will see {object}"
  - "There is {object} in the {direction} {distance}"
  - "The {object} is approximately {distance}, {direction} from here"
  - …
- The template is parameterised by
  - {object} ⇒ semantic class of the object
  - {direction} ⇒ average horizontal position of the object center
  - {distance} ⇒ average distance of the object regions
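A minimal sketch of this template filling. The direction and distance discretizations below are assumptions for illustration; the paper's exact bins are not given here:

```python
import random

TEMPLATES = [
    "If you look {distance} in the {direction}, you will see {object}",
    "There is {object} in the {direction} {distance}",
    "The {object} is approximately {distance}, {direction} from here",
]

def describe(object_class, center_x, image_width, depth_mean):
    """Fill a random template from the masked object's statistics."""
    # direction from the average horizontal position of the object center
    # (three hypothetical bins: left / front / right)
    thirds = ["left", "front", "right"]
    direction = thirds[min(int(3 * center_x / image_width), 2)]
    # distance from the average depth of the object region
    # (two hypothetical bins with an assumed 5 m threshold)
    distance = "nearby" if depth_mean < 5.0 else "far away"
    return random.choice(TEMPLATES).format(
        object=object_class, direction=direction, distance=distance)
```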
- use BERT to extract the feature embeds from the language descriptions
- feed constant token values of the empty description at evaluation and test time
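The train/eval switch for the language embeddings can be sketched like this, with `bert_embed` standing in as a hypothetical frozen BERT feature extractor (the fixed token length is also an assumption):

```python
import numpy as np

EMBED_DIM = 768   # BERT-base hidden size
MAX_TOKENS = 16   # hypothetical fixed description length

def language_tokens(description, bert_embed, training):
    """Return language token embeddings for the world model.

    `bert_embed` is a hypothetical callable mapping a string to a
    (MAX_TOKENS, EMBED_DIM) array, e.g. a frozen BERT encoder.
    At evaluation/test time the masked-object description is not
    available, so constant tokens of the empty description are fed.
    """
    if training:
        return bert_embed(description)
    return bert_embed("")  # constant tokens of the empty description
```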
- object instance MAE
- use early convolution layer and apply the masking in the convolutional feature maps
- convert feature maps into a sequence of patches (tokens) with a patch size of 1
- use ViT encoder to extract the grounded visual features from concatenated image + language tokens
- discard the concatenated language tokens after the encoder and feed only the visual tokens (with their positional embeddings) to the decoder
- use ViT decoder to reconstruct depth and predict reward from visual tokens
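The encoder path of the object-instance MAE can be sketched at the shape level as follows; `vit_encoder` is a hypothetical callable over a token sequence, and the projection of language embeddings to the channel width is assumed to have happened already:

```python
import numpy as np

def encode(feature_map, token_mask, lang_tokens, vit_encoder):
    """Shape-level sketch of the object-instance MAE encoder path.

    feature_map : (C, H, W) output of the early convolution layers
    token_mask  : boolean (H*W,) array, True where a token is masked out
    lang_tokens : (L, C) language embeddings, assumed projected to width C
    vit_encoder : hypothetical callable over an (N, C) token sequence
    """
    c, h, w = feature_map.shape
    # patch size 1: every spatial position becomes one token
    tokens = feature_map.reshape(c, h * w).T             # (H*W, C)
    visible = tokens[~token_mask]                        # drop masked tokens
    joint = np.concatenate([visible, lang_tokens], 0)    # image + language
    encoded = vit_encoder(joint)
    # slice out: discard language tokens, keep only the visual tokens
    return encoded[: len(visible)]
```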
All modules are optimized using the mean squared error losses of the depth reconstruction and the reward prediction
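A minimal sketch of this objective, assuming a uniform weighting between the two terms (the paper's exact weights are not stated here):

```python
import numpy as np

def reconstruction_loss(depth_pred, depth_true, reward_pred, reward_true):
    """Mean squared error over depth reconstruction and reward prediction;
    the uniform 1:1 weighting of the two terms is an assumption."""
    mse = lambda a, b: np.mean((a - b) ** 2)
    return mse(depth_pred, depth_true) + mse(reward_pred, reward_true)
```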
| Module | Definition |
|---|---|
| Object Masking + Language Description | |
| Early Convolution | |
| Convolution Tokens Masking | |
| Language Embedding | |
| Tokens Concatenation | |
| MAE encoder | |
| Slice Out | |
| MAE decoder | |
Predictive World Model + Behavior Learning
The future predictive world model contains the following components
| Component | Type | Definition |
|---|---|---|
| Representation Model | Inference | |
| Transition Model | Generation | |
The observation representation model and the predictive world model are optimized jointly
The actor and critic are learned in the imagined latent space
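A Dreamer-style imagination rollout over the learned latent dynamics can be sketched as below; `actor` and `transition` are hypothetical callables standing in for the learned networks, and no real environment steps are taken:

```python
import numpy as np

def imagine_rollout(z0, actor, transition, horizon):
    """Roll out the learned transition model from a starting latent z0,
    selecting actions with the actor entirely in latent space."""
    zs, actions = [z0], []
    z = z0
    for _ in range(horizon):
        a = actor(z)
        z = transition(z, a)  # predict the next latent state
        actions.append(a)
        zs.append(z)
    return zs, actions
```

The critic is then regressed toward value estimates over these imagined latents, and the actor is updated to maximize them.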
http://example.com/2024/09/08/LanGWM/