CORRO

General Structure

CORRO designs a bi-level task encoder to generate a robust task representation from transitions (the context)

$$z = E_{\theta_{2}}(\{ z_{i} \}_{i = 1}^{k}) \qquad z_{i} = E_{\theta_{1}}(x_{i}) = E_{\theta_{1}}(s_{i},\ a_{i},\ r_{i},\ s_{i}')$$

where the low-level encoder $E_{\theta_{1}}$ extracts a latent representation from a single transition tuple, and the high-level encoder $E_{\theta_{2}}$ aggregates all the latent codes $z_{i}$ of a context $c = \{ (s_{i},\ a_{i},\ r_{i},\ s_{i}') \}_{i = 1}^{k}$ through attention into a task representation $z$

The task representation is further used to condition the policy (contextual behavior learning), so that the learned policy can quickly adapt to unseen tasks
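
To make the structure concrete, here is a minimal PyTorch sketch of the bi-level encoder: a low-level MLP maps each transition tuple to a code $z_i$, and a high-level attention module pools the $k$ codes into the task representation $z$. The layer sizes, the learned-query attention pooling, and all hyperparameters are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn


class TransitionEncoder(nn.Module):
    """Low-level encoder E_theta1: maps one transition (s, a, r, s') to z_i."""

    def __init__(self, state_dim, action_dim, latent_dim=32, hidden_dim=128):
        super().__init__()
        in_dim = 2 * state_dim + action_dim + 1  # (s, a, r, s'), reward is a scalar
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, s, a, r, s_next):
        # r is expected to have shape (..., 1)
        return self.net(torch.cat([s, a, r, s_next], dim=-1))


class ContextAggregator(nn.Module):
    """High-level encoder E_theta2: attention over the k transition codes z_i."""

    def __init__(self, latent_dim=32, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, latent_dim))  # learned pooling query

    def forward(self, z_i):  # z_i: (batch, k, latent_dim)
        q = self.query.expand(z_i.shape[0], -1, -1)
        z, _ = self.attn(q, z_i, z_i)  # attend over the k codes
        return z.squeeze(1)            # task representation z: (batch, latent_dim)
```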

Task Representation Learning

The low-level encoder is trained with a contrastive learning objective inspired by InfoNCE

$$\max_{\theta_{1}} \sum_{M \in \mathcal{M}} \sum_{x,\ x' \in \mathcal{D}(M)} \log \frac{\exp S(z,\ z')}{\exp S(z,\ z') + \sum_{\tilde{x}} \exp S(z,\ \tilde{z})} \qquad \tilde{s},\ \tilde{a} = s,\ a \quad \tilde{r},\ \tilde{s}' \sim p_{\text{neg}}(\tilde{r},\ \tilde{s}' \mid s,\ a)$$

where the negative transition sample $\tilde{x}$ in the negative pair $(x,\ \tilde{x})$ is generated from an approximated distribution

$$p_{\text{neg}}(\tilde{r},\ \tilde{s}' \mid s,\ a) \approx \mathbb{E}_{M \sim p(M)} \Big[ p(\tilde{s}' \mid s,\ a,\ M)\, p(\tilde{r} \mid s,\ a,\ M) \Big]$$

The objective forces the task encoder to capture the features of the task dynamics and reward while ignoring the variation caused by differences in the data-collection policy, since each negative sample shares the same $(s,\ a)$ as the original sample
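
Below is a minimal sketch of the resulting InfoNCE-style loss, assuming a cosine-similarity score $S$ with temperature $\tau$ (the paper's exact similarity function and batching scheme may differ). The positive code $z'$ comes from another transition of the same task, while the negative codes share $(s,\ a)$ but use $(\tilde{r},\ \tilde{s}')$ drawn from $p_{\text{neg}}$.

```python
import torch
import torch.nn.functional as F


def infonce_loss(z, z_pos, z_neg, tau=0.1):
    """z, z_pos: (B, D) codes of x and a positive x' from the same task.
    z_neg: (B, N, D) codes of N negatives per anchor.
    Returns the mean InfoNCE loss (negative log-softmax of the positive)."""
    z = F.normalize(z, dim=-1)
    z_pos = F.normalize(z_pos, dim=-1)
    z_neg = F.normalize(z_neg, dim=-1)

    s_pos = (z * z_pos).sum(-1, keepdim=True) / tau       # (B, 1)
    s_neg = torch.einsum("bd,bnd->bn", z, z_neg) / tau    # (B, N)
    logits = torch.cat([s_pos, s_neg], dim=-1)            # positive sits at index 0
    labels = torch.zeros(logits.shape[0], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```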

Negative Pair Generation

CORRO proposes the following approaches for generating the aforementioned negative pairs

Generative Modeling

A CVAE, with encoder $q_{\omega}(z \mid s,\ a,\ r,\ s')$ and decoder $p_{\xi}(r,\ s' \mid s,\ a,\ z)$, is adopted as the generative model for approximating the distribution

$$\max_{\omega,\ \xi} \mathbb{E}_{M \sim \mathcal{M}} \mathbb{E}_{(s,\ a,\ r,\ s') \sim \mathcal{D}(M)} \left[ \mathbb{E}_{z \sim q_{\omega}(z \mid s,\ a,\ r,\ s')} \log p_{\xi}(r,\ s' \mid s,\ a,\ z) - D_{\text{KL}} \Big( q_{\omega}(z \mid s,\ a,\ r,\ s')\ \|\ p(z) \Big) \right]$$

where the latent vector $z$ follows a standard Gaussian prior $p(z)$
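
A minimal sketch of such a CVAE is given below: a Gaussian encoder produces the mean and log-variance of $z$, a decoder reconstructs $(r,\ s')$ from $(s,\ a,\ z)$, and negative samples are generated by decoding $z$ drawn from the standard Gaussian prior. The MSE reconstruction term (a Gaussian log-likelihood up to constants), the network sizes, and the `sample_negative` helper are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransitionCVAE(nn.Module):
    def __init__(self, state_dim, action_dim, z_dim=8, hidden=128):
        super().__init__()
        self.z_dim = z_dim
        self.enc = nn.Sequential(                          # q_omega(z | s, a, r, s')
            nn.Linear(2 * state_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * z_dim),                  # mean and log-variance
        )
        self.dec = nn.Sequential(                          # p_xi(r, s' | s, a, z)
            nn.Linear(state_dim + action_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + state_dim),              # predicted (r, s')
        )

    def forward(self, s, a, r, s_next):
        """Return the negative ELBO (reconstruction + KL) for one batch."""
        mu, logvar = self.enc(torch.cat([s, a, r, s_next], -1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        recon = self.dec(torch.cat([s, a, z], -1))
        recon_loss = F.mse_loss(recon, torch.cat([r, s_next], -1))
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return recon_loss + kl

    @torch.no_grad()
    def sample_negative(self, s, a):
        """Draw (r~, s'~) ~ p_neg by decoding z sampled from the prior p(z)."""
        z = torch.randn(s.shape[0], self.z_dim, device=s.device)
        out = self.dec(torch.cat([s, a, z], -1))
        return out[..., :1], out[..., 1:]                  # r~, s'~
```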

Reward Randomization

When tasks differ only in their reward functions, negative pairs can be generated by adding random noise to the reward

$$\tilde{r} = r + \nu \qquad \nu \sim p(\nu)$$

Although this does not approximate the true distribution, it provides an infinitely large space for generating diverse rewards
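
As a sketch, with $p(\nu)$ taken to be a zero-mean Gaussian (an illustrative choice of noise distribution):

```python
import torch


def randomize_reward(r, sigma=1.0):
    """Create a negative reward r~ = r + nu, with nu ~ N(0, sigma^2)."""
    return r + sigma * torch.randn_like(r)
```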

