CORRO

General Structure

CORRO designs a bi-level task encoder to generate a robust task representation from transitions (the context)

$$z = E_{\theta_{2}}(\{ z_{i} \}_{i = 1}^{k}) \qquad z_{i} = E_{\theta_{1}}(x_{i}) = E_{\theta_{1}}(s_{i},\ a_{i},\ r_{i},\ s_{i}')$$

where the low-level encoder $E_{\theta_{1}}$ extracts a latent representation from a single transition tuple, and the high-level encoder $E_{\theta_{2}}$ aggregates all the latent codes $z_{i}$ of a context $c = \{ (s_{i},\ a_{i},\ r_{i},\ s_{i}') \}_{i = 1}^{k}$ through attention into a task representation $z$

The task representation is further used to condition the policy (contextual behavior learning), so that the learned policy can quickly adapt to unseen tasks
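
To make the structure concrete, here is a minimal PyTorch sketch of the bi-level encoder: a low-level MLP maps each transition tuple to a code $z_i$, and a high-level attention module pools the $k$ codes into the task representation $z$. The layer sizes, the learned-query attention pooling, and all hyperparameters are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn


class TransitionEncoder(nn.Module):
    """Low-level encoder E_theta1: maps one transition (s, a, r, s') to z_i."""

    def __init__(self, state_dim, action_dim, latent_dim=32, hidden_dim=128):
        super().__init__()
        in_dim = 2 * state_dim + action_dim + 1  # (s, a, r, s'), reward is a scalar
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, s, a, r, s_next):
        # r is expected to have shape (..., 1)
        return self.net(torch.cat([s, a, r, s_next], dim=-1))


class ContextAggregator(nn.Module):
    """High-level encoder E_theta2: attention over the k transition codes z_i."""

    def __init__(self, latent_dim=32, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, latent_dim))  # learned pooling query

    def forward(self, z_i):  # z_i: (batch, k, latent_dim)
        q = self.query.expand(z_i.shape[0], -1, -1)
        z, _ = self.attn(q, z_i, z_i)  # attend over the k codes
        return z.squeeze(1)            # task representation z: (batch, latent_dim)
```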

Task Representation Learning

The low-level encoder is trained with a contrastive learning objective inspired by InfoNCE

$$\max_{\theta_{1}} \sum_{M \in \mathcal{M}} \sum_{x,\ x' \in \mathcal{D}(M)} \log \frac{\exp S(z,\ z')}{\exp S(z,\ z') + \sum_{\tilde{x}} \exp S(z,\ \tilde{z})} \qquad \tilde{s},\ \tilde{a} = s,\ a \quad \tilde{r},\ \tilde{s}' \sim p_{\text{neg}}(\tilde{r},\ \tilde{s}' \mid s,\ a)$$

where the negative transition sample $\tilde{x}$ in the negative pair $(x,\ \tilde{x})$ is generated from an approximated distribution

$$p_{\text{neg}}(\tilde{r},\ \tilde{s}' \mid s,\ a) \approx \mathbb{E}_{M \sim p(M)} \Big[ p(\tilde{s}' \mid s,\ a,\ M)\, p(\tilde{r} \mid s,\ a,\ M) \Big]$$

The objective forces the task encoder to capture the features of the task dynamics and reward while ignoring the variation caused by differences in the data-collection policy, since each negative sample shares the same $(s,\ a)$ as the original sample
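
Below is a minimal sketch of the resulting InfoNCE-style loss, assuming a cosine-similarity score $S$ with temperature $\tau$ (the paper's exact similarity function and batching scheme may differ). The positive code $z'$ comes from another transition of the same task, while the negative codes share $(s,\ a)$ but use $(\tilde{r},\ \tilde{s}')$ drawn from $p_{\text{neg}}$.

```python
import torch
import torch.nn.functional as F


def infonce_loss(z, z_pos, z_neg, tau=0.1):
    """z, z_pos: (B, D) codes of x and a positive x' from the same task.
    z_neg: (B, N, D) codes of N negatives per anchor.
    Returns the mean InfoNCE loss (negative log-softmax of the positive)."""
    z = F.normalize(z, dim=-1)
    z_pos = F.normalize(z_pos, dim=-1)
    z_neg = F.normalize(z_neg, dim=-1)

    s_pos = (z * z_pos).sum(-1, keepdim=True) / tau       # (B, 1)
    s_neg = torch.einsum("bd,bnd->bn", z, z_neg) / tau    # (B, N)
    logits = torch.cat([s_pos, s_neg], dim=-1)            # positive sits at index 0
    labels = torch.zeros(logits.shape[0], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```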

Negative Pair Generation

CORRO proposes the following approaches for generating the aforementioned negative pairs

Generative Modeling

A CVAE, with encoder $q_{\omega}(z \mid s,\ a,\ r,\ s')$ and decoder $p_{\xi}(r,\ s' \mid s,\ a,\ z)$, is adopted as the generative model for approximating the distribution

$$\max_{\omega,\ \xi} \mathbb{E}_{M \sim \mathcal{M}} \mathbb{E}_{(s,\ a,\ r,\ s') \sim \mathcal{D}(M)} \left[ \mathbb{E}_{z \sim q_{\omega}(z \mid s,\ a,\ r,\ s')} \log p_{\xi}(r,\ s' \mid s,\ a,\ z) - D_{\text{KL}} \Big( q_{\omega}(z \mid s,\ a,\ r,\ s')\ \|\ p(z) \Big) \right]$$

where the latent vector $z$ follows a standard Gaussian prior $p(z)$
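
A minimal sketch of such a CVAE is given below: a Gaussian encoder produces the mean and log-variance of $z$, a decoder reconstructs $(r,\ s')$ from $(s,\ a,\ z)$, and negative samples are generated by decoding $z$ drawn from the standard Gaussian prior. The MSE reconstruction term (a Gaussian log-likelihood up to constants), the network sizes, and the `sample_negative` helper are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransitionCVAE(nn.Module):
    def __init__(self, state_dim, action_dim, z_dim=8, hidden=128):
        super().__init__()
        self.z_dim = z_dim
        self.enc = nn.Sequential(                          # q_omega(z | s, a, r, s')
            nn.Linear(2 * state_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * z_dim),                  # mean and log-variance
        )
        self.dec = nn.Sequential(                          # p_xi(r, s' | s, a, z)
            nn.Linear(state_dim + action_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + state_dim),              # predicted (r, s')
        )

    def forward(self, s, a, r, s_next):
        """Return the negative ELBO (reconstruction + KL) for one batch."""
        mu, logvar = self.enc(torch.cat([s, a, r, s_next], -1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        recon = self.dec(torch.cat([s, a, z], -1))
        recon_loss = F.mse_loss(recon, torch.cat([r, s_next], -1))
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return recon_loss + kl

    @torch.no_grad()
    def sample_negative(self, s, a):
        """Draw (r~, s'~) ~ p_neg by decoding z sampled from the prior p(z)."""
        z = torch.randn(s.shape[0], self.z_dim, device=s.device)
        out = self.dec(torch.cat([s, a, z], -1))
        return out[..., :1], out[..., 1:]                  # r~, s'~
```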

Reward Randomization

When tasks differ only in their reward functions, negative pairs can be generated by adding random noise to the reward

$$\tilde{r} = r + \nu \qquad \nu \sim p(\nu)$$

Although this does not approximate the true distribution, it provides an infinitely large space for generating diverse rewards
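
As a sketch, with $p(\nu)$ taken to be a zero-mean Gaussian (an illustrative choice of noise distribution):

```python
import torch


def randomize_reward(r, sigma=1.0):
    """Create a negative reward r~ = r + nu, with nu ~ N(0, sigma^2)."""
    return r + sigma * torch.randn_like(r)
```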

