CORRO
General Structure
CORRO designs a bi-level task encoder to generate robust task representations from transitions (context)
$$z = E_{\theta_2}\big(\{z_i\}_{i=1}^{k}\big), \qquad z_i = E_{\theta_1}(x_i) = E_{\theta_1}(s_i, a_i, r_i, s_i')$$
where the low-level encoder $E_{\theta_1}$ extracts a latent representation from a single transition tuple, and the high-level encoder $E_{\theta_2}$ aggregates the latent codes $z_i$ of a context $c = \{(s_i, a_i, r_i, s_i')\}_{i=1}^{k}$ into the task representation $z$ via attention
The task representation then conditions contextual policy learning, enabling quick adaptation to unseen tasks
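The bi-level encoder can be sketched in numpy. The dimensions, the single-linear-layer encoders, and the dot-product attention with a learnable query vector are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
S_DIM, A_DIM, Z_DIM, K = 4, 2, 8, 16  # hypothetical dimensions / context size

# Low-level encoder E_theta1: a single linear layer + tanh as a stand-in MLP
W1 = rng.normal(size=(S_DIM + A_DIM + 1 + S_DIM, Z_DIM)) * 0.1

def encode_transition(s, a, r, s_next):
    """E_theta1: map one transition tuple (s, a, r, s') to a latent code z_i."""
    x = np.concatenate([s, a, [r], s_next])
    return np.tanh(x @ W1)

# High-level encoder E_theta2: attention-weighted pooling of the k codes
query = rng.normal(size=Z_DIM)  # learnable query vector (assumption)

def aggregate(latents):
    """E_theta2: softmax attention over {z_i} -> task representation z."""
    scores = latents @ query                  # (k,) attention logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax attention weights
    return weights @ latents                  # (Z_DIM,)

context = [(rng.normal(size=S_DIM), rng.normal(size=A_DIM),
            rng.normal(), rng.normal(size=S_DIM)) for _ in range(K)]
z = aggregate(np.stack([encode_transition(*x) for x in context]))
```

The attention pooling makes the task representation permutation-invariant in the context transitions, which matches aggregating an unordered set of tuples.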
Task Representation Learning
The low-level encoder is trained with a contrastive objective inspired by InfoNCE
$$\max_{\theta_1} \sum_{M \in \mathcal{M}} \sum_{x, x' \in D(M)} \log \frac{\exp S(z, z')}{\exp S(z, z') + \sum_{\tilde{x}} \exp S(z, \tilde{z})}$$
$$\tilde{s}, \tilde{a} = s, a \qquad \tilde{r}, \tilde{s}' \sim p_{\text{neg}}(\tilde{r}, \tilde{s}' \mid s, a)$$
where the negative transition sample $\tilde{x}$ in the negative pair $(x, \tilde{x})$ is drawn from an approximation of the distribution
$$p_{\text{neg}}(\tilde{r}, \tilde{s}' \mid s, a) \approx \mathbb{E}_{M \sim p(M)}\big[\, p(\tilde{s}' \mid s, a, M)\, p(\tilde{r} \mid s, a, M) \,\big]$$
The objective forces the task encoder to capture the features of task dynamics and reward while ignoring variation caused by differences in the data-collection policy, since each negative sample shares the same $(s, a)$ with the original sample
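A minimal sketch of this objective for a single anchor code, assuming cosine similarity for $S$ and a hypothetical temperature (both are my choices, not specified here): the positive $z'$ encodes another transition from the same task, and each negative $\tilde{z}$ encodes a transition sharing $(s, a)$ but with $(\tilde{r}, \tilde{s}')$ drawn from $p_{\text{neg}}$.

```python
import numpy as np

def info_nce_loss(z, z_pos, z_negs, temperature=0.1):
    """InfoNCE for one anchor latent code z (negated log-softmax of the
    positive pair's similarity over positive + negatives)."""
    def sim(u, v):  # cosine similarity as a stand-in for S(., .)
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    logits = np.array([sim(z, z_pos)] + [sim(z, zn) for zn in z_negs])
    logits /= temperature
    logits -= logits.max()                       # numerical stability
    return np.log(np.exp(logits).sum()) - logits[0]

anchor = np.array([1.0, 0.0])
# anchor close to the positive, far from the negative -> small loss
loss_easy = info_nce_loss(anchor, np.array([0.9, 0.1]), [np.array([0.0, 1.0])])
# anchor close to the negative instead -> large loss
loss_hard = info_nce_loss(anchor, np.array([0.0, 1.0]), [np.array([0.9, 0.1])])
```

Minimizing this loss pulls codes of same-task transitions together and pushes apart codes of transitions that differ only in $(\tilde{r}, \tilde{s}')$, which is exactly what isolates dynamics/reward information from behavior-policy information.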
Negative Pair Generation
CORRO proposes the following approaches for generating the aforementioned negative pairs
Generative Modeling
A CVAE with encoder $q_\omega(z \mid s, a, r, s')$ and decoder $p_\xi(r, s' \mid s, a, z)$ is adopted as the generative model for approximating this distribution, trained by maximizing the ELBO
$$\max_{\omega, \xi} \; \mathbb{E}_{M \sim p(M)}\, \mathbb{E}_{(s, a, r, s') \sim D(M)} \Big[ \mathbb{E}_{z \sim q_\omega(z \mid s, a, r, s')} \log p_\xi(r, s' \mid s, a, z) - D_{KL}\big( q_\omega(z \mid s, a, r, s') \,\|\, p(z) \big) \Big]$$
where the latent vector $z$ has a standard Gaussian prior $p(z)$
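A toy numpy sketch of the single-sample ELBO and of drawing a negative $(\tilde{r}, \tilde{s}')$ that keeps $(s, a)$ fixed; the linear stand-in networks, Gaussian decoder with unit variance, and dimensions are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
S_DIM, A_DIM, Z_DIM = 4, 2, 3  # hypothetical dimensions

# Linear stand-ins for the CVAE networks (assumptions, not the paper's nets)
We = rng.normal(size=(S_DIM + A_DIM + 1 + S_DIM, 2 * Z_DIM)) * 0.1
Wd = rng.normal(size=(S_DIM + A_DIM + Z_DIM, 1 + S_DIM)) * 0.1

def encode(s, a, r, s_next):
    """q_w(z | s, a, r, s'): mean and log-variance of a diagonal Gaussian."""
    h = np.concatenate([s, a, [r], s_next]) @ We
    return h[:Z_DIM], h[Z_DIM:]

def decode(s, a, z):
    """p_xi(r, s' | s, a, z): predict reward and next state jointly."""
    return np.concatenate([s, a, z]) @ Wd

def elbo(s, a, r, s_next):
    """Single-sample ELBO: reconstruction log-likelihood minus KL(q || N(0, I))."""
    mu, log_var = encode(s, a, r, s_next)
    z = mu + np.exp(0.5 * log_var) * rng.normal(size=Z_DIM)  # reparameterization
    recon = -0.5 * np.sum((np.concatenate([[r], s_next]) - decode(s, a, z)) ** 2)
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return recon - kl

def sample_negative(s, a):
    """Negative generation: decode a prior sample z ~ N(0, I), keeping (s, a)."""
    out = decode(s, a, rng.normal(size=Z_DIM))
    return out[0], out[1:]  # (r~, s~')

s, a = rng.normal(size=S_DIM), rng.normal(size=A_DIM)
value = elbo(s, a, 0.5, rng.normal(size=S_DIM))
r_neg, s_next_neg = sample_negative(s, a)
```

Because the prior sample $z$ carries no information about the original task, decoding it yields a plausible $(\tilde{r}, \tilde{s}')$ that mimics a draw from the mixture over tasks in the approximation above.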
Reward Randomization
When tasks differ only in their reward functions, negative pairs can be generated by adding random noise to the reward
$$\tilde{r} = r + \nu, \qquad \nu \sim p(\nu)$$
Though it cannot approximate the true distribution, it provides an infinitely large space for generating diverse rewards
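This variant is trivial to implement; a sketch in which $p(\nu)$ is taken to be a zero-mean Gaussian with scale `sigma` (the noise family and scale are design choices, i.e. assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
S_DIM, A_DIM = 4, 2  # hypothetical dimensions

def reward_randomized_negative(s, a, r, s_next, sigma=1.0):
    """Build a negative transition sharing (s, a, s') with the original but
    with a perturbed reward: r~ = r + nu, nu ~ N(0, sigma^2) here."""
    return s, a, r + sigma * rng.normal(), s_next

s, a, s_next = rng.normal(size=S_DIM), rng.normal(size=A_DIM), rng.normal(size=S_DIM)
negatives = [reward_randomized_negative(s, a, 1.0, s_next) for _ in range(8)]
```

Since $(s, a, s')$ is copied verbatim, these negatives differ from the anchor only in reward, so the contrastive objective cannot distinguish them by dynamics, only by the reward channel.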