CSRO
Task Representation Learning
As a context-based OMRL algorithm, CSRO extracts task information from the offline context through a context encoder
$$q_\phi\!\left(z \,\middle|\, c = \{(s_i, a_i, r_i, s_i')\}_{i=1}^{n_c}\right) = \frac{1}{n_c} \sum_{i=1}^{n_c} q_\phi(z \mid s_i, a_i, r_i, s_i')$$
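A minimal PyTorch sketch of this per-transition encoding followed by mean aggregation over the context set; the network sizes and the deterministic (non-probabilistic) embedding are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Encodes a context set of transitions into a single task embedding z."""

    def __init__(self, obs_dim, act_dim, z_dim, hidden=128):
        super().__init__()
        # per-transition encoder q_phi(z | s, a, r, s')
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim + act_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, z_dim),
        )

    def forward(self, s, a, r, s_next):
        # s, a, s_next: (n_c, dim) tensors; r: (n_c, 1) tensor holding the context set
        x = torch.cat([s, a, r, s_next], dim=-1)
        z_per_transition = self.net(x)          # (n_c, z_dim)
        return z_per_transition.mean(dim=0)     # average over the context set
```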
The context encoder can be trained to maximize $I(z; M)$ through the contrastive learning objective from FOCAL
$$\mathcal{L}_{\mathrm{maxMI}}(\phi) = \mathbb{E}_{M_i, M_j \sim p(M)}\, \mathbb{E}_{c_i \sim \mathcal{D}(M_i),\, c_j \sim \mathcal{D}(M_j)}\, \mathbb{E}_{z_i \sim q_\phi(\cdot \mid c_i),\, z_j \sim q_\phi(\cdot \mid c_j)} \left[ \mathbf{1}(M_i = M_j)\, \lVert z_i - z_j \rVert_2^2 + \mathbf{1}(M_i \neq M_j)\, \frac{\beta}{\lVert z_i - z_j \rVert_2^n + \epsilon} \right]$$
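A hedged sketch of computing this distance metric loss over a batch of per-transition embeddings; the batching by integer task labels and the default constants are assumptions:

```python
import torch

def focal_metric_loss(z, task_ids, beta=1.0, n=2, eps=1e-3):
    """Pull embeddings of the same task together, push different tasks apart.

    z:        (B, z_dim) task embeddings sampled from q_phi
    task_ids: (B,) integer task labels M_i
    """
    diff = z.unsqueeze(0) - z.unsqueeze(1)             # (B, B, z_dim) pairwise z_i - z_j
    dist_sq = (diff ** 2).sum(-1)                      # squared L2 distances
    same = (task_ids.unsqueeze(0) == task_ids.unsqueeze(1)).float()
    pos = same * dist_sq                               # 1(M_i = M_j) ||z_i - z_j||_2^2
    neg = (1.0 - same) * beta / (dist_sq ** (n / 2) + eps)  # 1(M_i != M_j) beta / (||.||_2^n + eps)
    # exclude the trivial i == j pairs from the positive term
    mask = 1.0 - torch.eye(len(z), device=z.device)
    return ((pos + neg) * mask).mean()
```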
To alleviate the test-time performance decline caused by context distribution shift, CSRO adopts CLUB to reduce $I(z; s, a)$
$$\mathcal{L}_{\mathrm{minMI}}(\phi) = I_{\mathrm{CLUB}}(z; s, a) = \mathbb{E}_{z, s, a \sim p(z, s, a)} \log p(z \mid s, a) - \mathbb{E}_{z \sim p(z)}\, \mathbb{E}_{s, a \sim p(s, a)} \log p(z \mid s, a) \;\ge\; I(z; s, a)$$
In practice this upper bound is approximated with samples from the offline datasets:
$$\mathcal{L}_{\mathrm{minMI}}(\phi) \approx \mathbb{E}_{M \sim p(M)}\, \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}(M)}\, \mathbb{E}_{z \sim q_\phi(\cdot \mid s, a, r, s')} \left[ \log q_\psi(z \mid s, a) - \mathbb{E}_{\tilde{M} \sim p(M)}\, \mathbb{E}_{(\tilde{s}, \tilde{a}) \sim \mathcal{D}(\tilde{M})} \log q_\psi(z \mid \tilde{s}, \tilde{a}) \right]$$
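A sketch of a sample-based CLUB estimate, assuming $q_\psi(z \mid s, a)$ is modeled as a diagonal Gaussian by a small predictor network; the class name `VariationalPredictor` and the shuffled-batch negative sampling are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VariationalPredictor(nn.Module):
    """Variational approximation q_psi(z | s, a) used by CLUB, modeled as a diagonal Gaussian."""

    def __init__(self, obs_dim, act_dim, z_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * z_dim),
        )

    def log_prob(self, s, a, z):
        # Gaussian log-density of z under q_psi(. | s, a), up to an additive constant
        mu, log_var = self.net(torch.cat([s, a], dim=-1)).chunk(2, dim=-1)
        return (-0.5 * ((z - mu) ** 2 / log_var.exp() + log_var)).sum(-1)

def club_min_mi_loss(predictor, s, a, z):
    """Sampled CLUB upper bound on I(z; s, a).

    s, a: (B, dim) state-action pairs from the offline data
    z:    (B, z_dim) task representations from q_phi for the matching transitions
    """
    positive = predictor.log_prob(s, a, z)                 # log q_psi(z_i | s_i, a_i)
    # pair each z_i with a shuffled (s_j, a_j), standing in for data from other tasks
    perm = torch.randperm(len(s))
    negative = predictor.log_prob(s[perm], a[perm], z)     # log q_psi(z_i | s~, a~)
    return (positive - negative).mean()
```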
where $p(z \mid s, a)$ is approximated by a variational distribution $q_\psi(z \mid s, a)$ trained in parallel with the context encoder by minimizing
$$\mathcal{L}_{\mathrm{VD}}(\psi) = -\,\mathbb{E}_{M \sim p(M)}\, \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}(M)}\, \mathbb{E}_{z \sim q_\phi(\cdot \mid s, a, r, s')} \log q_\psi(z \mid s, a)$$
CSRO combines the above two objectives to obtain the total loss of the context encoder, $\mathcal{L}_{\mathrm{encoder}} = \mathcal{L}_{\mathrm{maxMI}} + \lambda\, \mathcal{L}_{\mathrm{minMI}}$, as sketched below.
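A sketch of how the two encoder terms and the predictor loss $\mathcal{L}_{\mathrm{VD}}$ could be combined in one training step, reusing the modules sketched above; the detach placement, the per-transition use of the encoder, and the value of $\lambda$ are assumptions:

```python
import torch

# given: encoder (ContextEncoder), predictor (VariationalPredictor), focal_metric_loss,
# club_min_mi_loss, and a mixed multi-task batch of transitions s, a, s_next of shape
# (B, dim), rewards r of shape (B, 1), and integer task_ids of shape (B,)
z = encoder.net(torch.cat([s, a, r, s_next], dim=-1))   # per-transition embeddings from q_phi

# L_VD: fit q_psi(z | s, a) to the encoder's current outputs (its optimizer updates only psi)
l_vd = -predictor.log_prob(s, a, z.detach()).mean()

# encoder loss: FOCAL metric term plus the lambda-weighted CLUB penalty
# (its optimizer updates only phi, so q_psi is treated as fixed here)
lam = 0.1                                                # placeholder value for lambda
l_encoder = focal_metric_loss(z, task_ids) + lam * club_min_mi_loss(predictor, s, a, z)
```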
Based on the learned context encoder, CSRO then conducts offline meta behavior learning with a behavior-regularized actor and critic.
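The regularization scheme is not spelled out in this section, so the following is only a shape-level sketch of a task-conditioned, behavior-regularized actor-critic update; `behavior_penalty` is a placeholder for whatever divergence-to-behavior-policy term the implementation uses, and all signatures are assumptions:

```python
import torch

def behavior_regularized_losses(actor, critic, target_critic, behavior_penalty,
                                s, a, r, s_next, done, z, gamma=0.99, alpha=1.0):
    """Shape-level sketch of task-conditioned, behavior-regularized actor-critic losses.

    z is the task embedding from the context encoder; it is detached so behavior
    learning does not backpropagate into q_phi.
    """
    z = z.detach()
    with torch.no_grad():
        q_target = r + gamma * (1.0 - done) * target_critic(s_next, actor(s_next, z), z)
    critic_loss = ((critic(s, a, z) - q_target) ** 2).mean()

    new_a = actor(s, z)
    actor_loss = (-critic(s, new_a, z) + alpha * behavior_penalty(s, new_a)).mean()
    return critic_loss, actor_loss
```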
At the meta-test phase, CSRO performs random exploration to collect experience in the early stage, instead of rolling out $\pi_\theta(a \mid s, z_0)$.
Such a non-prior context collection strategy eliminates the influence of the initially sampled task representation $z_0$ on the collected context.
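A sketch of this non-prior context collection at meta-test time; the fixed exploration budget, the old-style gym step interface, and the actor/encoder signatures are assumptions:

```python
import numpy as np
import torch

def meta_test_rollout(env, actor, encoder, explore_steps=200, max_steps=1000):
    """Collect early context with random actions, then act with the inferred task embedding."""
    context, z = [], None
    s = env.reset()
    for t in range(max_steps):
        if t < explore_steps:
            a = env.action_space.sample()              # non-prior exploration: no z_0 is used
        else:
            if z is None:
                # infer z once, from the context gathered by random exploration
                s_c, a_c, r_c, sn_c = (np.asarray(x) for x in zip(*context))
                z = encoder(torch.as_tensor(s_c).float(),
                            torch.as_tensor(a_c).float(),
                            torch.as_tensor(r_c).float().unsqueeze(-1),
                            torch.as_tensor(sn_c).float())
            a = actor(torch.as_tensor(s).float(), z).detach().numpy()
        s_next, r, done, _ = env.step(a)
        context.append((s, a, r, s_next))
        s = s_next
        if done:
            break
    return context, z
```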