FOCAL
Task Representation Learning
FOCAL considers a set of tasks $\{\mathcal{T}_i\}$ with point-wise distinct transition dynamics and reward functions, each paired with its own offline dataset $\mathcal{D}_i$:
$$\forall (s, a) \in \mathcal{S} \times \mathcal{A}: \quad \mathcal{T}_1 = \mathcal{T}_2 \iff P_1(\cdot \mid s, a) = P_2(\cdot \mid s, a) \wedge R_1(s, a) = R_2(s, a)$$
To achieve efficient and robust task-representation inference, FOCAL adopts a negative-power variant of the contrastive loss:
$$\mathcal{L}_{dml} = \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbb{E}_{(s_i, a_i, r_i, s_i') \sim \mathcal{D}_i,\, (s_j, a_j, r_j, s_j') \sim \mathcal{D}_j} \left[ \mathbf{1}\{i = j\} \, \|z_i - z_j\|_2^2 + \mathbf{1}\{i \neq j\} \, \frac{\beta}{\|z_i - z_j\|_2^n + \epsilon} \right]$$
where the task embedding $z$ is produced by a deterministic context encoder $q_\phi(z \mid c)$ with context $c = (s, a, r, s')$, which is valid because transition tuples are unique across tasks. The negative-power distance term enforces separation between task clusters in the embedding space.
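As a concrete illustration, below is a minimal PyTorch sketch of this metric-learning loss over a batch of task embeddings. The function name `dml_loss`, the batching scheme, and the default values of `n`, `beta`, and `eps` are illustrative assumptions, not FOCAL's exact implementation.

```python
import torch

def dml_loss(z, task_ids, n=2, beta=1.0, eps=1e-3):
    """Negative-power distance metric learning loss over a batch of task embeddings.

    z:        (B, d) embeddings from the context encoder q_phi
    task_ids: (B,)   integer task index of each embedding
    Assumes the batch mixes samples from several tasks.
    """
    dist = torch.cdist(z, z, p=2)                           # pairwise L2 distances, (B, B)
    same = task_ids.unsqueeze(0) == task_ids.unsqueeze(1)   # True where i and j share a task
    off_diag = ~torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = (dist ** 2)[same & off_diag].mean()               # pull same-task embeddings together
    neg = (beta / (dist ** n + eps))[~same].mean()          # push different-task embeddings apart
    return pos + neg
```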
Based on the multi-task offline datasets, FOCAL trains a behavior-regularized actor and critic conditioned on the task embedding:
$$\max_{\pi} \; \mathbb{E}_{s_0 \sim \rho_0(\cdot \mid z),\, a_t \sim \pi(\cdot \mid s_t, z),\, s_{t+1} \sim p(\cdot \mid s_t, a_t, z)} \left[ \sum_{t=0}^{\infty} \gamma^t \big( R(s_t, a_t, z) - \alpha D(\pi, \pi_b \mid s_t, z) \big) \right]$$
Similar to SAC, the losses for the regularized critic and actor, approximated on the offline datasets, are
$$\mathcal{L}_{critic} = \sum_{i=1}^{n} \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}_i} \Big[ \big( r + \gamma \, ( \mathbb{E}_{a' \sim \pi_\theta(\cdot \mid s', z_i)} Q_\psi(s', a', z_i) - \alpha D(\pi_\theta, \pi_b \mid s', z_i) ) - Q_\psi(s, a, z_i) \big)^2 \Big]$$

$$\mathcal{L}_{actor} = -\sum_{i=1}^{n} \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}_i} \Big[ \mathbb{E}_{\tilde{a} \sim \pi_\theta(\cdot \mid s, z_i)} Q_\psi(s, \tilde{a}, z_i) - \alpha D(\pi_\theta, \pi_b \mid s, z_i) \Big]$$
The learning of task representations and of the behavior policy is decoupled (gradients are detached between the two) for better efficiency and stability; a sketch of one such decoupled update follows.
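The PyTorch sketch below illustrates one behavior-regularized actor-critic update for a single task. The `actor.sample`, `critic.target`, and `behavior_div` interfaces are assumed for illustration; the key points are the conditioning on $z_i$ and the `detach()` that blocks actor-critic gradients from reaching the context encoder.

```python
import torch
import torch.nn.functional as F

def actor_critic_losses(batch, z, actor, critic, behavior_div, gamma=0.99, alpha=0.1):
    """Behavior-regularized actor-critic losses for one task, conditioned on its embedding z.

    `actor.sample`, `critic.target` (target network), and `behavior_div` (an estimate of
    D(pi_theta, pi_b | s, z), e.g. a KL divergence) are hypothetical interfaces.
    """
    s, a, r, s_next = batch        # tensors sampled from the task's offline dataset D_i
    z = z.detach()                 # stop gradients into the context encoder (decoupled training)

    with torch.no_grad():
        a_next = actor.sample(s_next, z)                  # a' ~ pi_theta(.|s', z)
        q_next = critic.target(s_next, a_next, z)         # target Q(s', a', z)
        penalty = behavior_div(actor, s_next, z)          # D(pi_theta, pi_b | s', z)
        target = r + gamma * (q_next - alpha * penalty)

    critic_loss = F.mse_loss(critic(s, a, z), target)

    a_new = actor.sample(s, z)                            # reparameterized actions for the actor loss
    actor_loss = -(critic(s, a_new, z) - alpha * behavior_div(actor, s, z)).mean()

    return critic_loss, actor_loss
```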
The learned contextual policy can then be deployed on unseen tasks, with task embeddings inferred from a few context samples of the new task.
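A possible deployment routine could look like the following sketch, where `encoder`, `actor.sample`, and the mean-pooling of per-transition embeddings are assumptions about the interface rather than FOCAL's exact aggregation scheme.

```python
import torch

def adapt_and_act(encoder, actor, context_batch, state):
    """Few-shot adaptation to an unseen task.

    context_batch: (K, dim) tensor of a few (s, a, r, s') tuples collected from the new task
    state:         (state_dim,) current observation
    """
    with torch.no_grad():
        # encode each context tuple and average to obtain a single task embedding (assumed pooling)
        z = encoder(context_batch).mean(dim=0, keepdim=True)   # (1, d)
        # condition the trained policy on the inferred embedding
        action = actor.sample(state.unsqueeze(0), z)
    return action, z
```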