MBRL
Contrastive Triplet Loss
The offline datasets $\{\mathcal{B}_i\}$ of the tasks $\{\mathcal{M}_i\}$ used for task representation learning may differ in their state-action visitation distributions. This can cause the context encoder $q(z \mid c)$ to depend only on $(s_t, a_t)$ rather than on the causal relationship between $(s_t, a_t)$ and $(r_t, s_t')$. MBRL therefore proposes a contrastive triplet loss that forces the context encoder to ignore differences in $(s_t, a_t)$:
$$\mathcal{L}_{\text{triplet}} = \mathbb{E}_{\mathcal{M}_i \sim \{\mathcal{M}\}} \, \mathbb{E}_{\mathcal{M}_j \sim \{\mathcal{M}\} \setminus \mathcal{M}_i} \, \mathbb{E}_{c_i \sim \mathcal{B}_i} \, \mathbb{E}_{c_j \sim \mathcal{B}_j} \, \operatorname{ReLU}\!\left[ d\!\left( q(\cdot \mid c_{j \to i}) \,\Vert\, q(\cdot \mid c_i) \right) - d\!\left( q(\cdot \mid c_{j \to i}) \,\Vert\, q(\cdot \mid c_j) \right) + m \right]$$
where $c_{j \to i}$ denotes the context $c_j$ relabeled by the learned transition or reward function of task $\mathcal{M}_i$, applied to the $(s_t, a_t)$ pairs in $c_j$; $d(\cdot \Vert \cdot)$ is a divergence between the encoder's output distributions and $m$ is the margin.
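Below is a minimal PyTorch sketch of this loss, assuming the context encoder outputs the mean and log-variance of a diagonal Gaussian $q(z \mid c)$ and using a symmetrized KL divergence as the distance $d$; the function and variable names are illustrative, not taken from the original paper.

```python
import torch
import torch.nn.functional as F


def gaussian_kl(mu_p, logvar_p, mu_q, logvar_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for diagonal Gaussians, summed over latent dims."""
    var_p, var_q = logvar_p.exp(), logvar_q.exp()
    return 0.5 * (logvar_q - logvar_p + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0).sum(-1)


def triplet_loss(encoder, c_i, c_j, c_j_to_i, margin=1.0):
    """Anchor: relabeled context c_{j->i}; positive: c_i from task i; negative: c_j from task j."""
    mu_a, logvar_a = encoder(c_j_to_i)   # q(. | c_{j->i})
    mu_p, logvar_p = encoder(c_i)        # q(. | c_i)
    mu_n, logvar_n = encoder(c_j)        # q(. | c_j)

    # Symmetrized KL as the distance d(. || .) between encoder outputs (an assumption).
    d_pos = gaussian_kl(mu_a, logvar_a, mu_p, logvar_p) + gaussian_kl(mu_p, logvar_p, mu_a, logvar_a)
    d_neg = gaussian_kl(mu_a, logvar_a, mu_n, logvar_n) + gaussian_kl(mu_n, logvar_n, mu_a, logvar_a)

    # ReLU[ d(anchor || positive) - d(anchor || negative) + m ], averaged over the batch
    return F.relu(d_pos - d_neg + margin).mean()
```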
Multi-Task Policy Distillation
MBRL distills the single-task policies $\langle Q_i, G_i, \xi_i \rangle$ learned by BCQ into a multi-task policy $\langle Q_D, G_D, \xi_D \rangle$ conditioned on task embeddings, with one distillation loss per component ($\mathrm{sg}(\cdot)$ denotes the stop-gradient operator; a code sketch follows the table):
| Component | Loss |
| --- | --- |
| $Q_D(s, a \mid z)$ | $\mathcal{L}_Q = \mathbb{E}_{\mathcal{M}_i \sim \{\mathcal{M}\}} \, \mathbb{E}_{(s, a), c \sim \mathcal{B}_i} \, \mathbb{E}_{z \sim q(\cdot \mid c)} \left[ \left( Q_i(s, a) - Q_D(s, a \mid z) \right)^2 + \beta \, D_{\mathrm{KL}}\!\left( q(\cdot \mid c) \,\Vert\, \mathcal{N}(0, I) \right) \right]$ |
| $G_D(s, \nu \mid z)$ | $\mathcal{L}_G = \mathbb{E}_{\mathcal{M}_i \sim \{\mathcal{M}\}} \, \mathbb{E}_{s, c \sim \mathcal{B}_i} \, \mathbb{E}_{z \sim q(\cdot \mid c)} \, \mathbb{E}_{\nu \sim \mathcal{N}(0, I)} \left[ G_i(s, \nu) - G_D(s, \nu \mid \mathrm{sg}(z)) \right]^2$ |
| $\xi_D(s, a \mid z)$ | $\mathcal{L}_\xi = \mathbb{E}_{\mathcal{M}_i \sim \{\mathcal{M}\}} \, \mathbb{E}_{s, c \sim \mathcal{B}_i} \, \mathbb{E}_{z \sim q(\cdot \mid c)} \, \mathbb{E}_{\nu \sim \mathcal{N}(0, I)} \left[ \xi_i(s, G_i(s, \nu)) - \xi_D(s, G_i(s, \nu) \mid \mathrm{sg}(z)) \right]^2$ |
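A minimal PyTorch sketch of the three losses above is given next, assuming per-task teacher networks `q_i`, `g_i`, `xi_i` (the frozen BCQ components), distilled student networks `q_d`, `g_d`, `xi_d` that additionally take the task embedding, and a reparameterized Gaussian context encoder; all names and signatures are illustrative assumptions.

```python
import torch


def distillation_losses(encoder, q_i, g_i, xi_i, q_d, g_d, xi_d, s, a, c, noise_dim, beta=0.1):
    # z ~ q(. | c) via the reparameterization trick
    mu, logvar = encoder(c)
    std = (0.5 * logvar).exp()
    z = mu + std * torch.randn_like(std)
    z_sg = z.detach()                                  # sg(z): stop-gradient for the G and xi losses

    # L_Q: regression to the teacher's Q-values plus the KL regularizer toward N(0, I)
    kl = 0.5 * (mu.pow(2) + std.pow(2) - 1.0 - logvar).sum(-1)
    loss_q = (q_i(s, a) - q_d(s, a, z)).pow(2).mean() + beta * kl.mean()

    # L_G: distill the candidate-action generator, with nu ~ N(0, I)
    nu = torch.randn(s.shape[0], noise_dim, device=s.device)
    a_cand = g_i(s, nu)
    loss_g = (a_cand - g_d(s, nu, z_sg)).pow(2).mean()

    # L_xi: distill the perturbation model, evaluated at the teacher's candidate actions
    loss_xi = (xi_i(s, a_cand) - xi_d(s, a_cand, z_sg)).pow(2).mean()
    return loss_q, loss_g, loss_xi
```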
The distilled multi-task policy can then be deployed directly on new tasks: a few-shot batch of transitions is passed through the context encoder to generate the task embedding.
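For concreteness, a minimal sketch of few-shot deployment is shown below, assuming the distilled components and encoder from above and BCQ-style action selection (sample candidate actions from $G_D$, perturb them with $\xi_D$, and pick the one with the highest value under $Q_D$); the helper names are again illustrative.

```python
import torch


@torch.no_grad()
def act(encoder, q_d, g_d, xi_d, state, context, noise_dim, n_candidates=10):
    # Few-shot transitions from the new task -> task embedding (posterior mean at test time).
    mu, _ = encoder(context)                         # context: the few-shot transition batch
    z = mu.expand(n_candidates, -1)

    s = state.expand(n_candidates, -1)               # state: shape (1, state_dim)
    nu = torch.randn(n_candidates, noise_dim, device=state.device)

    candidates = g_d(s, nu, z)                       # candidate actions from the distilled generator
    perturbed = candidates + xi_d(s, candidates, z)  # distilled perturbation model
    best = q_d(s, perturbed, z).argmax()             # greedy w.r.t. the distilled Q-network
    return perturbed[best]
```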