MBRL
Contrastive Triplet Loss
The offline datasets $\{\mathcal{B}_i\}$ of the tasks $\{\mathcal{M}_i\}$ used for task representation learning may differ in their state-action visitation distributions. This can cause the context encoder $q(z \mid c)$ to depend only on $(s_t, a_t)$ rather than on the causal relationship between $(s_t, a_t)$ and $(r_t, s_t')$. MBRL therefore proposes a contrastive triplet loss that forces the context encoder to ignore differences in $(s_t, a_t)$:
$$\mathcal{L}_{\text{triplet}} = \mathbb{E}_{\mathcal{M}_i \sim \{\mathcal{M}\}} \, \mathbb{E}_{\mathcal{M}_j \sim \{\mathcal{M}\} \setminus \mathcal{M}_i} \, \mathbb{E}_{c_i \sim \mathcal{B}_i} \, \mathbb{E}_{c_j \sim \mathcal{B}_j} \, \operatorname{ReLU}\!\left[ d\!\left( q(\cdot \mid c_{j \to i}) \,\Vert\, q(\cdot \mid c_i) \right) - d\!\left( q(\cdot \mid c_{j \to i}) \,\Vert\, q(\cdot \mid c_j) \right) + m \right]$$
where $c_{j \to i}$ denotes the context $c_j$ relabeled by the learned transition or reward function of task $\mathcal{M}_i$, applied to the $(s_t, a_t)$ pairs in $c_j$; $d(\cdot \Vert \cdot)$ is a divergence between the encoder's output distributions and $m$ is the margin.
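Below is a minimal PyTorch sketch of this loss, assuming the context encoder outputs the mean and log-variance of a diagonal Gaussian $q(z \mid c)$ and using a symmetrized KL divergence as the distance $d$; the function and variable names are illustrative, not taken from the original paper.

```python
import torch
import torch.nn.functional as F


def gaussian_kl(mu_p, logvar_p, mu_q, logvar_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for diagonal Gaussians, summed over latent dims."""
    var_p, var_q = logvar_p.exp(), logvar_q.exp()
    return 0.5 * (logvar_q - logvar_p + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0).sum(-1)


def triplet_loss(encoder, c_i, c_j, c_j_to_i, margin=1.0):
    """Anchor: relabeled context c_{j->i}; positive: c_i from task i; negative: c_j from task j."""
    mu_a, logvar_a = encoder(c_j_to_i)   # q(. | c_{j->i})
    mu_p, logvar_p = encoder(c_i)        # q(. | c_i)
    mu_n, logvar_n = encoder(c_j)        # q(. | c_j)

    # Symmetrized KL as the distance d(. || .) between encoder outputs (an assumption).
    d_pos = gaussian_kl(mu_a, logvar_a, mu_p, logvar_p) + gaussian_kl(mu_p, logvar_p, mu_a, logvar_a)
    d_neg = gaussian_kl(mu_a, logvar_a, mu_n, logvar_n) + gaussian_kl(mu_n, logvar_n, mu_a, logvar_a)

    # ReLU[ d(anchor || positive) - d(anchor || negative) + m ], averaged over the batch
    return F.relu(d_pos - d_neg + margin).mean()
```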
Multi-Task Policy Distillation
MBRL distills the single-task policies $\langle Q_i, G_i, \xi_i \rangle$ learned by BCQ into a multi-task policy $\langle Q_D, G_D, \xi_D \rangle$ conditioned on task embeddings, with one distillation loss per component ($\mathrm{sg}(\cdot)$ denotes the stop-gradient operator; a code sketch follows the table):
| Component | Loss |
| --- | --- |
| $Q_D(s, a \mid z)$ | $\mathcal{L}_Q = \mathbb{E}_{\mathcal{M}_i \sim \{\mathcal{M}\}} \, \mathbb{E}_{(s, a), c \sim \mathcal{B}_i} \, \mathbb{E}_{z \sim q(\cdot \mid c)} \left[ \left( Q_i(s, a) - Q_D(s, a \mid z) \right)^2 + \beta \, D_{\mathrm{KL}}\!\left( q(\cdot \mid c) \,\Vert\, \mathcal{N}(0, I) \right) \right]$ |
| $G_D(s, \nu \mid z)$ | $\mathcal{L}_G = \mathbb{E}_{\mathcal{M}_i \sim \{\mathcal{M}\}} \, \mathbb{E}_{s, c \sim \mathcal{B}_i} \, \mathbb{E}_{z \sim q(\cdot \mid c)} \, \mathbb{E}_{\nu \sim \mathcal{N}(0, I)} \left[ G_i(s, \nu) - G_D(s, \nu \mid \mathrm{sg}(z)) \right]^2$ |
| $\xi_D(s, a \mid z)$ | $\mathcal{L}_\xi = \mathbb{E}_{\mathcal{M}_i \sim \{\mathcal{M}\}} \, \mathbb{E}_{s, c \sim \mathcal{B}_i} \, \mathbb{E}_{z \sim q(\cdot \mid c)} \, \mathbb{E}_{\nu \sim \mathcal{N}(0, I)} \left[ \xi_i(s, G_i(s, \nu)) - \xi_D(s, G_i(s, \nu) \mid \mathrm{sg}(z)) \right]^2$ |
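A minimal PyTorch sketch of the three losses above is given next, assuming per-task teacher networks `q_i`, `g_i`, `xi_i` (the frozen BCQ components), distilled student networks `q_d`, `g_d`, `xi_d` that additionally take the task embedding, and a reparameterized Gaussian context encoder; all names and signatures are illustrative assumptions.

```python
import torch


def distillation_losses(encoder, q_i, g_i, xi_i, q_d, g_d, xi_d, s, a, c, noise_dim, beta=0.1):
    # z ~ q(. | c) via the reparameterization trick
    mu, logvar = encoder(c)
    std = (0.5 * logvar).exp()
    z = mu + std * torch.randn_like(std)
    z_sg = z.detach()                                  # sg(z): stop-gradient for the G and xi losses

    # L_Q: regression to the teacher's Q-values plus the KL regularizer toward N(0, I)
    kl = 0.5 * (mu.pow(2) + std.pow(2) - 1.0 - logvar).sum(-1)
    loss_q = (q_i(s, a) - q_d(s, a, z)).pow(2).mean() + beta * kl.mean()

    # L_G: distill the candidate-action generator, with nu ~ N(0, I)
    nu = torch.randn(s.shape[0], noise_dim, device=s.device)
    a_cand = g_i(s, nu)
    loss_g = (a_cand - g_d(s, nu, z_sg)).pow(2).mean()

    # L_xi: distill the perturbation model, evaluated at the teacher's candidate actions
    loss_xi = (xi_i(s, a_cand) - xi_d(s, a_cand, z_sg)).pow(2).mean()
    return loss_q, loss_g, loss_xi
```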
The distilled multi-task policy can then be deployed directly on new tasks: a few-shot batch of transitions is passed through the context encoder to generate the task embedding.
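For concreteness, a minimal sketch of few-shot deployment is shown below, assuming the distilled components and encoder from above and BCQ-style action selection (sample candidate actions from $G_D$, perturb them with $\xi_D$, and pick the one with the highest value under $Q_D$); the helper names are again illustrative.

```python
import torch


@torch.no_grad()
def act(encoder, q_d, g_d, xi_d, state, context, noise_dim, n_candidates=10):
    # Few-shot transitions from the new task -> task embedding (posterior mean at test time).
    mu, _ = encoder(context)                         # context: the few-shot transition batch
    z = mu.expand(n_candidates, -1)

    s = state.expand(n_candidates, -1)               # state: shape (1, state_dim)
    nu = torch.randn(n_candidates, noise_dim, device=state.device)

    candidates = g_d(s, nu, z)                       # candidate actions from the distilled generator
    perturbed = candidates + xi_d(s, candidates, z)  # distilled perturbation model
    best = q_d(s, perturbed, z).argmax()             # greedy w.r.t. the distilled Q-network
    return perturbed[best]
```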