MBRL

Contrastive Triplet Loss

The offline datasets $\{\mathcal{B}_{i}\}$ of the tasks $\{M_{i}\}$ used for task representation learning may differ in their state-action visitation distributions, which can lead the context encoder $q(z \mid c)$ to depend only on $(s_{t},\ a_{t})$ rather than on the causal relationship between $(s_{t},\ a_{t})$ and $(r_{t},\ s_{t}')$. MBRL therefore proposes a contrastive triplet loss that forces the context encoder to ignore differences in $(s_{t},\ a_{t})$:

$$
\mathcal{L}_{\text{triplet}} = \mathcal{E}_{M_{i} \sim \{ M \}}\, \mathcal{E}_{M_{j} \sim \{ M \} \setminus M_{i}}\, \mathcal{E}_{c_{i} \sim \mathcal{B}_{i}}\, \mathcal{E}_{c_{j} \sim \mathcal{B}_{j}}\, \operatorname{ReLU}\!\left[ d\Big( q(\cdot \mid c_{j \to i})\ \Vert\ q(\cdot \mid c_{i}) \Big) - d\Big( q(\cdot \mid c_{j \to i})\ \Vert\ q(\cdot \mid c_{j}) \Big) + m \right]
$$

where $c_{j \to i}$ denotes the context $c_{j}$ relabeled by the learned transition or reward function of task $M_{i}$, based on the $(s_{t},\ a_{t})$ pairs in $c_{j}$, and $m$ is the margin.
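A minimal sketch of this loss, assuming the context encoder returns the mean and log-std of a diagonal Gaussian $q(\cdot \mid c)$ and taking $d$ to be the squared L2 distance between posterior means (both the `encoder` interface and this choice of $d$ are assumptions, not the paper's exact design):

```python
import torch
import torch.nn.functional as F

def triplet_loss(encoder, c_i, c_j, c_j_to_i, margin=1.0):
    """Contrastive triplet loss on context-encoder posteriors (sketch).

    Assumed interface: encoder(c) -> (mu, log_std) of a diagonal Gaussian q(.|c);
    d(.||.) is approximated by the squared L2 distance between posterior means.
    """
    mu_i, _ = encoder(c_i)        # context sampled from task M_i
    mu_j, _ = encoder(c_j)        # context sampled from task M_j
    mu_ji, _ = encoder(c_j_to_i)  # c_j relabeled with task M_i's learned model

    d_pos = ((mu_ji - mu_i) ** 2).sum(dim=-1)  # should be small: same reward/dynamics
    d_neg = ((mu_ji - mu_j) ** 2).sum(dim=-1)  # should be large: different reward/dynamics
    return F.relu(d_pos - d_neg + margin).mean()
```

The relabeled context $c_{j \to i}$ shares its $(s_{t},\ a_{t})$ with $c_{j}$ but takes its rewards or next states from task $M_{i}$'s learned model, so minimizing this loss pushes the embedding to track only the reward/transition structure rather than the visitation distribution.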

Multi-Task Policy Distillation

MBRL distills the single-task policies $\langle Q_{i},\ G_{i},\ \xi_{i} \rangle$ learned by BCQ (the critic, the generative action model, and the perturbation model) into a multi-task policy conditioned on task embeddings:

| Component | Loss |
| --- | --- |
| $Q_{D}(s,\ a \mid z)$ | $\mathcal{L}_{Q} = \mathcal{E}_{M_{i} \sim \{ M \}}\, \mathcal{E}_{(s,\ a),\ c \sim \mathcal{B}_{i}}\, \mathcal{E}_{z \sim q(\cdot \mid c)} \left[ \big( Q_{i}(s,\ a) - Q_{D}(s,\ a \mid z) \big)^{2} + \beta D_{\text{KL}} \big( q(\cdot \mid c)\ \Vert\ \mathcal{N}(0,\ \boldsymbol{I}) \big) \right]$ |
| $G_{D}(s,\ \nu \mid z)$ | $\mathcal{L}_{G} = \mathcal{E}_{M_{i} \sim \{ M \}}\, \mathcal{E}_{s,\ c \sim \mathcal{B}_{i}}\, \mathcal{E}_{z \sim q(\cdot \mid c)}\, \mathcal{E}_{\nu \sim \mathcal{N}(0,\ \boldsymbol{I})} \left[ G_{i}(s,\ \nu) - G_{D}\big( s,\ \nu \mid \operatorname{sg}(z) \big) \right]^{2}$ |
| $\xi_{D}(s,\ a \mid z)$ | $\mathcal{L}_{\xi} = \mathcal{E}_{M_{i} \sim \{ M \}}\, \mathcal{E}_{s,\ c \sim \mathcal{B}_{i}}\, \mathcal{E}_{z \sim q(\cdot \mid c)}\, \mathcal{E}_{\nu \sim \mathcal{N}(0,\ \boldsymbol{I})} \left[ \xi_{i}\big( s,\ G_{i}(s,\ \nu) \big) - \xi_{D}\big( s,\ G_{i}(s,\ \nu) \mid \operatorname{sg}(z) \big) \right]^{2}$ |
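A sketch of one task's slice of these losses, under the same assumptions (hypothetical network interfaces and names; $Q_{i}, G_{i}, \xi_{i}$ are frozen single-task BCQ networks, `noise_dim` is the generator's noise dimension):

```python
import torch

def distillation_losses(batch, encoder, q_d, g_d, xi_d, q_i, g_i, xi_i,
                        noise_dim, beta=0.1):
    """Single-task slice of the distillation objective (hypothetical interfaces).

    q_i, g_i, xi_i: frozen single-task BCQ networks of task M_i.
    q_d, g_d, xi_d: multi-task networks conditioned on the task embedding z.
    """
    s, a, c = batch["state"], batch["action"], batch["context"]

    mu, log_std = encoder(c)                   # q(. | c) as a diagonal Gaussian
    std = log_std.exp()
    z = mu + std * torch.randn_like(std)       # reparameterised sample z ~ q(. | c)
    z = z.expand(s.shape[0], -1)               # broadcast one task embedding over the batch

    # Closed-form KL( q(.|c) || N(0, I) )
    kl = 0.5 * (mu ** 2 + std ** 2 - 2.0 * log_std - 1.0).sum(dim=-1).mean()

    # Critic distillation; this term (plus the KL) is the one that trains the encoder.
    loss_q = ((q_i(s, a) - q_d(s, a, z)) ** 2).mean() + beta * kl

    # Generator and perturbation distillation use a stop-gradient on z.
    z_sg = z.detach()
    nu = torch.randn(s.shape[0], noise_dim)    # decoder noise nu ~ N(0, I)
    a_gen = g_i(s, nu)                         # actions proposed by the frozen generator
    loss_g = ((a_gen - g_d(s, nu, z_sg)) ** 2).mean()
    loss_xi = ((xi_i(s, a_gen) - xi_d(s, a_gen, z_sg)) ** 2).mean()
    return loss_q, loss_g, loss_xi
```

Note that $\operatorname{sg}(z)$ appears only in $\mathcal{L}_{G}$ and $\mathcal{L}_{\xi}$, so the context encoder is shaped by the critic loss and its KL regularizer alone.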

The distilled multi-task policy can then be deployed directly on new tasks: a few-shot batch of samples from the new task is used to generate its task embedding.
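A sketch of test-time action selection in the BCQ style, again with hypothetical interfaces and an assumed noise dimension `NOISE_DIM`:

```python
import torch

NOISE_DIM = 16  # assumed noise dimension of the generative action model

@torch.no_grad()
def act(s, context, encoder, q_d, g_d, xi_d, n_candidates=10):
    """BCQ-style action selection with the distilled networks (sketch).

    s: a single state of shape (1, state_dim).
    context: a few-shot batch of transitions collected on the new task.
    """
    mu, _ = encoder(context)                 # few-shot context -> task embedding
    z = mu.expand(n_candidates, -1)          # use the posterior mean at test time
    s_rep = s.expand(n_candidates, -1)
    nu = torch.randn(n_candidates, NOISE_DIM)
    a = g_d(s_rep, nu, z)                    # candidate actions from the generator
    a = a + xi_d(s_rep, a, z)                # add the learned perturbation
    return a[q_d(s_rep, a, z).argmax()]      # keep the highest-value candidate
```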

