SMAC

Offline Meta Training

SMAC adapts PEARL to the offline setting, using AWAC as the offline RL algorithm and learning an additional reward model.

The context encoder $q_{\phi_{e}}(z \mid h)$ and the reward model $r_{\phi_{d}}(s, a, z)$ are trained jointly, with the overall objectives listed below:

| Objective | Target |
| --- | --- |
| $\mathcal{L}_{\text{critic}}(h, h') = \mathbb{E}_{z \sim q_{\phi_{e}}(\cdot \mid h),\ (s, a, r, s') \sim h',\ a' \sim \pi_{\theta}(\cdot \mid s', z)} \left[ r + \gamma Q_{w^{-}}(s', a', z) - Q_{w}(s, a, z) \right]^{2}$ | $w$ |
| $\mathcal{L}_{\text{actor}}(h, h') = \mathbb{E}_{z \sim q_{\phi_{e}}(\cdot \mid h),\ (s, a, r) \sim h',\ \tilde{a} \sim \pi_{\theta}(\cdot \mid s, z)} \left[ \exp\left( \dfrac{Q_{w}(s, a, z) - Q_{w}(s, \tilde{a}, z)}{\lambda} \right) \log \pi_{\theta}(a \mid s, z) \right]$ | $\theta$ |
| $\mathcal{L}_{\text{reward}}(h, h') = \mathbb{E}_{z \sim q_{\phi_{e}}(\cdot \mid h),\ (s, a, r) \sim h'} \left[ \left( r_{\phi_{d}}(s, a, z) - r \right)^{2} + D_{\text{KL}}\big( q_{\phi_{e}}(z \mid h) \,\|\, p(z) \big) \right]$ | $\phi$ |
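For reference, here is a minimal PyTorch-style sketch of how these three losses could be computed on one batch. The module interfaces (`encoder`, `q_net`, `q_target`, `policy`, `reward_decoder`) and the batch layout are illustrative assumptions, not the authors' implementation; note how only the reward loss backpropagates into the encoder, matching the target column above.

```python
import torch
import torch.nn.functional as F

def smac_losses(encoder, q_net, q_target, policy, reward_decoder,
                context, batch, gamma=0.99, lam=1.0, kl_weight=1.0):
    """Compute the three SMAC objectives on one batch (sketch).

    Assumptions: encoder(context) returns a torch Normal posterior
    q_phi_e(z | h); policy(s, z) returns a distribution pi_theta(. | s, z)
    whose log_prob sums over action dimensions; q_net / q_target and
    reward_decoder map (s, a, z) to per-sample scalars; z is assumed to
    broadcast against the batch dimension.
    """
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]

    # z ~ q_phi_e(. | h): sample the task embedding from the context batch.
    posterior = encoder(context)
    z = posterior.rsample()

    # Critic loss (target: w). TD error with a' ~ pi_theta(. | s', z);
    # z is detached so the encoder is not trained through the critic.
    with torch.no_grad():
        a_next = policy(s_next, z).sample()
        td_target = r + gamma * q_target(s_next, a_next, z)
    critic_loss = F.mse_loss(q_net(s, a, z.detach()), td_target)

    # Actor loss (target: theta). AWAC-style advantage-weighted likelihood,
    # negated here because the objective in the table is maximized.
    dist = policy(s, z.detach())
    a_tilde = dist.sample()
    with torch.no_grad():
        adv = q_net(s, a, z) - q_net(s, a_tilde, z)
        weight = torch.exp(adv / lam)
    actor_loss = -(weight * dist.log_prob(a)).mean()

    # Reward loss (target: phi). Reward reconstruction plus KL of the
    # posterior to the prior p(z) = N(0, I); gradients reach the encoder
    # only through this term.
    prior = torch.distributions.Normal(torch.zeros_like(z), torch.ones_like(z))
    kl = torch.distributions.kl_divergence(posterior, prior).sum(-1).mean()
    reward_loss = F.mse_loss(reward_decoder(s, a, z), r) + kl_weight * kl

    return critic_loss, actor_loss, reward_loss
```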

Online Meta Finetuning

Before meta testing, SMAC performs meaningful exploration with the task embedding $z$ sampled from the prior distribution $p(z)$.

The reward-free exploration trajectories are relabeled by the reward model, using context sampled from the offline dataset.

SMAC then finetunes the policy on the dataset augmented with the relabeled trajectories to tackle the distribution shift in the $z$ space.
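Below is a rough sketch of one finetuning iteration following these steps. `collect_trajectory` and `smac_update` are hypothetical placeholders for the rollout and gradient-update routines, and the buffer interface (`sample_context`, `sample_batch`, `add`) is likewise assumed for illustration.

```python
import torch

def online_finetune_step(env, policy, encoder, reward_decoder,
                         offline_buffer, online_buffer,
                         collect_trajectory, smac_update, latent_dim):
    """One iteration of online meta finetuning (sketch with placeholder helpers)."""
    # 1. Explore with z drawn from the prior p(z) = N(0, I): before meta
    #    testing, no task-specific context is available for the encoder yet.
    z_prior = torch.randn(latent_dim)
    traj = collect_trajectory(env, policy, z_prior)   # reward-free rollout

    # 2. Relabel the missing rewards with the learned reward model,
    #    conditioning on z inferred from context sampled from the offline data.
    with torch.no_grad():
        context = offline_buffer.sample_context()
        z_post = encoder(context).sample()
        traj["r"] = reward_decoder(traj["s"], traj["a"],
                                   z_post.expand(traj["s"].shape[0], -1))
    online_buffer.add(traj)

    # 3. Finetune on offline data augmented with the relabeled trajectories,
    #    so the policy and critic are trained on z's drawn from the prior and
    #    the distribution shift in z-space at meta-test time is reduced.
    smac_update(offline_buffer.sample_batch(),
                online_buffer.sample_batch(),
                context)
```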

