SMAC
SMAC adapts PEARL to the offline setting, using AWAC as the offline RL algorithm and learning an additional reward model
The context encoder $q_{\phi_e}(z \mid h)$ and the reward model $r_{\phi_d}(s, a, z)$ are trained jointly, and the overall objectives are listed below:
| Objective | Target |
| --- | --- |
| $\mathcal{L}_{\text{critic}}(h, h') = \mathbb{E}_{z \sim q_{\phi_e}(\cdot \mid h),\, (s, a, r, s') \sim h',\, a' \sim \pi_\theta(\cdot \mid s', z)}\left[r + \gamma Q_{w^-}(s', a', z) - Q_w(s, a, z)\right]^2$ | $w$ |
| $\mathcal{L}_{\text{actor}}(h, h') = \mathbb{E}_{z \sim q_{\phi_e}(\cdot \mid h),\, (s, a, r) \sim h',\, \tilde{a} \sim \pi_\theta(\cdot \mid s, z)}\left[\exp\left(\lambda\left(Q_w(s, a, z) - Q_w(s, \tilde{a}, z)\right)\right) \log \pi_\theta(a \mid s, z)\right]$ | $\theta$ |
| $\mathcal{L}_{\text{reward}}(h, h') = \mathbb{E}_{z \sim q_{\phi_e}(\cdot \mid h),\, (s, a, r) \sim h'}\left[\left(r_{\phi_d}(s, a, z) - r\right)^2 + D_{\mathrm{KL}}\left(q_{\phi_e}(z \mid h) \,\Vert\, p(z)\right)\right]$ | $\phi$ |
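The table maps onto a single training step: the critic is regressed toward a target-network TD target, the actor maximizes an advantage-weighted log-likelihood of dataset actions, and the encoder/decoder pair is trained as a reward-reconstructing variational model. Below is a minimal PyTorch sketch of such a step, assuming hypothetical `encoder`, `critic`, `target_critic`, `policy`, and `reward_decoder` modules and illustrative hyperparameters (`lam`, `kl_weight`); none of these names or defaults come from the paper.

```python
import torch
import torch.nn.functional as F

def smac_losses(encoder, critic, target_critic, policy, reward_decoder,
                h, batch, gamma=0.99, lam=1.0, kl_weight=0.1):
    """Sketch of the three SMAC objectives for one batch (illustrative only)."""
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]

    # Infer the task embedding z from the context h (reparameterized sample).
    z_mean, z_std = encoder(h)
    z_dist = torch.distributions.Normal(z_mean, z_std)
    z = z_dist.rsample().expand(s.shape[0], -1)   # broadcast z over the batch

    # Critic loss: TD error with a target network, all Q-values conditioned on z.
    with torch.no_grad():
        a_next = policy(s_next, z).sample()
        td_target = r + gamma * target_critic(s_next, a_next, z)
    critic_loss = F.mse_loss(critic(s, a, z), td_target)

    # Actor loss (AWAC-style): advantage-weighted log-likelihood of dataset
    # actions, with the baseline Q evaluated at actions from the current policy.
    # Assumes `policy` returns a factorized Normal over action dimensions.
    pi = policy(s, z)
    a_tilde = pi.sample()
    with torch.no_grad():
        advantage = critic(s, a, z) - critic(s, a_tilde, z)
        weight = torch.exp(lam * advantage)
    # Negated so that minimizing this value maximizes the weighted log-likelihood.
    actor_loss = -(weight * pi.log_prob(a).sum(-1, keepdim=True)).mean()

    # Reward-model loss: reward reconstruction plus a KL term that keeps the
    # posterior q(z | h) close to the prior p(z) = N(0, I).
    prior = torch.distributions.Normal(torch.zeros_like(z_mean),
                                       torch.ones_like(z_std))
    kl = torch.distributions.kl_divergence(z_dist, prior).sum()
    reward_loss = F.mse_loss(reward_decoder(s, a, z), r) + kl_weight * kl

    return critic_loss, actor_loss, reward_loss
```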
Before meta-testing, SMAC performs meaningful exploration with task embeddings $z$ sampled from the prior distribution $p(z)$
The explored trajectories, which carry no rewards, are relabeled by the reward model using contexts sampled from the offline dataset
SMAC then fine-tunes the policy on the dataset augmented with the relabeled trajectories to tackle the distribution shift in the $z$ space
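The relabeling step might look like the sketch below, reusing the hypothetical `encoder` and `reward_decoder` from the previous snippet; `explored` stands for reward-free $(s, a, s')$ transitions collected with the policy conditioned on $z \sim p(z)$, and `offline_context` for a context $h$ drawn from the offline dataset. All names are illustrative, not the paper's.

```python
import torch

def relabel_trajectories(explored, offline_context, encoder, reward_decoder):
    """Attach predicted rewards to reward-free online transitions (sketch)."""
    # Infer a posterior task embedding from an offline context, so the reward
    # model is queried with a z close to those it was trained on.
    z_mean, z_std = encoder(offline_context)
    z = torch.distributions.Normal(z_mean, z_std).sample()

    relabeled = []
    for s, a, s_next in explored:
        with torch.no_grad():
            r_hat = reward_decoder(s, a, z)   # predicted reward r_{phi_d}(s, a, z)
        relabeled.append((s, a, r_hat, s_next))
    return relabeled
```

The policy and critic are then fine-tuned with the same losses as above on the offline dataset augmented with these relabeled transitions.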