SMAC
SMAC adapts PEARL to the offline setting, using AWAC as the offline RL algorithm and learning an additional reward model
The context encoder $q_{\phi_e}(z \mid h)$ and the reward model $r_{\phi_d}(s, a, z)$ are trained jointly, and the overall objectives are listed below:
| Objective | Target |
| --- | --- |
| $\mathcal{L}_{\text{critic}}(h, h') = \mathbb{E}_{z \sim q_{\phi_e}(\cdot \mid h),\, (s, a, r, s') \sim h',\, a' \sim \pi_\theta(\cdot \mid s', z)}\left[r + \gamma Q_{w^-}(s', a', z) - Q_w(s, a, z)\right]^2$ | $w$ |
| $\mathcal{L}_{\text{actor}}(h, h') = \mathbb{E}_{z \sim q_{\phi_e}(\cdot \mid h),\, (s, a, r) \sim h',\, \tilde{a} \sim \pi_\theta(\cdot \mid s, z)}\left[\exp\left(\lambda\left(Q_w(s, a, z) - Q_w(s, \tilde{a}, z)\right)\right) \log \pi_\theta(a \mid s, z)\right]$ | $\theta$ |
| $\mathcal{L}_{\text{reward}}(h, h') = \mathbb{E}_{z \sim q_{\phi_e}(\cdot \mid h),\, (s, a, r) \sim h'}\left[\left(r_{\phi_d}(s, a, z) - r\right)^2 + D_{\mathrm{KL}}\left(q_{\phi_e}(z \mid h) \,\Vert\, p(z)\right)\right]$ | $\phi$ |
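The table maps onto a single training step: the critic is regressed toward a target-network TD target, the actor maximizes an advantage-weighted log-likelihood of dataset actions, and the encoder/decoder pair is trained as a reward-reconstructing variational model. Below is a minimal PyTorch sketch of such a step, assuming hypothetical `encoder`, `critic`, `target_critic`, `policy`, and `reward_decoder` modules and illustrative hyperparameters (`lam`, `kl_weight`); none of these names or defaults come from the paper.

```python
import torch
import torch.nn.functional as F

def smac_losses(encoder, critic, target_critic, policy, reward_decoder,
                h, batch, gamma=0.99, lam=1.0, kl_weight=0.1):
    """Sketch of the three SMAC objectives for one batch (illustrative only)."""
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]

    # Infer the task embedding z from the context h (reparameterized sample).
    z_mean, z_std = encoder(h)
    z_dist = torch.distributions.Normal(z_mean, z_std)
    z = z_dist.rsample().expand(s.shape[0], -1)   # broadcast z over the batch

    # Critic loss: TD error with a target network, all Q-values conditioned on z.
    with torch.no_grad():
        a_next = policy(s_next, z).sample()
        td_target = r + gamma * target_critic(s_next, a_next, z)
    critic_loss = F.mse_loss(critic(s, a, z), td_target)

    # Actor loss (AWAC-style): advantage-weighted log-likelihood of dataset
    # actions, with the baseline Q evaluated at actions from the current policy.
    # Assumes `policy` returns a factorized Normal over action dimensions.
    pi = policy(s, z)
    a_tilde = pi.sample()
    with torch.no_grad():
        advantage = critic(s, a, z) - critic(s, a_tilde, z)
        weight = torch.exp(lam * advantage)
    # Negated so that minimizing this value maximizes the weighted log-likelihood.
    actor_loss = -(weight * pi.log_prob(a).sum(-1, keepdim=True)).mean()

    # Reward-model loss: reward reconstruction plus a KL term that keeps the
    # posterior q(z | h) close to the prior p(z) = N(0, I).
    prior = torch.distributions.Normal(torch.zeros_like(z_mean),
                                       torch.ones_like(z_std))
    kl = torch.distributions.kl_divergence(z_dist, prior).sum()
    reward_loss = F.mse_loss(reward_decoder(s, a, z), r) + kl_weight * kl

    return critic_loss, actor_loss, reward_loss
```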
Before meta-testing, SMAC performs meaningful exploration with task embeddings $z$ sampled from the prior distribution $p(z)$
The explored trajectories, which carry no rewards, are relabeled by the reward model using contexts sampled from the offline dataset
SMAC then fine-tunes the policy on the dataset augmented with the relabeled trajectories to tackle the distribution shift in the $z$ space
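The relabeling step might look like the sketch below, reusing the hypothetical `encoder` and `reward_decoder` from the previous snippet; `explored` stands for reward-free $(s, a, s')$ transitions collected with the policy conditioned on $z \sim p(z)$, and `offline_context` for a context $h$ drawn from the offline dataset. All names are illustrative, not the paper's.

```python
import torch

def relabel_trajectories(explored, offline_context, encoder, reward_decoder):
    """Attach predicted rewards to reward-free online transitions (sketch)."""
    # Infer a posterior task embedding from an offline context, so the reward
    # model is queried with a z close to those it was trained on.
    z_mean, z_std = encoder(offline_context)
    z = torch.distributions.Normal(z_mean, z_std).sample()

    relabeled = []
    for s, a, s_next in explored:
        with torch.no_grad():
            r_hat = reward_decoder(s, a, z)   # predicted reward r_{phi_d}(s, a, z)
        relabeled.append((s, a, r_hat, s_next))
    return relabeled
```

The policy and critic are then fine-tuned with the same losses as above on the offline dataset augmented with these relabeled transitions.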