MACAW
MACAW (meta actor-critic with advantage weighting) combines MAML with advantage-weighted regression (AWR) to obtain stable, bootstrap-free gradients for both the policy and the value function:
| Objective | Target |
| --- | --- |
| $L_V(\mathcal{D}) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\big(V_\phi(s) - R_{\mathcal{D}}(s,a)\big)^2\big]$ | $\phi$ |
| $L_{\mathrm{AWR}}(\mathcal{D}) = -\mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\log\pi_\theta(a\mid s)\,\exp\big(\tfrac{1}{T}\big(R_{\mathcal{D}}(s,a) - V_{\phi'}(s)\big)\big)\big]$ | $\theta$ |
where $R_{\mathcal{D}}(s, a)$ is the Monte-Carlo cumulative return. In addition, MACAW introduces an extra objective for the inner-loop policy update:
$$L_{\mathrm{ADV}}(\mathcal{D}) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\big(R_{\mathcal{D}}(s,a) - V_{\phi'}(s) - A_\theta(s,a)\big)^2\big]$$

$$L_\pi(\mathcal{D}) = L_{\mathrm{AWR}}(\mathcal{D}) + \lambda\, L_{\mathrm{ADV}}(\mathcal{D})$$
where the advantage network $A_\theta$ shares its body with the policy network and is updated through this advantage-regression objective.
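To make the shape of these objectives concrete, below is a minimal PyTorch-style sketch of the per-task losses, assuming batches of states, actions, and Monte-Carlo returns sampled from an offline dataset; the network interfaces (`policy_net.log_prob`, `adv_head`), the temperature `T`, and the weight `lam` are illustrative assumptions rather than MACAW's actual implementation.

```python
import torch
import torch.nn.functional as F

def value_loss(value_net, states, returns):
    # L_V(D): regress V_phi(s) onto the Monte-Carlo return R_D(s, a).
    return F.mse_loss(value_net(states).squeeze(-1), returns)

def awr_loss(policy_net, adapted_value_net, states, actions, returns, T=1.0):
    # L_AWR(D): advantage-weighted log-likelihood, with the adapted value
    # function V_{phi'} as a bootstrap-free baseline.
    with torch.no_grad():
        weights = torch.exp((returns - adapted_value_net(states).squeeze(-1)) / T)
    log_prob = policy_net.log_prob(states, actions)  # assumed interface
    return -(weights * log_prob).mean()

def adv_regression_loss(adv_head, adapted_value_net, states, actions, returns):
    # L_ADV(D): regress the advantage head A_theta(s, a), which shares its
    # body with the policy, onto the Monte-Carlo advantage R_D - V_{phi'}.
    with torch.no_grad():
        target_adv = returns - adapted_value_net(states).squeeze(-1)
    return F.mse_loss(adv_head(states, actions).squeeze(-1), target_adv)

def policy_inner_loss(policy_net, adv_head, adapted_value_net,
                      states, actions, returns, T=1.0, lam=1.0):
    # L_pi(D) = L_AWR(D) + lambda * L_ADV(D), the inner-loop policy objective.
    return (awr_loss(policy_net, adapted_value_net, states, actions, returns, T)
            + lam * adv_regression_loss(adv_head, adapted_value_net,
                                        states, actions, returns))
```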
(Figure: MACAW's meta-training and meta-testing procedures.)
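As a rough sketch of how these losses fit into the MAML-style outer/inner loop suggested by the figure, assuming functional loss closures over explicit parameter dictionaries and a hypothetical `sample_support_query` helper (not MACAW's actual code):

```python
import torch

def inner_update(loss, params, lr):
    # One gradient step on a dict of named parameters, keeping the graph so
    # the outer loss can differentiate through the adaptation (MAML-style).
    grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
    return {name: p - lr * g for (name, p), g in zip(params.items(), grads)}

def meta_train_step(tasks, policy_params, value_params,
                    value_loss_fn, policy_loss_fn, awr_loss_fn,
                    inner_lr, meta_opt):
    meta_loss = 0.0
    for task in tasks:
        support, query = sample_support_query(task)  # hypothetical helper
        # Inner loop: adapt the value function first, then the policy.
        phi_prime = inner_update(value_loss_fn(value_params, support),
                                 value_params, inner_lr)
        theta_prime = inner_update(policy_loss_fn(policy_params, phi_prime, support),
                                   policy_params, inner_lr)
        # Outer loop: evaluate the adapted parameters on held-out data.
        meta_loss = meta_loss + value_loss_fn(phi_prime, query) \
                              + awr_loss_fn(theta_prime, phi_prime, query)
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()
```

At meta-test time, only the inner-loop adaptation is run on the offline data of the new task, without any outer-loop update.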
This modification increases the expressivity of the meta-learner and yields a significant performance improvement.
To further increase the expressivity of MAML's gradient, MACAW replaces each MLP layer with a hyper-network-style variant:
$$y = \sigma\big(W^\top x + b\big) \;\Longrightarrow\; y = \sigma\big(W(z)^\top x + b(z)\big), \qquad \big[\mathrm{Flatten}\big(W(z)\big);\, b(z)\big] = W_{wt}\, z, \quad W_{wt} \in \mathbb{R}^{(d_i d_o + d_o)\times c},\; z \in \mathbb{R}^{c}$$
The weight-transform matrix $W_{wt}$ and the latent code $z$ of each layer are both learnable parameters, so a single inner-loop gradient step can produce weight-matrix updates of rank up to the dimensionality $c$ of the latent code, compared to the rank-1 update a standard MLP layer receives.
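A minimal sketch of such a weight-transform layer, assuming the effective weights and bias are generated from a learnable latent code through a learnable transform matrix; the class and parameter names are illustrative, not MACAW's actual implementation.

```python
import torch
import torch.nn as nn

class WeightTransformLinear(nn.Module):
    def __init__(self, d_in, d_out, latent_dim):
        super().__init__()
        self.d_in, self.d_out = d_in, d_out
        # Latent code z in R^c and weight transform W_wt in R^{(d_i*d_o + d_o) x c}.
        self.z = nn.Parameter(torch.randn(latent_dim))
        self.w_wt = nn.Parameter(
            torch.randn(d_in * d_out + d_out, latent_dim) / latent_dim ** 0.5)

    def forward(self, x):
        # [Flatten(W(z)); b(z)] = W_wt @ z, then split into weight and bias.
        flat = self.w_wt @ self.z
        W = flat[: self.d_in * self.d_out].view(self.d_in, self.d_out)
        b = flat[self.d_in * self.d_out:]
        # Effective linear map W(z)^T x + b(z); the nonlinearity sigma is
        # applied outside the layer, as in the equation above.
        return x @ W + b
```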