MACAW

Offline Meta Training

MACAW combines MAML (Model-Agnostic Meta-Learning) with AWR (Advantage-Weighted Regression) to obtain more stable, bootstrap-free gradients for both the policy and the value function:

| Objective | Target |
| --- | --- |
| $\mathcal{L}_{V}(D) = \mathbb{E}_{(s,\, a) \sim D} \big[ V_{\phi}(s) - \mathcal{R}_{D}(s,\, a) \big]^{2}$ | $\phi$ |
| $\mathcal{L}^{\text{AWR}}(D) = -\mathbb{E}_{(s,\, a) \sim D} \left[ \log \pi_{\theta}(a \mid s) \exp \left( \dfrac{\mathcal{R}_{D}(s,\, a) - V_{\phi'}(s)}{T} \right) \right]$ | $\theta$ |

where $\mathcal{R}_{D}(s,\, a)$ is the cumulative (Monte Carlo) return computed from the offline dataset $D$. In addition, MACAW introduces an auxiliary objective for the inner-loop policy update:

$$\mathcal{L}^{\text{ADV}}(D) = \mathbb{E}_{(s,\, a) \sim D} \Big[ \mathcal{R}_{D}(s,\, a) - V_{\phi'}(s) - A_{\theta}(s,\, a) \Big]^{2}$$

$$\mathcal{L}_{\pi}(D) = \mathcal{L}^{\text{AWR}}(D) + \lambda\, \mathcal{L}^{\text{ADV}}(D)$$

where the advantage network $A_{\theta}$ shares its body with the policy network and is updated through advantage regression.
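As a concrete illustration, here is a minimal PyTorch-style sketch of these objectives. It assumes the batch already contains the Monte Carlo returns $\mathcal{R}_{D}(s, a)$, that the policy returns both an action distribution and the advantage prediction $A_{\theta}(s, a)$ from its shared body, and (for brevity) that the same value network stands in for both $V_{\phi}$ and the adapted $V_{\phi'}$. All names and the weight clipping are illustrative, not taken from the official MACAW code.

```python
import torch

def macaw_losses(policy, value_fn, batch, temperature=1.0, lam=1.0):
    """Value regression, advantage-weighted regression (AWR), and the
    advantage-regression term MACAW adds for the inner-loop policy update."""
    s, a, ret = batch["s"], batch["a"], batch["ret"]  # ret holds R_D(s, a)

    # L_V(D) = E[(V_phi(s) - R_D(s, a))^2]
    v = value_fn(s).squeeze(-1)
    value_loss = ((v - ret) ** 2).mean()

    # Advantage weights exp((R_D - V_phi') / T); the value estimate is detached
    # (here the same network stands in for the adapted V_phi').
    with torch.no_grad():
        adv_target = ret - value_fn(s).squeeze(-1)
        weights = torch.exp(adv_target / temperature).clamp(max=20.0)  # common stabilizer

    # The policy head returns an action distribution and A_theta(s, a) from a shared body.
    dist, adv_pred = policy(s, a)

    # L_AWR(D) = -E[log pi_theta(a|s) * exp((R_D - V_phi') / T)]
    awr_loss = -(dist.log_prob(a) * weights).mean()

    # L_ADV(D) = E[(R_D - V_phi' - A_theta(s, a))^2]
    adv_loss = ((adv_target - adv_pred.squeeze(-1)) ** 2).mean()

    # L_pi(D) = L_AWR(D) + lambda * L_ADV(D)
    policy_loss = awr_loss + lam * adv_loss
    return value_loss, policy_loss, awr_loss
```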

[Figure: MACAW meta-training and meta-testing procedures]

This modification increases the expressivity of the meta-learner and yields a significant performance improvement.
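One way to read the meta-training procedure is as MAML-style adaptation with the losses above: the value and policy parameters are adapted on a task's support data (the policy with $\mathcal{L}_{\pi}$), and the meta-parameters are updated through the adapted parameters on the query data. The sketch below uses `torch.func.functional_call` for a single differentiable inner step; `value_loss_fn`, `policy_loss_fn`, and `awr_loss_fn` are assumed wrappers around $\mathcal{L}_{V}$, $\mathcal{L}_{\pi}$, and $\mathcal{L}^{\text{AWR}}$, the task batches are assumed to carry precomputed returns and weights, and details such as multiple inner steps or per-parameter learning rates are omitted.

```python
import torch
from torch.func import functional_call

def adapt(model, params, loss_fn, batch, inner_lr):
    """One differentiable MAML-style inner gradient step on a functional copy
    of the model's parameters, so the outer loss can backpropagate through it."""
    forward = lambda *inputs: functional_call(model, params, inputs)
    loss = loss_fn(forward, batch)
    grads = torch.autograd.grad(loss, tuple(params.values()), create_graph=True)
    return {name: p - inner_lr * g for (name, p), g in zip(params.items(), grads)}

def meta_train_step(policy, value_fn, tasks, meta_opt,
                    value_loss_fn, policy_loss_fn, awr_loss_fn, inner_lr=1e-2):
    """One outer update: adapt on each task's support set with the inner
    objectives, then meta-update on the query set."""
    meta_opt.zero_grad()
    outer_loss = 0.0
    for task in tasks:
        support, query = task["support"], task["query"]
        # Differentiable inner adaptation of the value function and the policy.
        v_params = adapt(value_fn, dict(value_fn.named_parameters()),
                         value_loss_fn, support, inner_lr)
        p_params = adapt(policy, dict(policy.named_parameters()),
                         policy_loss_fn, support, inner_lr)
        # Outer objectives evaluated with the adapted parameters on query data.
        adapted_value = lambda *x: functional_call(value_fn, v_params, x)
        adapted_policy = lambda *x: functional_call(policy, p_params, x)
        outer_loss = outer_loss + (value_loss_fn(adapted_value, query)
                                   + awr_loss_fn(adapted_policy, query))
    (outer_loss / len(tasks)).backward()
    meta_opt.step()
```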

Weight Transform Layer

To further increase the expressivity of MAML's gradient updates, MACAW replaces each fully-connected layer with a hypernetwork-style variant:

$$y = \sigma \Big[ W^{\top} x + b \Big] \;\Longrightarrow\; y = \sigma \Big[ W(z)^{\top} x + b(z) \Big], \qquad \big[ \operatorname{Flatten}[W(z)];\; b(z) \big] = W^{\text{wt}} z, \quad W^{\text{wt}} \in \mathbb{R}^{(d_{i} d_{o} + d_{o}) \times c},\; z \in \mathbb{R}^{c}$$

The weight transform matrix $W^{\text{wt}}$ and the latent code $z$ of each layer are both learnable parameters, which yields weight-matrix updates of rank up to the dimensionality $c$ of the latent code, compared to the rank-1 updates of standard MLP layers.
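Below is a minimal sketch of such a weight-transform layer, assuming one latent code per layer and leaving the nonlinearity $\sigma$ to the caller; the class and attribute names are illustrative rather than taken from the official implementation.

```python
import torch
import torch.nn as nn

class WeightTransformLinear(nn.Module):
    """Linear layer whose weight matrix and bias are generated from a latent
    code z by a learned weight-transform matrix W_wt."""

    def __init__(self, d_in, d_out, latent_dim):
        super().__init__()
        self.d_in, self.d_out = d_in, d_out
        # z in R^c and W_wt in R^{(d_in * d_out + d_out) x c}: both learnable.
        self.z = nn.Parameter(torch.randn(latent_dim) / latent_dim ** 0.5)
        self.weight_transform = nn.Parameter(
            torch.randn(d_in * d_out + d_out, latent_dim)
            / (d_in * d_out + d_out) ** 0.5
        )

    def forward(self, x):
        # [Flatten(W(z)); b(z)] = W_wt @ z
        generated = self.weight_transform @ self.z
        w = generated[: self.d_in * self.d_out].view(self.d_in, self.d_out)
        b = generated[self.d_in * self.d_out:]
        return x @ w + b  # the nonlinearity sigma is applied by the caller
```

A policy or value network is then built by stacking these layers with the usual activations in between; during the inner loop, gradient steps on each layer's $z$ (and $W^{\text{wt}}$) produce the higher-rank weight updates described above.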

