MACAW
MACAW (meta actor-critic with advantage weighting) combines MAML with advantage-weighted regression (AWR) to obtain stable, bootstrap-free gradients for both the policy and the value function:
| Objective | Target |
| --- | --- |
| $L_V(\mathcal{D}) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\big(V_\phi(s) - R_{\mathcal{D}}(s,a)\big)^2\big]$ | $\phi$ |
| $L_{\mathrm{AWR}}(\mathcal{D}) = -\mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\log\pi_\theta(a\mid s)\,\exp\big(\tfrac{1}{T}\big(R_{\mathcal{D}}(s,a) - V_{\phi'}(s)\big)\big)\big]$ | $\theta$ |
where $R_{\mathcal{D}}(s, a)$ is the Monte-Carlo cumulative return. In addition, MACAW introduces an extra objective for the inner-loop policy update:
$$L_{\mathrm{ADV}}(\mathcal{D}) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\big(R_{\mathcal{D}}(s,a) - V_{\phi'}(s) - A_\theta(s,a)\big)^2\big]$$

$$L_\pi(\mathcal{D}) = L_{\mathrm{AWR}}(\mathcal{D}) + \lambda\, L_{\mathrm{ADV}}(\mathcal{D})$$
where the advantage network $A_\theta$ shares its body with the policy network and is updated through this advantage-regression objective.
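To make the shape of these objectives concrete, below is a minimal PyTorch-style sketch of the per-task losses, assuming batches of states, actions, and Monte-Carlo returns sampled from an offline dataset; the network interfaces (`policy_net.log_prob`, `adv_head`), the temperature `T`, and the weight `lam` are illustrative assumptions rather than MACAW's actual implementation.

```python
import torch
import torch.nn.functional as F

def value_loss(value_net, states, returns):
    # L_V(D): regress V_phi(s) onto the Monte-Carlo return R_D(s, a).
    return F.mse_loss(value_net(states).squeeze(-1), returns)

def awr_loss(policy_net, adapted_value_net, states, actions, returns, T=1.0):
    # L_AWR(D): advantage-weighted log-likelihood, with the adapted value
    # function V_{phi'} as a bootstrap-free baseline.
    with torch.no_grad():
        weights = torch.exp((returns - adapted_value_net(states).squeeze(-1)) / T)
    log_prob = policy_net.log_prob(states, actions)  # assumed interface
    return -(weights * log_prob).mean()

def adv_regression_loss(adv_head, adapted_value_net, states, actions, returns):
    # L_ADV(D): regress the advantage head A_theta(s, a), which shares its
    # body with the policy, onto the Monte-Carlo advantage R_D - V_{phi'}.
    with torch.no_grad():
        target_adv = returns - adapted_value_net(states).squeeze(-1)
    return F.mse_loss(adv_head(states, actions).squeeze(-1), target_adv)

def policy_inner_loss(policy_net, adv_head, adapted_value_net,
                      states, actions, returns, T=1.0, lam=1.0):
    # L_pi(D) = L_AWR(D) + lambda * L_ADV(D), the inner-loop policy objective.
    return (awr_loss(policy_net, adapted_value_net, states, actions, returns, T)
            + lam * adv_regression_loss(adv_head, adapted_value_net,
                                        states, actions, returns))
```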
(Figure: MACAW's meta-training and meta-testing procedures.)
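As a rough sketch of how these losses fit into the MAML-style outer/inner loop suggested by the figure, assuming functional loss closures over explicit parameter dictionaries and a hypothetical `sample_support_query` helper (not MACAW's actual code):

```python
import torch

def inner_update(loss, params, lr):
    # One gradient step on a dict of named parameters, keeping the graph so
    # the outer loss can differentiate through the adaptation (MAML-style).
    grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
    return {name: p - lr * g for (name, p), g in zip(params.items(), grads)}

def meta_train_step(tasks, policy_params, value_params,
                    value_loss_fn, policy_loss_fn, awr_loss_fn,
                    inner_lr, meta_opt):
    meta_loss = 0.0
    for task in tasks:
        support, query = sample_support_query(task)  # hypothetical helper
        # Inner loop: adapt the value function first, then the policy.
        phi_prime = inner_update(value_loss_fn(value_params, support),
                                 value_params, inner_lr)
        theta_prime = inner_update(policy_loss_fn(policy_params, phi_prime, support),
                                   policy_params, inner_lr)
        # Outer loop: evaluate the adapted parameters on held-out data.
        meta_loss = meta_loss + value_loss_fn(phi_prime, query) \
                              + awr_loss_fn(theta_prime, phi_prime, query)
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()
```

At meta-test time, only the inner-loop adaptation is run on the offline data of the new task, without any outer-loop update.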
This modification increases the expressivity of the meta-learner and yields a significant performance improvement.
To further increase the expressivity of MAML's gradient, MACAW replaces each MLP layer with a hyper-network-style variant:
$$y = \sigma\big(W^\top x + b\big) \;\Longrightarrow\; y = \sigma\big(W(z)^\top x + b(z)\big), \qquad \big[\mathrm{Flatten}\big(W(z)\big);\, b(z)\big] = W_{wt}\, z, \quad W_{wt} \in \mathbb{R}^{(d_i d_o + d_o)\times c},\; z \in \mathbb{R}^{c}$$
The weight-transform matrix $W_{wt}$ and the latent code $z$ of each layer are both learnable parameters, so a single inner-loop gradient step can produce weight-matrix updates of rank up to the dimensionality $c$ of the latent code, compared to the rank-1 update a standard MLP layer receives.
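A minimal sketch of such a weight-transform layer, assuming the effective weights and bias are generated from a learnable latent code through a learnable transform matrix; the class and parameter names are illustrative, not MACAW's actual implementation.

```python
import torch
import torch.nn as nn

class WeightTransformLinear(nn.Module):
    def __init__(self, d_in, d_out, latent_dim):
        super().__init__()
        self.d_in, self.d_out = d_in, d_out
        # Latent code z in R^c and weight transform W_wt in R^{(d_i*d_o + d_o) x c}.
        self.z = nn.Parameter(torch.randn(latent_dim))
        self.w_wt = nn.Parameter(
            torch.randn(d_in * d_out + d_out, latent_dim) / latent_dim ** 0.5)

    def forward(self, x):
        # [Flatten(W(z)); b(z)] = W_wt @ z, then split into weight and bias.
        flat = self.w_wt @ self.z
        W = flat[: self.d_in * self.d_out].view(self.d_in, self.d_out)
        b = flat[self.d_in * self.d_out:]
        # Effective linear map W(z)^T x + b(z); the nonlinearity sigma is
        # applied outside the layer, as in the equation above.
        return x @ W + b
```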