LAGMA
State Embedding
LAGMA constructs a discretized low-dimensional state embedding with a VQ-VAE (encoder $f_\phi$, decoder $f_\psi$, codebook $\mathbf{e}$):
$$\mathcal{L}_{VQ}(\phi, \psi, \mathbf{e}) = \big\|f_\psi(x_q) - s\big\|_2^2 + \lambda_{vq}\big\|\mathrm{sg}[x] - x_q\big\|_2^2 + \lambda_{commit}\big\|x - \mathrm{sg}[x_q]\big\|_2^2,$$
where $x = f_\phi(s)$, $x_q = [x]_q = \arg\min_{e_j \in \mathbf{e}} \|x - e_j\|_2$, and the decoder is trained through the straight-through estimator $f_\psi(x_q) \triangleq f_\psi(\mathrm{sg}[x_q] + x - \mathrm{sg}[x])$.
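A minimal PyTorch sketch of this objective (not the official LAGMA code; the network sizes, $\lambda$ values, and class/argument names are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateVQVAE(nn.Module):
    def __init__(self, state_dim, latent_dim=8, n_codes=64,
                 lambda_vq=1.0, lambda_commit=0.25):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))         # f_phi
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, state_dim))          # f_psi
        self.codebook = nn.Parameter(torch.randn(n_codes, latent_dim))  # e
        self.lambda_vq = lambda_vq
        self.lambda_commit = lambda_commit

    def quantize(self, x):
        # [x]_q = argmin_{e_j} ||x - e_j||_2
        dist = torch.cdist(x, self.codebook)        # (B, n_codes)
        idx = dist.argmin(dim=-1)
        return self.codebook[idx], idx

    def loss(self, s):
        x = self.encoder(s)                         # x = f_phi(s)
        x_q, _ = self.quantize(x)
        # straight-through estimator: f_psi(sg[x_q] + x - sg[x])
        s_hat = self.decoder(x_q.detach() + x - x.detach())
        recon = F.mse_loss(s_hat, s)                # ||f_psi(x_q) - s||^2
        vq = F.mse_loss(x_q, x.detach())            # ||sg[x] - x_q||^2
        commit = F.mse_loss(x, x_q.detach())        # ||x - sg[x_q]||^2
        return recon + self.lambda_vq * vq + self.lambda_commit * commit
```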
However, the projected embedding space of feasible states $\chi$ is narrow relative to the whole embedding space: most randomly initialized quantized vectors lie far from $\chi$, so only a few codebook entries are ever selected and trained throughout an episode.
(Figure: codebook coverage of the embedding space under the raw VQ-VAE, training with $\mathcal{L}_{cvr}^{all}$, and training with $\mathcal{L}_{cvr}$.)
To resolve this issue, LAGMA introduces an additional coverage loss that minimizes the overall distance between $x$ and the codebook vectors $\mathbf{e}$:
$$\mathcal{L}_{cvr}^{all}(\mathbf{e}) = \frac{1}{n_c}\sum_{j=1}^{n_c} \big\|\mathrm{sg}[x] - e_j\big\|_2^2$$
Because such a coverage loss may drive all quantized vectors toward the center of $\chi$, LAGMA adopts a variant of $\mathcal{L}_{cvr}^{all}$:
$$\mathcal{L}_{cvr}(\mathbf{e}) = \frac{1}{|J(t)|}\sum_{j \in J(t)} \big\|\mathrm{sg}[x = f_\phi(s_t)] - e_j\big\|_2^2$$
where the timestep-dependent index set $J(t)$ assigns codebook vectors uniformly across timesteps, so that the quantized vectors are spread along the trajectory.
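A sketch of both coverage losses, assuming $J(t)$ simply slices the codebook into contiguous groups of equal size per timestep (this grouping rule and all names are assumptions, not the paper's exact indexing):

```python
import torch

def coverage_loss_all(x, codebook):
    """L_cvr^all: pull every code e_j toward sg[x], averaged over the codebook."""
    # x: (B, latent_dim), codebook: (n_codes, latent_dim)
    diff = x.detach().unsqueeze(1) - codebook.unsqueeze(0)   # (B, n_codes, d)
    return diff.pow(2).sum(-1).mean()

def coverage_loss(x_t, t, codebook, episode_limit):
    """L_cvr: pull only the codes indexed by J(t) toward sg[x_t = f_phi(s_t)]."""
    n_codes = codebook.shape[0]
    per_t = max(n_codes // episode_limit, 1)
    start = (t * per_t) % n_codes                  # J(t): contiguous slice for timestep t
    e_j = codebook[start:start + per_t]            # (|J(t)|, d)
    diff = x_t.detach().unsqueeze(0) - e_j
    return diff.pow(2).sum(-1).mean()
```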
Intrinsic Reward
With the codebook, LAGMA records the cumulative return from $s$ in a buffer attached to the corresponding embedding $e = [f_\phi(s)]_q$.
The returns stored in this FIFO buffer give a count-based moving-average estimate of the state value function:
$$V(s) \approx C_q(s) = \frac{1}{m}\sum_{i=1}^{m} R_i\big(e = [f_\phi(s)]_q\big)$$
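A sketch of the per-code FIFO return buffers and the resulting estimate $C_q(s)$; the buffer size $m$ and all class/method names are illustrative assumptions:

```python
from collections import defaultdict, deque

class CodebookReturnBuffer:
    """Per-code FIFO buffers of cumulative returns, giving C_q(s)."""

    def __init__(self, m=32):
        # one FIFO buffer of at most m returns per codebook index
        self.returns = defaultdict(lambda: deque(maxlen=m))

    def add(self, code_idx, cumulative_return):
        # store R_i under the embedding e = [f_phi(s)]_q
        self.returns[code_idx].append(cumulative_return)

    def value(self, code_idx):
        # C_q(s) = (1/m) * sum_i R_i, i.e. the moving average of stored returns
        buf = self.returns[code_idx]
        return sum(buf) / len(buf) if buf else 0.0
```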
LAGMA also stores goal-reaching trajectories $\tau = \{s_{t:T}\}$ in the buffer of $e = [f_\phi(s_t)]_q$, using $C_q(s_t)$ as the priority.
These goal-reaching trajectories serve as references, so that only the desired transitions in $\tau = \{s_{t:T}\}$ are incentivized:
$$r^I = \gamma \max\Big\{C_q(s') - \max_{a'} Q^{tot}_{\theta^-}(s', a'),\; 0\Big\} \quad \text{s.t.}\quad x_q' \in \tau^* \,\wedge\, x_q \neq x_q' \,\wedge\, \tau^* \sim \mathcal{D}_\tau(e) \,\wedge\, e = [f_\phi(s_t)]_q$$
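A sketch of this reward rule, assuming the sampled reference trajectory $\tau^*$ is represented by the set of quantized indices it visits; all function and argument names are illustrative:

```python
def intrinsic_reward(code_idx, next_code_idx, tau_star_codes,
                     c_q_next, q_tot_target_next, gamma=0.99):
    """r^I = gamma * max{C_q(s') - max_a' Q_tot^-(s', a'), 0} for desired transitions."""
    on_reference = next_code_idx in tau_star_codes   # x_q' in tau*
    moved = next_code_idx != code_idx                # x_q != x_q'
    if on_reference and moved:
        return gamma * max(c_q_next - q_tot_target_next, 0.0)
    return 0.0
```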
The latent goal-guided intrinsic reward $r^I$ is then incorporated into the MARL training objective:
$$\mathcal{L}(\theta) = \Big[r + r^I + \gamma \max_{a'} Q^{tot}_{\theta^-}(s', a') - Q^{tot}_{\theta}(s, a)\Big]^2$$
For the desired transitions, this objective converges to the optimal one as $C_q(s')$ converges to $V^\star(s')$, since the intrinsic reward vanishes once $\max_{a'} Q^{tot}_{\theta^-}(s', a')$ matches $V^\star(s')$.
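A sketch of this objective for a batch of transitions, with an added episode-termination mask as is standard in Q-learning implementations; the function and argument names are illustrative assumptions:

```python
import torch

def lagma_td_loss(q_tot, r, r_int, q_tot_target_max, done, gamma=0.99):
    """Squared TD error with the intrinsic reward added to the target."""
    # y = r + r^I + gamma * max_a' Q_tot^-(s', a'); (1 - done) masks terminal steps
    target = r + r_int + gamma * (1.0 - done) * q_tot_target_max
    return (target.detach() - q_tot).pow(2).mean()
```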