LAGMA

State Embedding

LAGMA constructs a discretized low-dimensional state embedding through a VQ-VAE (encoder $f_{\phi}$, decoder $f_{\psi}$, codebook $\boldsymbol{e}$):

$$
\begin{gathered}
\mathcal{L}_{\text{VQ}}(\phi,\ \psi,\ \boldsymbol{e}) = \Big\| f_{\psi}(x_{q}) - s \Big\|_{2}^{2} + \lambda_{\text{vq}} \Big\| \operatorname{sg}[x] - x_{q} \Big\|_{2}^{2} + \lambda_{\text{commit}} \Big\| x - \operatorname{sg}[x_{q}] \Big\|_{2}^{2} \\[5mm]
x = f_{\phi}(s) \quad [x]_{q} = \argmin_{e \in \boldsymbol{e}} \| x - e \|_{2} \quad f_{\psi}(x_{q}) \triangleq f_{\psi} \Big( \operatorname{sg}[x_{q}] + x - \operatorname{sg}[x] \Big)
\end{gathered}
$$
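As a concrete reference, here is a minimal PyTorch sketch of this objective with the straight-through estimator. The encoder/decoder architectures, codebook size, and loss weights are illustrative assumptions, not LAGMA's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQVAE(nn.Module):
    # Hypothetical architecture; only the loss structure follows the equation above.
    def __init__(self, state_dim, latent_dim, n_codes, lambda_vq=1.0, lambda_commit=0.25):
        super().__init__()
        self.f_phi = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        self.f_psi = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, state_dim))
        self.codebook = nn.Parameter(torch.randn(n_codes, latent_dim))  # e
        self.lambda_vq, self.lambda_commit = lambda_vq, lambda_commit

    def quantize(self, x):
        # [x]_q = argmin_{e in codebook} ||x - e||_2
        dist = torch.cdist(x, self.codebook)       # (B, n_codes)
        idx = dist.argmin(dim=-1)                  # (B,)
        return self.codebook[idx], idx

    def vq_loss(self, s):
        x = self.f_phi(s)
        x_q, _ = self.quantize(x)
        # straight-through estimator: f_psi(sg[x_q] + x - sg[x])
        s_hat = self.f_psi(x_q.detach() + x - x.detach())
        recon = F.mse_loss(s_hat, s)               # ||f_psi(x_q) - s||^2 (mean-squared version)
        vq = F.mse_loss(x_q, x.detach())           # ||sg[x] - x_q||^2 (updates the codebook)
        commit = F.mse_loss(x, x_q.detach())       # ||x - sg[x_q]||^2 (updates the encoder)
        return recon + self.lambda_vq * vq + self.lambda_commit * commit
```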

However, the projected embedding space of feasible states $\chi$ is narrow compared to the whole embedding space: most randomly initialized quantized vectors lie far from $\chi$, so only a few codebook entries are ever selected and trained throughout an episode.

(Figure: codebook placement relative to $\chi$ under raw VQ-VAE training, training with $\mathcal{L}_{\text{cvr}}^{\text{all}}$, and training with $\mathcal{L}_{\text{cvr}}$.)

To resolve this issue, LAGMA introduces an additional coverage loss that minimizes the overall distance between $x$ and $\boldsymbol{e}$:

$$
\mathcal{L}_{\text{cvr}}^{\text{all}}(\boldsymbol{e}) = \frac{1}{n_{c}} \sum_{j = 1}^{n_{c}} \Big\| \operatorname{sg}[x] - e_{j} \Big\|_{2}^{2}
$$

However, such a coverage loss may push all quantized vectors toward the center of $\chi$, so LAGMA instead adopts a variant of $\mathcal{L}_{\text{cvr}}^{\text{all}}$:

$$
\mathcal{L}_{\text{cvr}}(\boldsymbol{e}) = \frac{1}{|\mathcal{J}(t)|} \sum_{j \in \mathcal{J}(t)} \Big\| \operatorname{sg}[x = f_{\phi}(s_{t})] - e_{j} \Big\|_{2}^{2}
$$

where the timestep-dependent index set $\mathcal{J}(t)$ distributes the quantized vectors uniformly across timesteps.
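A small sketch of both coverage terms is shown below. The way $\mathcal{J}(t)$ assigns an equal, contiguous share of the $n_c$ codes to each timestep of an episode of length $T$ is an assumption for illustration, not the paper's exact indexing scheme.

```python
import torch
import torch.nn.functional as F

def cvr_all_loss(x, codebook):
    # L_cvr^all: pull every codebook entry toward the stop-gradient encoding sg[x]
    # x: (latent_dim,) latent encoding of a single state; codebook: (n_codes, latent_dim)
    return F.mse_loss(codebook, x.detach().expand_as(codebook))

def timestep_index_set(t, T, n_codes):
    # Assumed J(t): an equal, contiguous block of codes per timestep bucket
    codes_per_step = max(n_codes // T, 1)
    start = (t * codes_per_step) % n_codes
    return torch.arange(start, min(start + codes_per_step, n_codes))

def cvr_loss(x_t, codebook, t, T):
    # L_cvr: only the codes indexed by J(t) are pulled toward sg[x_t]
    idx = timestep_index_set(t, T, codebook.shape[0])
    return F.mse_loss(codebook[idx], x_t.detach().expand(len(idx), -1))
```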

Intrinsic Reward

With the codebook, LAGMA records the cumulative return of $s$ into the buffer of the corresponding embedding $e = [f_{\phi}(s)]_{q}$.

The returns stored in this FIFO buffer are then used for a count-based moving-average estimate of the state value function:

$$
V(s) \approx C_{q}(s) = \frac{1}{m} \sum_{i = 1}^{m} R_{i}\big(e = [f_{\phi}(s)]_{q}\big)
$$
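A minimal sketch of this estimate, assuming one fixed-length FIFO buffer of returns per codebook index (the buffer size $m$ is an assumed hyperparameter):

```python
from collections import defaultdict, deque

class CodebookReturnBuffer:
    """Per-code FIFO buffers of cumulative returns, yielding C_q(s)."""

    def __init__(self, m=32):
        self.buffers = defaultdict(lambda: deque(maxlen=m))  # code index -> recent returns

    def add(self, code_idx, cumulative_return):
        # store the return R_i observed from a state s with e = [f_phi(s)]_q
        self.buffers[code_idx].append(cumulative_return)

    def value(self, code_idx):
        # V(s) ~= C_q(s): moving average of the stored returns for this code
        buf = self.buffers[code_idx]
        return sum(buf) / len(buf) if buf else 0.0
```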

LAGMA also stores goal-reaching trajectories $\tau = \{ s_{t:T} \}$ in the buffer of $e = [f_{\phi}(s_{t})]_{q}$, with $C_{q}(s_{t})$ as the priority.
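Below is a sketch of such a per-code trajectory buffer $\mathcal{D}_{\tau}(e)$; the fixed capacity, the priority-based eviction, and the greedy sampling rule are assumptions made for illustration.

```python
import heapq

class TrajectoryBuffer:
    """Stores goal-reaching trajectories under each code, prioritized by C_q(s_t)."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.store = {}      # code index -> min-heap of (priority, counter, trajectory)
        self._counter = 0    # tie-breaker so heapq never compares trajectories directly

    def add(self, code_idx, priority, trajectory):
        heap = self.store.setdefault(code_idx, [])
        heapq.heappush(heap, (priority, self._counter, trajectory))
        self._counter += 1
        if len(heap) > self.capacity:
            heapq.heappop(heap)              # evict the lowest-priority trajectory

    def sample(self, code_idx):
        # tau* ~ D_tau(e): here simply return the highest-priority trajectory, if any
        heap = self.store.get(code_idx, [])
        return max(heap)[2] if heap else None
```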

These goal-reaching trajectories serve as references so that only the desired transitions in $\tau = \{ s_{t:T} \}$ are incentivized:

$$
r^{I} = \gamma \max\left\{ \Big[ C_{q}(s') - \max_{a'} Q_{\theta^{-}}^{\text{tot}}(s',\ a') \Big],\ 0 \right\} \qquad \text{s.t.} \quad x_{q}' \in \tau^{*} \wedge x_{q} \ne x_{q}' \wedge \tau^{*} \sim \mathcal{D}_{\tau}(e) \wedge e = [f_{\phi}(s_{t})]_{q}
$$
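In code, the bonus might look like the sketch below; the caller is assumed to supply $C_{q}(s')$, the greedy target value $\max_{a'} Q_{\theta^{-}}^{\text{tot}}(s', a')$, and the set of quantized codes along the sampled reference trajectory $\tau^{*}$.

```python
def intrinsic_reward(code, next_code, reference_codes, c_q_next, q_tot_next_max, gamma):
    # r^I is only granted for desired transitions: the next quantized code must lie
    # on the sampled goal-reaching trajectory tau* and differ from the current code.
    if next_code not in reference_codes or next_code == code:
        return 0.0
    # gamma * max{ C_q(s') - max_a' Q_tot_target(s', a'), 0 }
    return gamma * max(c_q_next - q_tot_next_max, 0.0)
```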

The latent goal-guided intrinsic reward $r^{I}$ is then incorporated into MARL training; the overall objective is

$$
\mathcal{L}(\theta) = \Big[ r + r^{I} + \gamma \max_{a'} Q_{\theta^{-}}^{\text{tot}}(s',\ a') - Q_{\theta}^{\text{tot}}(s,\ a) \Big]^{2}
$$
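A sketch of this TD objective with the intrinsic bonus added to the environment reward; a frozen target network for $Q_{\theta^{-}}^{\text{tot}}$ and a terminal mask are assumed, as in standard value-factorization training.

```python
import torch

def td_loss(q_tot, q_tot_target_next_max, reward, r_intrinsic, gamma, terminated):
    # target built from the frozen network theta^-; (1 - terminated) masks terminal states
    target = reward + r_intrinsic + gamma * (1.0 - terminated) * q_tot_target_next_max
    return ((target.detach() - q_tot) ** 2).mean()
```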

For the desired transitions, this loss converges to the optimal Bellman loss as $C_{q}(s')$ converges to $V^{\star}(s')$.

