ODIS

Latent Skill Discovery

To achieve effective coordination on unseen tasks, ODIS discovers latent coordination skills from multi-task data for high-level behavior learning

The latent skills are modeled as latent variables underlying the actions, which can be learned through a $\beta$-VAE:

$$
\max_{\phi_{s},\ \theta_{a}} \mathcal{E}_{(s_{t},\ \tau_{t},\ a_{t}) \sim \mathcal{D}} \left[ \sum_{i=1}^{n} \mathcal{E}_{z_{t}^{i} \sim q(z_{t}^{i} \mid s_{t},\ a_{t};\ \phi_{s})} \log p(a_{t}^{i} \mid \tau_{t}^{i},\ z_{t}^{i};\ \theta_{a}) - \beta D_{\mathrm{KL}} \Big( q(z_{t}^{i} \mid s_{t},\ a_{t};\ \phi_{s})\ \|\ \tilde{p}(z_{t}^{i}) \Big) \right]
$$

where $\mathcal{D}$ is a multi-task offline dataset, $z_{t}^{i} \in \mathcal{Z}$ is a discrete latent skill, $q(z_{t}^{i} \mid s_{t},\ a_{t})$ and $p(a_{t}^{i} \mid \tau_{t}^{i},\ z_{t}^{i})$ are the state encoder and the action decoder respectively, and the prior distribution of latent skills $\tilde{p}(z_{t}^{i})$ is assumed to be a uniform categorical distribution
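
As an illustration, the following PyTorch sketch implements a loss of this form; the module names (`StateEncoder`, `ActionDecoder`), the skill-space size `K`, the tensor shapes, and the straight-through Gumbel-softmax sampling are assumptions made for clarity rather than the reference ODIS implementation:

```python
# Minimal sketch of a skill-discovery (beta-VAE) loss; shapes and module
# names are assumptions for illustration, not the authors' code.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 16  # assumed number of discrete latent skills |Z|

class StateEncoder(nn.Module):
    """q(z_t^i | s_t, a_t; phi_s): global state + joint action -> per-agent skill logits."""
    def __init__(self, state_dim, joint_act_dim, n_agents, hidden=128):
        super().__init__()
        self.n_agents = n_agents
        self.net = nn.Sequential(
            nn.Linear(state_dim + joint_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_agents * K),
        )

    def forward(self, s, a_onehot):                          # [B, state_dim], [B, joint_act_dim]
        logits = self.net(torch.cat([s, a_onehot], dim=-1))
        return logits.view(-1, self.n_agents, K)             # [B, n, K]

class ActionDecoder(nn.Module):
    """p(a_t^i | tau_t^i, z_t^i; theta_a): local trajectory embedding + skill -> action logits."""
    def __init__(self, traj_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim + K, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, tau_emb, z_onehot):                    # [B, n, traj_dim], [B, n, K]
        return self.net(torch.cat([tau_emb, z_onehot], dim=-1))  # [B, n, n_actions]

def skill_discovery_loss(encoder, decoder, s, a_onehot, a_idx, tau_emb, beta=0.1):
    logits = encoder(s, a_onehot)                            # [B, n, K]
    # Differentiable sample of the discrete skill via straight-through Gumbel-softmax.
    z = F.gumbel_softmax(logits, tau=1.0, hard=True)         # [B, n, K]
    act_logits = decoder(tau_emb, z)                         # [B, n, n_actions]
    # Reconstruction term: log p(a | tau, z) for the dataset actions a_idx ([B, n], long).
    recon = F.cross_entropy(act_logits.flatten(0, 1), a_idx.flatten())
    # KL(q(z|s,a) || uniform prior), available in closed form for a categorical.
    log_q = F.log_softmax(logits, dim=-1)
    kl = (log_q.exp() * (log_q + math.log(K))).sum(-1).mean()
    return recon + beta * kl
```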

Offline Policy Learning

Based on the discovered latent skills, ODIS performs multi-task offline Q-learning with a QMIX-style network:

$$
\begin{gathered}
\mathcal{L}_{\text{TD}}(\theta_{v}) = \mathcal{E}_{(s_{t},\ \tau_{t},\ a_{t},\ r_{t},\ s_{t+1},\ \tau_{t+1}) \sim \mathcal{D}}\, \mathcal{E}_{z_{t}^{i} \sim q(z_{t}^{i} \mid s_{t},\ a_{t};\ \phi_{s})} \Big[ r_{t} + \gamma Q_{\text{tot}}(\tau_{t+1},\ z_{t+1}^{\star};\ \theta_{v}^{-},\ s_{t+1}) - Q_{\text{tot}}(\tau_{t},\ z_{t};\ \theta_{v},\ s_{t}) \Big]^{2} \\[5mm]
\text{s.t.} \quad Q_{\text{tot}}(\tau_{t},\ z_{t};\ \theta_{v},\ s_{t}) = f_{\text{mix}} \Big[ Q^{i}(\tau_{t}^{i},\ z_{t}^{i}) \mid s_{t},\ \theta_{v} \Big], \quad z_{t}^{\star} = \left\{ \argmax_{z_{t}^{i}} Q^{i}(\tau_{t}^{i},\ z_{t}^{i};\ \theta_{v}) \right\}_{i=1}^{n}
\end{gathered}
$$
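
A minimal sketch of this skill-level TD loss, reusing the shapes above; `agent_q`, `mixer`, and their target copies are hypothetical stand-ins for the individual value networks and the monotonic mixing network:

```python
# Outline of the QMIX-style TD loss over discovered skills; the network
# interfaces are assumptions, so treat this as a sketch rather than ODIS code.
import torch
import torch.nn.functional as F

def td_loss(agent_q, mixer, agent_q_tgt, mixer_tgt, encoder,
            s, a_onehot, tau_emb, r, s_next, tau_emb_next, gamma=0.99):
    # Skills for the current step are drawn from the state encoder q(z | s, a).
    with torch.no_grad():
        z = F.gumbel_softmax(encoder(s, a_onehot), tau=1.0, hard=True)  # [B, n, K]

    q_all = agent_q(tau_emb)                         # [B, n, K]: Q^i(tau^i, .) over all skills
    q_taken = (q_all * z).sum(-1)                    # [B, n]: value of the sampled skill
    q_tot = mixer(q_taken, s)                        # [B]: monotonic mixing conditioned on s

    with torch.no_grad():
        q_next = agent_q_tgt(tau_emb_next)           # [B, n, K]
        q_next_max = q_next.max(dim=-1).values       # [B, n]: greedy per-agent skills z*
        target = r + gamma * mixer_tgt(q_next_max, s_next)  # [B]

    return F.mse_loss(q_tot, target)
```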

As the latent skills are discovered from the global state, it remains inefficient to learn to choose appropriate skills from local observations alone. ODIS introduces a consistency loss term to guide observation representation learning towards coordination

$$
\mathcal{L}_{\text{c}}(\phi_{o}) = \mathcal{E}_{(s_{t},\ \tau_{t},\ a_{t}) \sim \mathcal{D}} \left[ \sum_{i=1}^{n} D_{\mathrm{KL}} \Big( \hat{q}(z_{t}^{i} \mid \tau_{t}^{i})\ \|\ q(z_{t}^{i} \mid s_{t},\ a_{t};\ \phi_{s}) \Big) \right]
$$

where $\hat{q}(z_{t}^{i} \mid \tau_{t}^{i})$ predicts skills from the observation representation through the last layer of the state encoder
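
One possible implementation of the consistency term, assuming both posteriors are given as categorical logits and treating the state-side posterior as a detached target, since $\mathcal{L}_{\text{c}}$ only updates $\phi_{o}$:

```python
# Sketch of the consistency loss; the detach on the state-side posterior is an
# assumption reflecting that only the observation encoder phi_o is optimized here.
import torch.nn.functional as F

def consistency_loss(obs_logits, state_logits):
    """KL( q_hat(z | tau) || q(z | s, a) ), averaged over batch and agents."""
    # obs_logits:   [B, n, K] from q_hat(z | tau; phi_o)
    # state_logits: [B, n, K] from q(z | s, a; phi_s)
    log_q_hat = F.log_softmax(obs_logits, dim=-1)
    log_q_state = F.log_softmax(state_logits.detach(), dim=-1)
    kl = (log_q_hat.exp() * (log_q_hat - log_q_state)).sum(-1)
    return kl.mean()
```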

To tackle the out-of-distribution (OOD) issue in offline RL, the CQL regularization is also adopted, and the overall loss function is

$$
\mathcal{L}(\theta_{v},\ \phi_{o}) = \mathcal{L}_{\text{TD}}(\theta_{v}) + \alpha \mathcal{L}_{\text{CQL}}(\theta_{v}) + \lambda \mathcal{L}_{\text{c}}(\phi_{o})
$$
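
The sketch below applies the standard CQL "log-sum-exp minus dataset value" penalty at the skill level and then combines the three terms; the exact CQL variant used by ODIS is an assumption of this sketch:

```python
# Skill-level CQL penalty plus the combined objective; the penalty form is the
# generic CQL regularizer and is assumed here rather than taken from ODIS.
import torch

def cql_loss(q_all, z_data):
    # q_all:  [B, n, K] individual skill values Q^i(tau^i, .)
    # z_data: [B, n, K] one-hot skills inferred from the dataset via q(z | s, a)
    logsumexp = torch.logsumexp(q_all, dim=-1)       # pushes down values of all skills
    q_data = (q_all * z_data).sum(-1)                # pushes up in-distribution skills
    return (logsumexp - q_data).mean()

def total_loss(l_td, l_cql, l_c, alpha=1.0, lam=1.0):
    # L = L_TD + alpha * L_CQL + lambda * L_c
    return l_td + alpha * l_cql + lam * l_c
```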

The learned observation encoder and individual value networks are directly deployed on new tasks in a decentralized manner. The chosen high-level coordination skills are then decoded into low-level actions through the action decoder
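
A sketch of decentralized execution on an unseen task, reusing the hypothetical modules above; each agent greedily selects a skill from its local trajectory and the action decoder emits the primitive action, with unavailable actions masked out:

```python
# Per-agent execution sketch under the assumed interfaces above; not ODIS code.
import torch
import torch.nn.functional as F

@torch.no_grad()
def act(agent_q, decoder, tau_emb_i, avail_actions_i):
    # tau_emb_i:       [traj_dim] local trajectory embedding of one agent
    # avail_actions_i: [n_actions] boolean mask of currently available actions
    q_i = agent_q(tau_emb_i)                              # [K] individual skill values
    z = F.one_hot(q_i.argmax(-1), num_classes=q_i.shape[-1]).float()
    logits = decoder(tau_emb_i, z)                        # [n_actions] low-level action logits
    logits = logits.masked_fill(~avail_actions_i, float("-inf"))
    return logits.argmax(-1)                              # index of the chosen action
```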

