ODIS

Latent Skill Discovery

To achieve effective coordination on unseen tasks, ODIS discovers latent coordination skills from multi-task data for high-level behavior learning

The latent skills are modeled as latent variables underlying the actions, which can be learned through a $\beta$-VAE:

$$
\max_{\phi_{s},\ \theta_{a}} \mathcal{E}_{(s_{t},\ \tau_{t},\ a_{t}) \sim \mathcal{D}} \left[ \sum_{i=1}^{n} \mathcal{E}_{z_{t}^{i} \sim q(z_{t}^{i} \mid s_{t},\ a_{t};\ \phi_{s})} \log p(a_{t}^{i} \mid \tau_{t}^{i},\ z_{t}^{i};\ \theta_{a}) - \beta D_{\mathrm{KL}} \Big( q(z_{t}^{i} \mid s_{t},\ a_{t};\ \phi_{s})\ \|\ \tilde{p}(z_{t}^{i}) \Big) \right]
$$

where $\mathcal{D}$ is a multi-task offline dataset, $z_{t}^{i} \in \mathcal{Z}$ is a discrete latent skill, $q(z_{t}^{i} \mid s_{t},\ a_{t})$ and $p(a_{t}^{i} \mid \tau_{t}^{i},\ z_{t}^{i})$ are the state encoder and the action decoder respectively, and the prior distribution of latent skills $\tilde{p}(z_{t}^{i})$ is assumed to be a uniform categorical distribution
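
As an illustration, the following PyTorch sketch implements a loss of this form; the module names (`StateEncoder`, `ActionDecoder`), the skill-space size `K`, the tensor shapes, and the straight-through Gumbel-softmax sampling are assumptions made for clarity rather than the reference ODIS implementation:

```python
# Minimal sketch of a skill-discovery (beta-VAE) loss; shapes and module
# names are assumptions for illustration, not the authors' code.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 16  # assumed number of discrete latent skills |Z|

class StateEncoder(nn.Module):
    """q(z_t^i | s_t, a_t; phi_s): global state + joint action -> per-agent skill logits."""
    def __init__(self, state_dim, joint_act_dim, n_agents, hidden=128):
        super().__init__()
        self.n_agents = n_agents
        self.net = nn.Sequential(
            nn.Linear(state_dim + joint_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_agents * K),
        )

    def forward(self, s, a_onehot):                          # [B, state_dim], [B, joint_act_dim]
        logits = self.net(torch.cat([s, a_onehot], dim=-1))
        return logits.view(-1, self.n_agents, K)             # [B, n, K]

class ActionDecoder(nn.Module):
    """p(a_t^i | tau_t^i, z_t^i; theta_a): local trajectory embedding + skill -> action logits."""
    def __init__(self, traj_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim + K, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, tau_emb, z_onehot):                    # [B, n, traj_dim], [B, n, K]
        return self.net(torch.cat([tau_emb, z_onehot], dim=-1))  # [B, n, n_actions]

def skill_discovery_loss(encoder, decoder, s, a_onehot, a_idx, tau_emb, beta=0.1):
    logits = encoder(s, a_onehot)                            # [B, n, K]
    # Differentiable sample of the discrete skill via straight-through Gumbel-softmax.
    z = F.gumbel_softmax(logits, tau=1.0, hard=True)         # [B, n, K]
    act_logits = decoder(tau_emb, z)                         # [B, n, n_actions]
    # Reconstruction term: log p(a | tau, z) for the dataset actions a_idx ([B, n], long).
    recon = F.cross_entropy(act_logits.flatten(0, 1), a_idx.flatten())
    # KL(q(z|s,a) || uniform prior), available in closed form for a categorical.
    log_q = F.log_softmax(logits, dim=-1)
    kl = (log_q.exp() * (log_q + math.log(K))).sum(-1).mean()
    return recon + beta * kl
```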

Offline Policy Learning

Based on the discovered latent skills, ODIS performs multi-task offline Q-learning with a QMIX-style network:

$$
\begin{gathered}
\mathcal{L}_{\text{TD}}(\theta_{v}) = \mathcal{E}_{(s_{t},\ \tau_{t},\ a_{t},\ r_{t},\ s_{t+1},\ \tau_{t+1}) \sim \mathcal{D}}\, \mathcal{E}_{z_{t}^{i} \sim q(z_{t}^{i} \mid s_{t},\ a_{t};\ \phi_{s})} \Big[ r_{t} + \gamma Q_{\text{tot}}(\tau_{t+1},\ z_{t+1}^{\star};\ \theta_{v}^{-},\ s_{t+1}) - Q_{\text{tot}}(\tau_{t},\ z_{t};\ \theta_{v},\ s_{t}) \Big]^{2} \\[5mm]
\text{s.t.} \quad Q_{\text{tot}}(\tau_{t},\ z_{t};\ \theta_{v},\ s_{t}) = f_{\text{mix}} \Big[ Q^{i}(\tau_{t}^{i},\ z_{t}^{i}) \mid s_{t},\ \theta_{v} \Big], \quad z_{t}^{\star} = \left\{ \argmax_{z_{t}^{i}} Q^{i}(\tau_{t}^{i},\ z_{t}^{i};\ \theta_{v}) \right\}_{i=1}^{n}
\end{gathered}
$$
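
A minimal sketch of this skill-level TD loss, reusing the shapes above; `agent_q`, `mixer`, and their target copies are hypothetical stand-ins for the individual value networks and the monotonic mixing network:

```python
# Outline of the QMIX-style TD loss over discovered skills; the network
# interfaces are assumptions, so treat this as a sketch rather than ODIS code.
import torch
import torch.nn.functional as F

def td_loss(agent_q, mixer, agent_q_tgt, mixer_tgt, encoder,
            s, a_onehot, tau_emb, r, s_next, tau_emb_next, gamma=0.99):
    # Skills for the current step are drawn from the state encoder q(z | s, a).
    with torch.no_grad():
        z = F.gumbel_softmax(encoder(s, a_onehot), tau=1.0, hard=True)  # [B, n, K]

    q_all = agent_q(tau_emb)                         # [B, n, K]: Q^i(tau^i, .) over all skills
    q_taken = (q_all * z).sum(-1)                    # [B, n]: value of the sampled skill
    q_tot = mixer(q_taken, s)                        # [B]: monotonic mixing conditioned on s

    with torch.no_grad():
        q_next = agent_q_tgt(tau_emb_next)           # [B, n, K]
        q_next_max = q_next.max(dim=-1).values       # [B, n]: greedy per-agent skills z*
        target = r + gamma * mixer_tgt(q_next_max, s_next)  # [B]

    return F.mse_loss(q_tot, target)
```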

As the latent skills are discovered from the global state, it remains inefficient to learn to choose appropriate skills from local observations alone. ODIS introduces a consistency loss term to guide observation representation learning towards coordination

$$
\mathcal{L}_{\text{c}}(\phi_{o}) = \mathcal{E}_{(s_{t},\ \tau_{t},\ a_{t}) \sim \mathcal{D}} \left[ \sum_{i=1}^{n} D_{\mathrm{KL}} \Big( \hat{q}(z_{t}^{i} \mid \tau_{t}^{i})\ \|\ q(z_{t}^{i} \mid s_{t},\ a_{t};\ \phi_{s}) \Big) \right]
$$

where $\hat{q}(z_{t}^{i} \mid \tau_{t}^{i})$ predicts skills from the observation representation through the last layer of the state encoder
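
One possible implementation of the consistency term, assuming both posteriors are given as categorical logits and treating the state-side posterior as a detached target, since $\mathcal{L}_{\text{c}}$ only updates $\phi_{o}$:

```python
# Sketch of the consistency loss; the detach on the state-side posterior is an
# assumption reflecting that only the observation encoder phi_o is optimized here.
import torch.nn.functional as F

def consistency_loss(obs_logits, state_logits):
    """KL( q_hat(z | tau) || q(z | s, a) ), averaged over batch and agents."""
    # obs_logits:   [B, n, K] from q_hat(z | tau; phi_o)
    # state_logits: [B, n, K] from q(z | s, a; phi_s)
    log_q_hat = F.log_softmax(obs_logits, dim=-1)
    log_q_state = F.log_softmax(state_logits.detach(), dim=-1)
    kl = (log_q_hat.exp() * (log_q_hat - log_q_state)).sum(-1)
    return kl.mean()
```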

To tackle the out-of-distribution (OOD) issue in offline RL, the CQL regularization is also adopted, and the overall loss function is

$$
\mathcal{L}(\theta_{v},\ \phi_{o}) = \mathcal{L}_{\text{TD}}(\theta_{v}) + \alpha \mathcal{L}_{\text{CQL}}(\theta_{v}) + \lambda \mathcal{L}_{\text{c}}(\phi_{o})
$$
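
The sketch below applies the standard CQL "log-sum-exp minus dataset value" penalty at the skill level and then combines the three terms; the exact CQL variant used by ODIS is an assumption of this sketch:

```python
# Skill-level CQL penalty plus the combined objective; the penalty form is the
# generic CQL regularizer and is assumed here rather than taken from ODIS.
import torch

def cql_loss(q_all, z_data):
    # q_all:  [B, n, K] individual skill values Q^i(tau^i, .)
    # z_data: [B, n, K] one-hot skills inferred from the dataset via q(z | s, a)
    logsumexp = torch.logsumexp(q_all, dim=-1)       # pushes down values of all skills
    q_data = (q_all * z_data).sum(-1)                # pushes up in-distribution skills
    return (logsumexp - q_data).mean()

def total_loss(l_td, l_cql, l_c, alpha=1.0, lam=1.0):
    # L = L_TD + alpha * L_CQL + lambda * L_c
    return l_td + alpha * l_cql + lam * l_c
```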

The learned observation encoder and individual value networks are directly deployed on new tasks in a decentralized manner. The chosen high-level coordination skills are then decoded into low-level actions through the action decoder
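
A sketch of decentralized execution on an unseen task, reusing the hypothetical modules above; each agent greedily selects a skill from its local trajectory and the action decoder emits the primitive action, with unavailable actions masked out:

```python
# Per-agent execution sketch under the assumed interfaces above; not ODIS code.
import torch
import torch.nn.functional as F

@torch.no_grad()
def act(agent_q, decoder, tau_emb_i, avail_actions_i):
    # tau_emb_i:       [traj_dim] local trajectory embedding of one agent
    # avail_actions_i: [n_actions] boolean mask of currently available actions
    q_i = agent_q(tau_emb_i)                              # [K] individual skill values
    z = F.one_hot(q_i.argmax(-1), num_classes=q_i.shape[-1]).float()
    logits = decoder(tau_emb_i, z)                        # [n_actions] low-level action logits
    logits = logits.masked_fill(~avail_actions_i, float("-inf"))
    return logits.argmax(-1)                              # index of the chosen action
```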

