Dreamer

Dreamer Series

Dreamer v1

World Model Learning

| Model | Type | Definition | Distribution Family |
| --- | --- | --- | --- |
| Transition | Generation | s_{t} \sim p_{\theta}(s_{t} \mid s_{t - 1},\ a_{t - 1}) | \mathcal{N} \big( \mu_{\theta}(s_{t - 1},\ a_{t - 1}),\ \mathrm{diag}_{\theta}(s_{t - 1},\ a_{t - 1}) \big) |
| Observation | Generation | o_{t} \sim p_{\theta}(o_{t} \mid s_{t}) | \mathcal{N} \big( \mu_{\theta}(s_{t}),\ \boldsymbol{I} \big) |
| Reward | Generation | r_{t} \sim p_{\theta}(r_{t} \mid s_{t}) | \mathcal{N} \big( \mu_{\theta}(s_{t}),\ 1 \big) |
| Posterior | Inference | s_{t} \sim q_{\theta}(s_{t} \mid s_{t - 1},\ a_{t - 1},\ o_{t}) | \mathcal{N} \big( \mu_{\theta}(s_{t - 1},\ a_{t - 1},\ o_{t}),\ \mathrm{diag}_{\theta}(s_{t - 1},\ a_{t - 1},\ o_{t}) \big) |
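
As a rough sketch (PyTorch assumed; layer sizes and activations here are illustrative, not the paper's), each of these Gaussian components can be parameterized by a small network that outputs a mean and a diagonal standard deviation:

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Diagonal Gaussian N(mu(x), diag(sigma(x))) over the latent state.
    A sketch of how the transition (prior) and posterior could be parameterized."""
    def __init__(self, in_dim, state_dim, hidden=200):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ELU())
        self.mean = nn.Linear(hidden, state_dim)
        self.std = nn.Linear(hidden, state_dim)

    def forward(self, x):
        h = self.trunk(x)
        std = nn.functional.softplus(self.std(h)) + 1e-4   # keep the std strictly positive
        return torch.distributions.Normal(self.mean(h), std)

# prior:     GaussianHead fed [s_{t-1}, a_{t-1}]
# posterior: GaussianHead fed [s_{t-1}, a_{t-1}, embed(o_t)]
```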

reward prediction

Simply learn to predict future rewards given actions and past observations

reconstruction

Learns the world model by reconstructing observations via a variational information bottleneck (VIB) objective

\max_{\theta} \mathcal{J}_{\mathrm{REC}} = \mathcal{E}_{s_{1:\mathrm{T}}} \left[ \sum_{t} \Big( \underbrace{\ln p_{\theta}(o_{t} \mid s_{t})}_{\mathcal{J}_{\mathrm{O}}^{t}} + \underbrace{\ln p_{\theta}(r_{t} \mid s_{t})}_{\mathcal{J}_{\mathrm{R}}^{t}} - \underbrace{\beta D_{\mathrm{KL}} \big( q_{\theta}(s_{t} \mid s_{t - 1},\ a_{t - 1},\ o_{t})\ \big\|\ p_{\theta}(s_{t} \mid s_{t - 1},\ a_{t - 1}) \big)}_{\mathcal{J}_{\mathrm{D}}^{t}} \Big) \right]
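
A rough one-step sketch of this objective, negated for gradient descent. The distribution objects passed in are assumptions: the posterior, prior, and decoder distributions are constructed elsewhere and only evaluated here.

```python
from torch.distributions import kl_divergence

def vib_loss(obs, reward, post, prior, obs_dist, reward_dist, beta=1.0):
    """One-step Dreamer v1 reconstruction objective, negated for minimization.
    post / prior are the posterior and prior state distributions; obs_dist / reward_dist
    are the decoder distributions evaluated at a sampled state (hypothetical inputs)."""
    recon_ll = obs_dist.log_prob(obs)         # J_O: ln p(o_t | s_t)
    reward_ll = reward_dist.log_prob(reward)  # J_R: ln p(r_t | s_t)
    kl = kl_divergence(post, prior)           # J_D: KL(q || p)
    return -(recon_ll + reward_ll - beta * kl).mean()
```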

contrastive estimation

Predicting observations can require high model capacity; it is also feasible to predict the state from the observation instead. The world model and an alternative state model q_{\theta}(s_{t} \mid o_{t}) can be trained via noise contrastive estimation (NCE)

\max_{\theta} \mathcal{J}_{\mathrm{NCE}} = \mathcal{E}_{s_{1:\mathrm{T}}} \left[ \sum_{t} \Big( \mathcal{J}_{\mathrm{S}}^{t} + \mathcal{J}_{\mathrm{R}}^{t} + \mathcal{J}_{\mathrm{D}}^{t} \Big) \right] \qquad \mathcal{J}_{\mathrm{S}}^{t} = \ln q_{\theta}(s_{t} \mid o_{t}) - \ln \left( \sum_{o'} q_{\theta}(s_{t} \mid o') \right)
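
With in-batch negatives, the contrastive term \mathcal{J}_{\mathrm{S}}^{t} reduces to a softmax cross-entropy over observations. A minimal sketch; the logits matrix and how it is built are assumptions, not the paper's exact estimator.

```python
import torch
import torch.nn.functional as F

def nce_state_loss(state_obs_logits):
    """state_obs_logits[i, j] ~ ln q(s_i | o_j) up to a constant. The diagonal pairs each
    state with its matching observation; every other column acts as a negative o'.
    Cross-entropy then equals -(ln q(s_t | o_t) - ln sum_{o'} q(s_t | o'))."""
    targets = torch.arange(state_obs_logits.shape[0])
    return F.cross_entropy(state_obs_logits, targets)
```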

Behavior Learning

Dreamer uses the TD(λ) method to train the critic network on imagined trajectories \{ s_{\tau},\ a_{\tau},\ r_{\tau} \}_{\tau = t}^{t + H}

V_{\tau}^{k} = \sum_{n = 0}^{k - 1} \gamma^{n} r_{\tau + n} + \gamma^{k} v_{\psi}(s_{\tau + k}) \qquad V_{\tau}^{\lambda} = (1 - \lambda) \sum_{k = 1}^{t + H - \tau - 1} \lambda^{k - 1} V_{\tau}^{k} + \lambda^{t + H - \tau - 1} V_{\tau}^{t + H - \tau}
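
The mixture of k-step returns above is equivalent to a backward recursion, which is how it is usually computed in practice. A small framework-agnostic sketch; shapes and default hyperparameters are illustrative.

```python
def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """V^lambda via V_t = r_t + gamma * ((1 - lam) * v(s_{t+1}) + lam * V_{t+1}),
    bootstrapping from the final value. rewards has length H, values has length H + 1."""
    returns = [None] * len(rewards)
    next_return = values[-1]  # bootstrap at the end of the imagination horizon
    for t in reversed(range(len(rewards))):
        next_return = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * next_return)
        returns[t] = next_return
    return returns
```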

Use the squared TD error against the λ-return target as the learning objective of the critic network v_{\psi}(s)

\min_{\psi} \mathcal{E}_{q_{\theta},\ q_{\phi}} \left( \frac{1}{2} \sum_{\tau = t}^{t + H} \Big[ v_{\psi}(s_{\tau}) - V_{\tau}^{\lambda} \Big]^{2} \right)

The action model outputs a tanh-transformed Gaussian, allowing for reparameterized sampling

a_{\tau} = \tanh \Big( \mu_{\phi}(s_{\tau}) + \sigma_{\phi}(s_{\tau})\, \epsilon \Big) \qquad \epsilon \sim \mathcal{N}(\boldsymbol{0},\ \boldsymbol{I})
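
In code this is a one-liner; the sketch below assumes the actor network has already produced mu and std.

```python
import torch

def sample_action(mu, std):
    """Reparameterized sample from the tanh-transformed Gaussian policy:
    a = tanh(mu + std * eps) with eps ~ N(0, I), so gradients flow into mu and std."""
    eps = torch.randn_like(mu)
    return torch.tanh(mu + std * eps)
```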

Use the TD target (the λ-return) as the learning objective of the actor network q_{\phi}(a \mid s)

\max_{\phi} \mathcal{E}_{q_{\theta},\ q_{\phi}} \left( \sum_{\tau = t}^{t + H} V_{\tau}^{\lambda} \right) \;\Rightarrow\; \min_{\phi} \mathcal{E}_{q_{\theta},\ q_{\phi}} \left( -\sum_{\tau = t}^{t + H} V_{\tau}^{\lambda} \right)
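
Since the imagined rewards and values are differentiable functions of the reparameterized actions, this objective is optimized by plain backpropagation through the rollout. A sketch reusing lambda_returns from above; the inputs are assumed to be differentiable tensors produced by an imagined rollout.

```python
def actor_loss(imagined_rewards, imagined_values, gamma=0.99, lam=0.95):
    """Dynamics-backprop actor objective: minimize the negated sum of lambda-returns,
    which pushes gradients back through reward, value, and transition models."""
    returns = lambda_returns(imagined_rewards, imagined_values, gamma, lam)
    return -sum(returns)
```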

| Methods | Feature | Difference in Dreamer |
| --- | --- | --- |
| Actor-Critic | directly calculates the policy gradient through the estimated action value | backpropagates the policy gradient through the value model |
| DDPG | backpropagates through the value model but maximizes only the immediate value | leverages gradients backpropagated from values over multiple steps through the transition model |

Dreamer v2

World Model Learning

| Component | Type | Definition | Distribution Family |
| --- | --- | --- | --- |
| Recurrent Model | Generation | h_{t} = f_{\phi}(h_{t - 1},\ s_{t - 1},\ a_{t - 1}) | Deterministic |
| Representation Model | Inference | s_{t} \sim q_{\phi}(s_{t} \mid h_{t},\ o_{t}) | Categorical |
| Transition Predictor | Generation | s_{t} \sim p_{\phi}(s_{t} \mid h_{t}) | Categorical |
| Image Predictor | Generation | o_{t} \sim p_{\phi}(o_{t} \mid h_{t},\ s_{t}) | \mathcal{N} \big( \mu_{\phi}(h_{t},\ s_{t}),\ \boldsymbol{I} \big) |
| Reward Predictor | Generation | r_{t} \sim p_{\phi}(r_{t} \mid h_{t},\ s_{t}) | \mathcal{N} \big( \mu_{\phi}(h_{t},\ s_{t}),\ 1 \big) |
| Discount Predictor | Generation | \gamma_{t} \sim p_{\phi}(\gamma_{t} \mid h_{t},\ s_{t}) | Bernoulli |

The probabilistic model of the stochastic state s_{t} outputs a vector of several categorical variables and optimizes them using straight-through gradients, rather than a Gaussian distribution with reparameterized sampling

sample = one_hot(draw(logits))              # sample has no gradient
probs = softmax(logits)                     # want gradient of this
sample = sample + probs - stop_grad(probs)  # has gradient of probs
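
For reference, PyTorch exposes the same trick as a ready-made distribution. The shapes below assume Dreamer v2's 32 categorical latents with 32 classes each; this is a usage sketch, not the paper's code.

```python
import torch
from torch.distributions import OneHotCategoricalStraightThrough

logits = torch.randn(32, 32)                 # 32 categorical latents x 32 classes
dist = OneHotCategoricalStraightThrough(logits=logits)
sample = dist.rsample()                      # one-hot samples that carry gradients w.r.t. logits
```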

The discount factor is set to a fixed hyperparameter within an episode and to zero for terminal time steps, which reveals the end of an episode during behavior learning.

All components of the world model are optimized jointly by maximizing the ELBO

\mathcal{J}(\phi) = \mathcal{E}_{s_{1:\mathrm{T}}} \left[ \sum_{t = 1}^{\mathrm{T}} \ln p_{\phi}(o_{t} \mid h_{t},\ s_{t}) + \ln p_{\phi}(r_{t} \mid h_{t},\ s_{t}) + \ln p_{\phi}(\gamma_{t} \mid h_{t},\ s_{t}) - \beta D_{\mathrm{KL}} \Big( q_{\phi}(\cdot \mid h_{t},\ o_{t})\ \big\|\ p_{\phi}(\cdot \mid h_{t}) \Big) \right]

Learning the transition function (prior) is more difficult. To avoid regularizing the representations toward a poorly trained prior, the KL loss is minimized faster w.r.t. the prior than w.r.t. the representations by using different learning rates for the two terms (KL balancing)

kl_loss = α * kl(stop_grad(posterior), prior) + (1 - α) * kl(posterior, stop_grad(prior))
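
A concrete version of this pseudocode using torch.distributions, assuming categorical latents; α = 0.8 is the balancing weight reported for Dreamer v2.

```python
import torch
from torch.distributions import Categorical, kl_divergence

def balanced_kl(post_logits, prior_logits, alpha=0.8):
    """KL balancing: train the prior toward the frozen posterior with weight alpha,
    and regularize the posterior toward the frozen prior with weight 1 - alpha."""
    post, prior = Categorical(logits=post_logits), Categorical(logits=prior_logits)
    post_sg = Categorical(logits=post_logits.detach())
    prior_sg = Categorical(logits=prior_logits.detach())
    return (alpha * kl_divergence(post_sg, prior).mean()
            + (1 - alpha) * kl_divergence(post, prior_sg).mean())
```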

Behavior Learning

Similar to Dreamer v1, the critic v_{\xi}(s) is trained with the TD(λ) method, while the actor p_{\psi}(a_{t} \mid s_{t}) is trained with REINFORCE gradients (Atari, ρ = 1) or dynamics backpropagation gradients (continuous control, ρ = 0)

\mathcal{J}(\psi) = \mathcal{E}_{s_{1:\mathrm{T}},\ a_{1:\mathrm{T}}} \left[ \sum_{t = 1}^{\mathrm{T} - 1} \rho \ln p_{\psi}(a_{t} \mid s_{t}) \operatorname{sg} \Big( V_{t}^{\lambda} - v_{\xi}(s_{t}) \Big) + (1 - \rho) V_{t}^{\lambda} + \eta\, \mathcal{H}\big[ p_{\psi}(a_{t} \mid s_{t}) \big] \right]

where

V_{t}^{\lambda} = r_{t} + \gamma_{t} \begin{cases} (1 - \lambda)\, v_{\xi}(s_{t + 1}) + \lambda V_{t + 1}^{\lambda} & t < \mathrm{T} \\ v_{\xi}(s_{\mathrm{T}}) & t = \mathrm{T} \end{cases}
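
A per-step sketch of this mixed actor objective, negated as a loss for gradient descent. Tensor inputs are assumed, and the defaults for ρ and η are illustrative.

```python
def dreamer_v2_actor_loss(log_prob, lam_return, value, entropy, rho=1.0, eta=1e-3):
    """rho = 1 gives the REINFORCE term with a stop-gradded advantage (Atari setting);
    rho = 0 gives pure dynamics backpropagation through the lambda-return (continuous control)."""
    reinforce = rho * log_prob * (lam_return - value).detach()
    dynamics = (1 - rho) * lam_return
    return -(reinforce + dynamics + eta * entropy)
```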

Dreamer v3

World Model Learning

| Component | Type | Definition | Distribution Family |
| --- | --- | --- | --- |
| Sequence Model | Generation | h_{t} = f_{\phi}(h_{t - 1},\ s_{t - 1},\ a_{t - 1}) | Deterministic |
| Encoder | Inference | s_{t} \sim q_{\phi}(s_{t} \mid h_{t},\ o_{t}) | Categorical |
| Dynamics Predictor | Generation | s_{t} \sim p_{\phi}(s_{t} \mid h_{t}) | Categorical |
| Decoder | Generation | o_{t} \sim p_{\phi}(o_{t} \mid h_{t},\ s_{t}) | |
| Reward Predictor | Generation | r_{t} \sim p_{\phi}(r_{t} \mid h_{t},\ s_{t}) | |
| Continue Predictor | Generation | c_{t} \sim p_{\phi}(c_{t} \mid h_{t},\ s_{t}) | Bernoulli |

The world model parameters ϕ\phi are optimized end-to-end to minimize

\mathcal{L}(\phi) = \mathcal{E}_{s_{1:\mathrm{T}}} \left[ \sum_{t = 1}^{\mathrm{T}} \beta_{\mathrm{pred}} \mathcal{L}_{\mathrm{pred}}(\phi) + \beta_{\mathrm{dyn}} \mathcal{L}_{\mathrm{dyn}}(\phi) + \beta_{\mathrm{rep}} \mathcal{L}_{\mathrm{rep}}(\phi) \right]

where

\begin{gathered} \mathcal{L}_{\mathrm{pred}}(\phi) = -\ln p_{\phi}(o_{t} \mid h_{t},\ s_{t}) - \ln p_{\phi}(r_{t} \mid h_{t},\ s_{t}) - \ln p_{\phi}(c_{t} \mid h_{t},\ s_{t}) \\[5mm] \mathcal{L}_{\mathrm{dyn}}(\phi) = \max \Big[ 1,\ D_{\mathrm{KL}} \big( \operatorname{sg}(q_{\phi}(s_{t} \mid h_{t},\ o_{t}))\ \big\|\ p_{\phi}(s_{t} \mid h_{t}) \big) \Big] \\[5mm] \mathcal{L}_{\mathrm{rep}}(\phi) = \max \Big[ 1,\ D_{\mathrm{KL}} \big( q_{\phi}(s_{t} \mid h_{t},\ o_{t})\ \big\|\ \operatorname{sg}(p_{\phi}(s_{t} \mid h_{t})) \big) \Big] \end{gathered}

To avoid a degenerate solution where the dynamics are trivial to predict but fail to contain enough information about the inputs, Dreamer v3 employs free bits by clipping the dynamics and representation losses below 1 nat.
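
A sketch of the two clipped KL terms over the categorical latents (torch assumed; clamping the batch-mean KL at 1 nat is one common way to implement free bits).

```python
import torch
from torch.distributions import Categorical, kl_divergence

def clipped_kl_losses(post_logits, prior_logits, free_nats=1.0):
    """L_dyn trains the prior toward the frozen encoder; L_rep regularizes the encoder
    toward the frozen prior. Both are clipped from below so they stop contributing
    gradients once the KL is already smaller than the free-bits threshold."""
    post, prior = Categorical(logits=post_logits), Categorical(logits=prior_logits)
    post_sg = Categorical(logits=post_logits.detach())
    prior_sg = Categorical(logits=prior_logits.detach())
    dyn = torch.clamp(kl_divergence(post_sg, prior).mean(), min=free_nats)
    rep = torch.clamp(kl_divergence(post, prior_sg).mean(), min=free_nats)
    return dyn, rep
```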

Behavior Learning

Dreamer v3 uses the REINFORCE gradient for the actor network \pi_{\theta}(a_{t} \mid s_{t}) with both discrete and continuous actions, minimizing

\mathcal{L}(\theta) = -\sum_{t = 1}^{\mathrm{T}} \operatorname{sg} \left[ \frac{R_{t}^{\lambda} - v_{\psi}(s_{t})}{\max(1,\ S)} \right] \ln \pi_{\theta}(a_{t} \mid s_{t}) - \eta\, \mathcal{H}\big[ \pi_{\theta}(a_{t} \mid s_{t}) \big]

To be robust to outliers, the return is normalized by an exponential moving average of the range between the 5th and 95th return percentiles over the batch of returns, and the resulting scale is clipped from below at 1

S = \operatorname{EMA} \big( \Delta_{t},\ 0.99 \big) \qquad \Delta_{t} = \operatorname{Per}(R_{t}^{\lambda},\ 95) - \operatorname{Per}(R_{t}^{\lambda},\ 5) \qquad S_{t} = 0.99\, S_{t - 1} + 0.01\, \Delta_{t}
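
A small sketch of this normalizer (numpy assumed; initializing the running scale to zero is a choice, not specified here).

```python
import numpy as np

class ReturnNormalizer:
    """Tracks S = EMA(95th - 5th percentile of the lambda-returns) and exposes max(1, S)
    as the denominator used to scale advantages in the actor loss."""
    def __init__(self, decay=0.99):
        self.decay, self.scale = decay, 0.0

    def update(self, lam_returns):
        spread = np.percentile(lam_returns, 95) - np.percentile(lam_returns, 5)
        self.scale = self.decay * self.scale + (1 - self.decay) * spread
        return max(1.0, self.scale)
```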

The critic network v_{\psi}(R_{t} \mid s_{t}) predicts a distribution over the bootstrapped λ-returns and is trained by maximizing their log-likelihood

\mathcal{L}(\psi) = -\sum_{t = 1}^{\mathrm{T}} \ln p_{\psi}(R_{t}^{\lambda} \mid s_{t}) \qquad R_{t}^{\lambda} = r_{t} + \gamma c_{t} \begin{cases} (1 - \lambda)\, \mathcal{E}\big[ v_{\psi}(\cdot \mid s_{t + 1}) \big] + \lambda R_{t + 1}^{\lambda} & t < \mathrm{T} \\ \mathcal{E}\big[ v_{\psi}(\cdot \mid s_{\mathrm{T}}) \big] & t = \mathrm{T} \end{cases}

Robust Predictions

It is challenging to reconstruct inputs and to predict rewards and returns whose scales can vary across domains. The symlog transformation is used to compress the magnitudes of both large positive and large negative values

\operatorname{symlog}(x) = \operatorname{sign}(x) \ln (|x| + 1) \qquad \operatorname{symexp}(y) = \operatorname{sign}(y) \big( \exp(|y|) - 1 \big)
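
These two functions are exact inverses of each other and are straightforward to transcribe (numpy sketch).

```python
import numpy as np

def symlog(x):
    return np.sign(x) * np.log(np.abs(x) + 1.0)

def symexp(y):
    return np.sign(y) * (np.exp(np.abs(y)) - 1.0)

# symexp inverts symlog for any sign and magnitude
assert np.allclose(symexp(symlog(np.array([-100.0, 0.0, 3.5]))), [-100.0, 0.0, 3.5])
```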

The distribution of potentially stochastic targets can be modeled with exponentially spaced bins B

y = \operatorname{softmax}(f(x))^{\top} B = \operatorname{softmax}(f(x))^{\top} \operatorname{symexp}\big( [-20,\ \cdots,\ +20] \big)

The ground-truth target can be encoded with the following twohot procedure

\operatorname{twohot}(y)_{i} = \begin{cases} |B_{k + 1} - y| \,\big/\, |B_{k + 1} - B_{k}| & i = k \\ |B_{k} - y| \,\big/\, |B_{k + 1} - B_{k}| & i = k + 1 \\ 0 & \text{else} \end{cases} \qquad k = \sum_{j = 1}^{|B|} \mathbb{I}(B_{j} < y)

The error between the prediction and the ground truth can then be written as a cross-entropy loss

\mathcal{L}(\theta) = -\operatorname{twohot}(y)^{\top} \log \operatorname{softmax}\big( f_{\theta}(x) \big)
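
A numpy sketch of the encoding and the resulting cross-entropy, reusing symexp from above; the bin count and the random logits are illustrative.

```python
import numpy as np

def twohot(y, bins):
    """Spread a scalar target over its two neighbouring bins so that the weighted
    average of the bin values reproduces y exactly; bins must be sorted ascending."""
    y = float(np.clip(y, bins[0], bins[-1]))
    k = int(np.clip(np.sum(bins < y) - 1, 0, len(bins) - 2))
    weights = np.zeros(len(bins))
    span = bins[k + 1] - bins[k]
    weights[k] = (bins[k + 1] - y) / span
    weights[k + 1] = (y - bins[k]) / span
    return weights

bins = symexp(np.linspace(-20.0, 20.0, 255))   # exponentially spaced bins B
target = twohot(7.3, bins)                     # two non-zero entries; target @ bins == 7.3
logits = np.random.randn(len(bins))            # stand-in for f_theta(x)
probs = np.exp(logits - logits.max())
probs /= probs.sum()
loss = -(target * np.log(probs)).sum()         # cross-entropy against the twohot target
```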

