Dreamer

Dreamer Series

Dreamer v1

World Model Learning

| Model | Type | Definition | Distribution Family |
| --- | --- | --- | --- |
| Transition | Generation | s_{t} \sim p_{\theta}(s_{t} \mid s_{t - 1},\ a_{t - 1}) | \mathcal{N} \big( \mu_{\theta}(s_{t - 1},\ a_{t - 1}),\ \mathrm{diag}_{\theta}(s_{t - 1},\ a_{t - 1}) \big) |
| Observation | Generation | o_{t} \sim p_{\theta}(o_{t} \mid s_{t}) | \mathcal{N} \big( \mu_{\theta}(s_{t}),\ \boldsymbol{I} \big) |
| Reward | Generation | r_{t} \sim p_{\theta}(r_{t} \mid s_{t}) | \mathcal{N} \big( \mu_{\theta}(s_{t}),\ 1 \big) |
| Posterior | Inference | s_{t} \sim q_{\theta}(s_{t} \mid s_{t - 1},\ a_{t - 1},\ o_{t}) | \mathcal{N} \big( \mu_{\theta}(s_{t - 1},\ a_{t - 1},\ o_{t}),\ \mathrm{diag}_{\theta}(s_{t - 1},\ a_{t - 1},\ o_{t}) \big) |
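
As a rough sketch (PyTorch assumed; layer sizes and activations here are illustrative, not the paper's), each of these Gaussian components can be parameterized by a small network that outputs a mean and a diagonal standard deviation:

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Diagonal Gaussian N(mu(x), diag(sigma(x))) over the latent state.
    A sketch of how the transition (prior) and posterior could be parameterized."""
    def __init__(self, in_dim, state_dim, hidden=200):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ELU())
        self.mean = nn.Linear(hidden, state_dim)
        self.std = nn.Linear(hidden, state_dim)

    def forward(self, x):
        h = self.trunk(x)
        std = nn.functional.softplus(self.std(h)) + 1e-4   # keep the std strictly positive
        return torch.distributions.Normal(self.mean(h), std)

# prior:     GaussianHead fed [s_{t-1}, a_{t-1}]
# posterior: GaussianHead fed [s_{t-1}, a_{t-1}, embed(o_t)]
```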

reward prediction

Simply learn to predict future rewards given actions and past observations

reconstruction

Learns the world model by reconstructing observations via a variational information bottleneck (VIB) objective

\max_{\theta} \mathcal{J}_{\mathrm{REC}} = \mathcal{E}_{s_{1:\mathrm{T}}} \left[ \sum_{t} \Big( \underbrace{\ln p_{\theta}(o_{t} \mid s_{t})}_{\mathcal{J}_{\mathrm{O}}^{t}} + \underbrace{\ln p_{\theta}(r_{t} \mid s_{t})}_{\mathcal{J}_{\mathrm{R}}^{t}} - \underbrace{\beta D_{\mathrm{KL}} \big( q_{\theta}(s_{t} \mid s_{t - 1},\ a_{t - 1},\ o_{t})\ \big\|\ p_{\theta}(s_{t} \mid s_{t - 1},\ a_{t - 1}) \big)}_{\mathcal{J}_{\mathrm{D}}^{t}} \Big) \right]
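
A rough one-step sketch of this objective, negated for gradient descent. The distribution objects passed in are assumptions: the posterior, prior, and decoder distributions are constructed elsewhere and only evaluated here.

```python
from torch.distributions import kl_divergence

def vib_loss(obs, reward, post, prior, obs_dist, reward_dist, beta=1.0):
    """One-step Dreamer v1 reconstruction objective, negated for minimization.
    post / prior are the posterior and prior state distributions; obs_dist / reward_dist
    are the decoder distributions evaluated at a sampled state (hypothetical inputs)."""
    recon_ll = obs_dist.log_prob(obs)         # J_O: ln p(o_t | s_t)
    reward_ll = reward_dist.log_prob(reward)  # J_R: ln p(r_t | s_t)
    kl = kl_divergence(post, prior)           # J_D: KL(q || p)
    return -(recon_ll + reward_ll - beta * kl).mean()
```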

contrastive estimation

Predicting observations can require high model capacity; it is also feasible to predict the state from the observation instead. The world model and an alternative state model q_{\theta}(s_{t} \mid o_{t}) can be trained via noise contrastive estimation (NCE)

\max_{\theta} \mathcal{J}_{\mathrm{NCE}} = \mathcal{E}_{s_{1:\mathrm{T}}} \left[ \sum_{t} \Big( \mathcal{J}_{\mathrm{S}}^{t} + \mathcal{J}_{\mathrm{R}}^{t} + \mathcal{J}_{\mathrm{D}}^{t} \Big) \right] \qquad \mathcal{J}_{\mathrm{S}}^{t} = \ln q_{\theta}(s_{t} \mid o_{t}) - \ln \left( \sum_{o'} q_{\theta}(s_{t} \mid o') \right)
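
With in-batch negatives, the contrastive term \mathcal{J}_{\mathrm{S}}^{t} reduces to a softmax cross-entropy over observations. A minimal sketch; the logits matrix and how it is built are assumptions, not the paper's exact estimator.

```python
import torch
import torch.nn.functional as F

def nce_state_loss(state_obs_logits):
    """state_obs_logits[i, j] ~ ln q(s_i | o_j) up to a constant. The diagonal pairs each
    state with its matching observation; every other column acts as a negative o'.
    Cross-entropy then equals -(ln q(s_t | o_t) - ln sum_{o'} q(s_t | o'))."""
    targets = torch.arange(state_obs_logits.shape[0])
    return F.cross_entropy(state_obs_logits, targets)
```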

Behavior Learning

Dreamer uses the TD(λ) method to train the critic network on imagined trajectories \{ s_{\tau},\ a_{\tau},\ r_{\tau} \}_{\tau = t}^{t + H}

V_{\tau}^{k} = \sum_{n = 0}^{k - 1} \gamma^{n} r_{\tau + n} + \gamma^{k} v_{\psi}(s_{\tau + k}) \qquad V_{\tau}^{\lambda} = (1 - \lambda) \sum_{k = 1}^{t + H - \tau - 1} \lambda^{k - 1} V_{\tau}^{k} + \lambda^{t + H - \tau - 1} V_{\tau}^{t + H - \tau}
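
The mixture of k-step returns above is equivalent to a backward recursion, which is how it is usually computed in practice. A small framework-agnostic sketch; shapes and default hyperparameters are illustrative.

```python
def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """V^lambda via V_t = r_t + gamma * ((1 - lam) * v(s_{t+1}) + lam * V_{t+1}),
    bootstrapping from the final value. rewards has length H, values has length H + 1."""
    returns = [None] * len(rewards)
    next_return = values[-1]  # bootstrap at the end of the imagination horizon
    for t in reversed(range(len(rewards))):
        next_return = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * next_return)
        returns[t] = next_return
    return returns
```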

Use the squared TD error against the λ-return target as the learning objective of the critic network v_{\psi}(s)

\min_{\psi} \mathcal{E}_{q_{\theta},\ q_{\phi}} \left( \frac{1}{2} \sum_{\tau = t}^{t + H} \Big[ v_{\psi}(s_{\tau}) - V_{\tau}^{\lambda} \Big]^{2} \right)

The action model outputs a tanh-transformed Gaussian, allowing for reparameterized sampling

a_{\tau} = \tanh \Big( \mu_{\phi}(s_{\tau}) + \sigma_{\phi}(s_{\tau})\, \epsilon \Big) \qquad \epsilon \sim \mathcal{N}(\boldsymbol{0},\ \boldsymbol{I})
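
In code this is a one-liner; the sketch below assumes the actor network has already produced mu and std.

```python
import torch

def sample_action(mu, std):
    """Reparameterized sample from the tanh-transformed Gaussian policy:
    a = tanh(mu + std * eps) with eps ~ N(0, I), so gradients flow into mu and std."""
    eps = torch.randn_like(mu)
    return torch.tanh(mu + std * eps)
```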

Use the TD target (the λ-return) as the learning objective of the actor network q_{\phi}(a \mid s)

\max_{\phi} \mathcal{E}_{q_{\theta},\ q_{\phi}} \left( \sum_{\tau = t}^{t + H} V_{\tau}^{\lambda} \right) \;\Rightarrow\; \min_{\phi} \mathcal{E}_{q_{\theta},\ q_{\phi}} \left( -\sum_{\tau = t}^{t + H} V_{\tau}^{\lambda} \right)
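
Since the imagined rewards and values are differentiable functions of the reparameterized actions, this objective is optimized by plain backpropagation through the rollout. A sketch reusing lambda_returns from above; the inputs are assumed to be differentiable tensors produced by an imagined rollout.

```python
def actor_loss(imagined_rewards, imagined_values, gamma=0.99, lam=0.95):
    """Dynamics-backprop actor objective: minimize the negated sum of lambda-returns,
    which pushes gradients back through reward, value, and transition models."""
    returns = lambda_returns(imagined_rewards, imagined_values, gamma, lam)
    return -sum(returns)
```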

| Methods | Feature | Difference in Dreamer |
| --- | --- | --- |
| Actor-Critic | directly calculates the policy gradient through the estimated action value | backpropagates the policy gradient through the value model |
| DDPG | backpropagates through the value model but maximizes only the immediate value | leverages gradients backpropagated from values over multiple steps through the transition model |

Dreamer v2

World Model Learning

| Component | Type | Definition | Distribution Family |
| --- | --- | --- | --- |
| Recurrent Model | Generation | h_{t} = f_{\phi}(h_{t - 1},\ s_{t - 1},\ a_{t - 1}) | Deterministic |
| Representation Model | Inference | s_{t} \sim q_{\phi}(s_{t} \mid h_{t},\ o_{t}) | Categorical |
| Transition Predictor | Generation | s_{t} \sim p_{\phi}(s_{t} \mid h_{t}) | Categorical |
| Image Predictor | Generation | o_{t} \sim p_{\phi}(o_{t} \mid h_{t},\ s_{t}) | \mathcal{N} \big( \mu_{\phi}(h_{t},\ s_{t}),\ \boldsymbol{I} \big) |
| Reward Predictor | Generation | r_{t} \sim p_{\phi}(r_{t} \mid h_{t},\ s_{t}) | \mathcal{N} \big( \mu_{\phi}(h_{t},\ s_{t}),\ 1 \big) |
| Discount Predictor | Generation | \gamma_{t} \sim p_{\phi}(\gamma_{t} \mid h_{t},\ s_{t}) | Bernoulli |

The probabilistic model of the stochastic state s_{t} outputs a vector of several categorical variables and optimizes them using straight-through gradients, rather than a Gaussian distribution with reparameterized sampling

sample = one_hot(draw(logits))              # sample has no gradient
probs = softmax(logits)                     # want gradient of this
sample = sample + probs - stop_grad(probs)  # has gradient of probs
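
For reference, PyTorch exposes the same trick as a ready-made distribution. The shapes below assume Dreamer v2's 32 categorical latents with 32 classes each; this is a usage sketch, not the paper's code.

```python
import torch
from torch.distributions import OneHotCategoricalStraightThrough

logits = torch.randn(32, 32)                 # 32 categorical latents x 32 classes
dist = OneHotCategoricalStraightThrough(logits=logits)
sample = dist.rsample()                      # one-hot samples that carry gradients w.r.t. logits
```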

The discount factor is set to a fixed hyperparameter within an episode and to zero for terminal time steps, which reveals the end of an episode during behavior learning.

All components of the world model are optimized jointly by maximizing the ELBO

\mathcal{J}(\phi) = \mathcal{E}_{s_{1:\mathrm{T}}} \left[ \sum_{t = 1}^{\mathrm{T}} \ln p_{\phi}(o_{t} \mid h_{t},\ s_{t}) + \ln p_{\phi}(r_{t} \mid h_{t},\ s_{t}) + \ln p_{\phi}(\gamma_{t} \mid h_{t},\ s_{t}) - \beta D_{\mathrm{KL}} \Big( q_{\phi}(\cdot \mid h_{t},\ o_{t})\ \big\|\ p_{\phi}(\cdot \mid h_{t}) \Big) \right]

Learning the transition function (prior) is more difficult. To avoid regularizing the representations toward a poorly trained prior, the KL loss is minimized faster w.r.t. the prior than w.r.t. the representations by using different learning rates for the two terms (KL balancing)

kl_loss = α * kl(stop_grad(posterior), prior) + (1 - α) * kl(posterior, stop_grad(prior))
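
A concrete version of this pseudocode using torch.distributions, assuming categorical latents; α = 0.8 is the balancing weight reported for Dreamer v2.

```python
import torch
from torch.distributions import Categorical, kl_divergence

def balanced_kl(post_logits, prior_logits, alpha=0.8):
    """KL balancing: train the prior toward the frozen posterior with weight alpha,
    and regularize the posterior toward the frozen prior with weight 1 - alpha."""
    post, prior = Categorical(logits=post_logits), Categorical(logits=prior_logits)
    post_sg = Categorical(logits=post_logits.detach())
    prior_sg = Categorical(logits=prior_logits.detach())
    return (alpha * kl_divergence(post_sg, prior).mean()
            + (1 - alpha) * kl_divergence(post, prior_sg).mean())
```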

Behavior Learning

Similar to Dreamer v1, the critic v_{\xi}(s) is trained with the TD(λ) method, while the actor p_{\psi}(a_{t} \mid s_{t}) is trained with REINFORCE gradients (Atari, ρ = 1) or dynamics backpropagation gradients (continuous control, ρ = 0)

\mathcal{J}(\psi) = \mathcal{E}_{s_{1:\mathrm{T}},\ a_{1:\mathrm{T}}} \left[ \sum_{t = 1}^{\mathrm{T} - 1} \rho \ln p_{\psi}(a_{t} \mid s_{t}) \operatorname{sg} \Big( V_{t}^{\lambda} - v_{\xi}(s_{t}) \Big) + (1 - \rho) V_{t}^{\lambda} + \eta\, \mathcal{H}\big[ p_{\psi}(a_{t} \mid s_{t}) \big] \right]

where

V_{t}^{\lambda} = r_{t} + \gamma_{t} \begin{cases} (1 - \lambda)\, v_{\xi}(s_{t + 1}) + \lambda V_{t + 1}^{\lambda} & t < \mathrm{T} \\ v_{\xi}(s_{\mathrm{T}}) & t = \mathrm{T} \end{cases}
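
A per-step sketch of this mixed actor objective, negated as a loss for gradient descent. Tensor inputs are assumed, and the defaults for ρ and η are illustrative.

```python
def dreamer_v2_actor_loss(log_prob, lam_return, value, entropy, rho=1.0, eta=1e-3):
    """rho = 1 gives the REINFORCE term with a stop-gradded advantage (Atari setting);
    rho = 0 gives pure dynamics backpropagation through the lambda-return (continuous control)."""
    reinforce = rho * log_prob * (lam_return - value).detach()
    dynamics = (1 - rho) * lam_return
    return -(reinforce + dynamics + eta * entropy)
```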

Dreamer v3

World Model Learning

| Component | Type | Definition | Distribution Family |
| --- | --- | --- | --- |
| Sequence Model | Generation | h_{t} = f_{\phi}(h_{t - 1},\ s_{t - 1},\ a_{t - 1}) | Deterministic |
| Encoder | Inference | s_{t} \sim q_{\phi}(s_{t} \mid h_{t},\ o_{t}) | Categorical |
| Dynamics Predictor | Generation | s_{t} \sim p_{\phi}(s_{t} \mid h_{t}) | Categorical |
| Decoder | Generation | o_{t} \sim p_{\phi}(o_{t} \mid h_{t},\ s_{t}) | |
| Reward Predictor | Generation | r_{t} \sim p_{\phi}(r_{t} \mid h_{t},\ s_{t}) | |
| Continue Predictor | Generation | c_{t} \sim p_{\phi}(c_{t} \mid h_{t},\ s_{t}) | Bernoulli |

The world model parameters ϕ\phi are optimized end-to-end to minimize

\mathcal{L}(\phi) = \mathcal{E}_{s_{1:\mathrm{T}}} \left[ \sum_{t = 1}^{\mathrm{T}} \beta_{\mathrm{pred}} \mathcal{L}_{\mathrm{pred}}(\phi) + \beta_{\mathrm{dyn}} \mathcal{L}_{\mathrm{dyn}}(\phi) + \beta_{\mathrm{rep}} \mathcal{L}_{\mathrm{rep}}(\phi) \right]

where

\begin{gathered} \mathcal{L}_{\mathrm{pred}}(\phi) = -\ln p_{\phi}(o_{t} \mid h_{t},\ s_{t}) - \ln p_{\phi}(r_{t} \mid h_{t},\ s_{t}) - \ln p_{\phi}(c_{t} \mid h_{t},\ s_{t}) \\[5mm] \mathcal{L}_{\mathrm{dyn}}(\phi) = \max \Big[ 1,\ D_{\mathrm{KL}} \big( \operatorname{sg}(q_{\phi}(s_{t} \mid h_{t},\ o_{t}))\ \big\|\ p_{\phi}(s_{t} \mid h_{t}) \big) \Big] \\[5mm] \mathcal{L}_{\mathrm{rep}}(\phi) = \max \Big[ 1,\ D_{\mathrm{KL}} \big( q_{\phi}(s_{t} \mid h_{t},\ o_{t})\ \big\|\ \operatorname{sg}(p_{\phi}(s_{t} \mid h_{t})) \big) \Big] \end{gathered}

To avoid a degenerate solution where the dynamics are trivial to predict but fail to contain enough information about the inputs, Dreamer v3 employs free bits by clipping the dynamics and representation losses below 1 nat.
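
A sketch of the two clipped KL terms over the categorical latents (torch assumed; clamping the batch-mean KL at 1 nat is one common way to implement free bits).

```python
import torch
from torch.distributions import Categorical, kl_divergence

def clipped_kl_losses(post_logits, prior_logits, free_nats=1.0):
    """L_dyn trains the prior toward the frozen encoder; L_rep regularizes the encoder
    toward the frozen prior. Both are clipped from below so they stop contributing
    gradients once the KL is already smaller than the free-bits threshold."""
    post, prior = Categorical(logits=post_logits), Categorical(logits=prior_logits)
    post_sg = Categorical(logits=post_logits.detach())
    prior_sg = Categorical(logits=prior_logits.detach())
    dyn = torch.clamp(kl_divergence(post_sg, prior).mean(), min=free_nats)
    rep = torch.clamp(kl_divergence(post, prior_sg).mean(), min=free_nats)
    return dyn, rep
```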

Behavior Learning

Dreamer v3 uses the REINFORCE gradient for the actor network \pi_{\theta}(a_{t} \mid s_{t}) with both discrete and continuous actions, minimizing

\mathcal{L}(\theta) = -\sum_{t = 1}^{\mathrm{T}} \operatorname{sg} \left[ \frac{R_{t}^{\lambda} - v_{\psi}(s_{t})}{\max(1,\ S)} \right] \ln \pi_{\theta}(a_{t} \mid s_{t}) - \eta\, \mathcal{H}\big[ \pi_{\theta}(a_{t} \mid s_{t}) \big]

To be robust to outliers, the return is normalized by an exponential moving average of the range between the 5th and 95th return percentiles over the batch of returns, and the resulting scale is clipped from below at 1

S = \operatorname{EMA} \big( \Delta_{t},\ 0.99 \big) \qquad \Delta_{t} = \operatorname{Per}(R_{t}^{\lambda},\ 95) - \operatorname{Per}(R_{t}^{\lambda},\ 5) \qquad S_{t} = 0.99\, S_{t - 1} + 0.01\, \Delta_{t}
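
A small sketch of this normalizer (numpy assumed; initializing the running scale to zero is a choice, not specified here).

```python
import numpy as np

class ReturnNormalizer:
    """Tracks S = EMA(95th - 5th percentile of the lambda-returns) and exposes max(1, S)
    as the denominator used to scale advantages in the actor loss."""
    def __init__(self, decay=0.99):
        self.decay, self.scale = decay, 0.0

    def update(self, lam_returns):
        spread = np.percentile(lam_returns, 95) - np.percentile(lam_returns, 5)
        self.scale = self.decay * self.scale + (1 - self.decay) * spread
        return max(1.0, self.scale)
```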

The critic network v_{\psi}(R_{t} \mid s_{t}) predicts a distribution over the bootstrapped λ-returns and is trained by maximizing their log-likelihood

\mathcal{L}(\psi) = -\sum_{t = 1}^{\mathrm{T}} \ln p_{\psi}(R_{t}^{\lambda} \mid s_{t}) \qquad R_{t}^{\lambda} = r_{t} + \gamma c_{t} \begin{cases} (1 - \lambda)\, \mathcal{E}\big[ v_{\psi}(\cdot \mid s_{t + 1}) \big] + \lambda R_{t + 1}^{\lambda} & t < \mathrm{T} \\ \mathcal{E}\big[ v_{\psi}(\cdot \mid s_{\mathrm{T}}) \big] & t = \mathrm{T} \end{cases}

Robust Predictions

It is challenging to reconstruct inputs and to predict rewards and returns whose scales can vary across domains. The symlog transformation is used to compress the magnitudes of both large positive and large negative values

\operatorname{symlog}(x) = \operatorname{sign}(x) \ln (|x| + 1) \qquad \operatorname{symexp}(y) = \operatorname{sign}(y) \big( \exp(|y|) - 1 \big)
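
These two functions are exact inverses of each other and are straightforward to transcribe (numpy sketch).

```python
import numpy as np

def symlog(x):
    return np.sign(x) * np.log(np.abs(x) + 1.0)

def symexp(y):
    return np.sign(y) * (np.exp(np.abs(y)) - 1.0)

# symexp inverts symlog for any sign and magnitude
assert np.allclose(symexp(symlog(np.array([-100.0, 0.0, 3.5]))), [-100.0, 0.0, 3.5])
```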

The distribution of potentially stochastic targets can be modeled with exponentially spaced bins B

y = \operatorname{softmax}(f(x))^{\top} B = \operatorname{softmax}(f(x))^{\top} \operatorname{symexp}\big( [-20,\ \cdots,\ +20] \big)

The ground-truth target can be encoded with the following twohot procedure

\operatorname{twohot}(y)_{i} = \begin{cases} |B_{k + 1} - y| \,\big/\, |B_{k + 1} - B_{k}| & i = k \\ |B_{k} - y| \,\big/\, |B_{k + 1} - B_{k}| & i = k + 1 \\ 0 & \text{else} \end{cases} \qquad k = \sum_{j = 1}^{|B|} \mathbb{I}(B_{j} < y)

The error between the prediction and the ground truth can then be written as a cross-entropy loss

\mathcal{L}(\theta) = -\operatorname{twohot}(y)^{\top} \log \operatorname{softmax}\big( f_{\theta}(x) \big)
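
A numpy sketch of the encoding and the resulting cross-entropy, reusing symexp from above; the bin count and the random logits are illustrative.

```python
import numpy as np

def twohot(y, bins):
    """Spread a scalar target over its two neighbouring bins so that the weighted
    average of the bin values reproduces y exactly; bins must be sorted ascending."""
    y = float(np.clip(y, bins[0], bins[-1]))
    k = int(np.clip(np.sum(bins < y) - 1, 0, len(bins) - 2))
    weights = np.zeros(len(bins))
    span = bins[k + 1] - bins[k]
    weights[k] = (bins[k + 1] - y) / span
    weights[k + 1] = (y - bins[k]) / span
    return weights

bins = symexp(np.linspace(-20.0, 20.0, 255))   # exponentially spaced bins B
target = twohot(7.3, bins)                     # two non-zero entries; target @ bins == 7.3
logits = np.random.randn(len(bins))            # stand-in for f_theta(x)
probs = np.exp(logits - logits.max())
probs /= probs.sum()
loss = -(target * np.log(probs)).sum()         # cross-entropy against the twohot target
```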

