DLLM

Goal Generation

Similar to ELLM, DLLM obtains a language caption $o_{t}^{l}$ of the current observation $o_{t}$ through an observation captioner

The observation caption $o_{t}^{l}$ and other available language descriptions of the environment are provided to GPT to generate $K$ suggested goals, which are then encoded into vector embeddings $g_{t}^{1:K}$ through SentenceBERT
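For concreteness, here is a minimal Python sketch of this goal-generation step. The captioner and the LLM call are passed in as plain callables since no specific API is fixed here, and the prompt wording and the SentenceBERT checkpoint name are illustrative assumptions.

```python
# Minimal sketch of goal generation: caption -> LLM suggestions -> SentenceBERT embeddings.
from sentence_transformers import SentenceTransformer

sbert = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint, not specified here

def generate_goal_embeddings(obs_caption, llm_fn, K=5):
    """obs_caption: language caption o_t^l of the current observation.
    llm_fn: callable mapping a prompt string to the LLM's text reply (e.g. a GPT wrapper).
    Returns the K suggested goal strings and their embeddings g_t^{1:K}."""
    prompt = (
        f"Current observation: {obs_caption}\n"
        f"Suggest {K} short exploration goals, one per line."   # assumed prompt format
    )
    goal_texts = llm_fn(prompt).strip().splitlines()[:K]
    goal_embeddings = sbert.encode(goal_texts)                   # (K, embed_dim) numpy array
    return goal_texts, goal_embeddings
```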

Intrinsic Reward

To stimulate meaningful and effective exploration, the intrinsic reward in a model rollout is calculated as

$$r_{t}^{\mathrm{int}} = \alpha \sum_{k = 1}^{K} w(u_{t} \mid g_{k}) \cdot i_{k} \cdot \mathbb{I}(\exists t_{k} \wedge t = t_{k})$$

The semantic similarity between goal $g_{k}$ and the embedding $u_{t}$ of the transition's language description is

$$w(u_{t} \mid g_{k}) = \frac{u_{t} \cdot g_{k}}{\| u_{t} \| \| g_{k} \|} \cdot \mathbb{I} \left[ \frac{u_{t} \cdot g_{k}}{\| u_{t} \| \| g_{k} \|} > M \right]$$

and $t_{k}$ denotes the time step $t$ at which $w(u_{t} \mid g_{k})$ first exceeds $M$ within the rollout horizon. To decrease the intrinsic reward for previously encountered goals, the novelty measure $i_{k}$ is calculated by RND

$$i = \frac{e - \operatorname{mean}(e_{1:B,\ 1:L,\ 1:K})}{\operatorname{std}(e_{1:B,\ 1:L,\ 1:K})}, \qquad e(g) = \| \hat{f}_{\theta}(g) - f(g) \|^{2}$$

where $f : \mathcal{G} \mapsto \mathbb{R}$ is the target network and $\hat{f}_{\theta} : \mathcal{G} \mapsto \mathbb{R}$ is the predictor network, the latter being trained to approximate the former. A lower prediction error indicates that semantically similar goals have already been explored more frequently, so they earn less intrinsic reward
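Putting the pieces together, the sketch below computes the intrinsic rewards for one rollout in NumPy. The RND target and predictor are fixed random linear maps here (in practice the predictor is trained on visited goals), the novelty is normalized only over the current $K$ goals rather than over a full $B \times L \times K$ batch, and the dimensions, $\alpha$, and $M$ are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 384, 64                        # embedding and RND feature dims: assumed sizes
W_target = rng.normal(size=(D, H))    # fixed random target network f
W_pred   = rng.normal(size=(D, H))    # predictor f_theta (trained to match f in practice)

def rnd_error(g):
    """Prediction error e(g) = ||f_theta(g) - f(g)||^2."""
    return float(np.sum((g @ W_pred - g @ W_target) ** 2))

def intrinsic_rewards(u_traj, goals, alpha=1.0, M=0.5):
    """u_traj: (L, D) transition-caption embeddings u_t; goals: (K, D) goal embeddings g_k."""
    errors = np.array([rnd_error(g) for g in goals])
    novelty = (errors - errors.mean()) / (errors.std() + 1e-8)   # i_k, normalized over these goals only
    rewards = np.zeros(len(u_traj))
    achieved = np.zeros(len(goals), dtype=bool)
    for t, u in enumerate(u_traj):
        for k, g in enumerate(goals):
            w = u @ g / (np.linalg.norm(u) * np.linalg.norm(g) + 1e-8)
            if w > M and not achieved[k]:        # first step where similarity exceeds M, i.e. t == t_k
                rewards[t] += alpha * w * novelty[k]
                achieved[k] = True
    return rewards
```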

World Model Learning

Following Dreamer v3, the RSSM-based world model in DLLM consists of the following components

| Component | Type | Definition | Description |
| --- | --- | --- | --- |
| Sequence Model | Generation | $\hat{z}_{t},\ h_{t} = \operatorname{seq}(z_{t-1},\ h_{t-1},\ a_{t-1})$ | Recurrent + Transition (Prior) |
| Encoder | Inference | $z_{t} \sim \operatorname{enc}(o_{t},\ u_{t},\ h_{t})$ | Representation (Posterior) |
| Decoder | Generation | $\hat{o}_{t},\ \hat{u}_{t},\ \hat{r}_{t},\ \hat{c}_{t} = \operatorname{dec}(z_{t},\ h_{t})$ | Observation / Transition / Reward / Continue Predictors |
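The sketch below only shows the shape of these three interfaces in PyTorch; it collapses the stochastic categorical latent into a plain vector and uses a single GRU cell with linear heads, so all layer choices and dimensions are simplifying assumptions rather than the actual DLLM/DreamerV3 architecture.

```python
import torch
import torch.nn as nn

class TinyRSSM(nn.Module):
    """Illustrative RSSM skeleton: sizes and layers are assumptions."""
    def __init__(self, obs_dim=64, lang_dim=384, act_dim=8, z_dim=32, h_dim=256):
        super().__init__()
        self.cell = nn.GRUCell(z_dim + act_dim, h_dim)           # recurrent part of seq(.)
        self.prior = nn.Linear(h_dim, z_dim)                     # transition predictor -> z_hat_t
        self.enc = nn.Linear(obs_dim + lang_dim + h_dim, z_dim)  # posterior encoder -> z_t
        self.dec = nn.Linear(z_dim + h_dim, obs_dim + lang_dim + 1 + 1)  # o_hat, u_hat, r_hat, c_hat

    def step(self, z_prev, h_prev, a_prev, o_t, u_t):
        h_t = self.cell(torch.cat([z_prev, a_prev], -1), h_prev)
        z_hat = self.prior(h_t)                                  # prior prediction from h_t alone
        z_t = self.enc(torch.cat([o_t, u_t, h_t], -1))           # posterior from observation + caption
        recon = self.dec(torch.cat([z_t, h_t], -1))              # decoded o_hat, u_hat, r_hat, c_hat
        return z_t, z_hat, h_t, recon
```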

The entire world model is trained through the following objective in an end-to-end manner

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{o} + \mathcal{L}_{u} + \mathcal{L}_{r} + \mathcal{L}_{c} + \beta_{1} \mathcal{L}_{\mathrm{pred}} + \beta_{2} \mathcal{L}_{\mathrm{reg}}$$

| Loss | Definition | Loss | Definition |
| --- | --- | --- | --- |
| Observation | $\mathcal{L}_{o} = \Vert \hat{o}_{t} - o_{t} \Vert_{2}^{2}$ | Transition | $\mathcal{L}_{u} = \operatorname{catxent}(\hat{u}_{t},\ u_{t})$ |
| Reward | $\mathcal{L}_{r} = \operatorname{catxent}(\hat{r}_{t},\ \operatorname{twohot}(r_{t}))$ | Continue | $\mathcal{L}_{c} = \operatorname{binxent}(\hat{c}_{t},\ c_{t})$ |
| Prediction | $\mathcal{L}_{\mathrm{pred}} = \max\big[1,\ D_{\mathrm{KL}}(\operatorname{sg}[z_{t}] \,\Vert\, \hat{z}_{t})\big]$ | Regularizer | $\mathcal{L}_{\mathrm{reg}} = \max\big[1,\ D_{\mathrm{KL}}(z_{t} \,\Vert\, \operatorname{sg}[\hat{z}_{t}])\big]$ |
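As a sketch of the two KL terms with stop-gradients and the $\max[1, \cdot]$ clipping (free bits), the snippet below assumes categorical latents parameterized by logits; the $\beta$ values are placeholders rather than the paper's settings, and the four reconstruction losses are passed in precomputed.

```python
import torch
from torch.distributions import Categorical, kl_divergence

def free_bits_kl(post_logits, prior_logits, stop_post, free_nats=1.0):
    """KL between categorical posterior z_t and prior z_hat_t with a stop-gradient
    on one side, clipped from below at `free_nats` (the max[1, KL] terms)."""
    post = Categorical(logits=post_logits.detach() if stop_post else post_logits)
    prior = Categorical(logits=prior_logits if stop_post else prior_logits.detach())
    return kl_divergence(post, prior).clamp(min=free_nats).mean()

def world_model_loss(recon_losses, post_logits, prior_logits, beta1=0.5, beta2=0.1):
    """recon_losses: dict holding the observation/transition/reward/continue terms.
    beta1/beta2 are placeholder coefficients."""
    l_pred = free_bits_kl(post_logits, prior_logits, stop_post=True)   # sg on the posterior
    l_reg  = free_bits_kl(post_logits, prior_logits, stop_post=False)  # sg on the prior
    return sum(recon_losses.values()) + beta1 * l_pred + beta2 * l_reg
```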

Behavior Learning

The actor $\pi_{\theta}(a_{t} \mid z_{t},\ h_{t})$ and critic $V_{\psi}(z_{t},\ h_{t})$ are trained with the behavior learning algorithm of Dreamer v3,

where the reward for behavior learning at each step combines both the extrinsic reward $r_{t}$ and the intrinsic reward $r_{t}^{\mathrm{int}}$
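A small sketch of how the combined reward could enter DreamerV3-style bootstrapped $\lambda$-returns over an imagined rollout; the simple sum $r_{t} + r_{t}^{\mathrm{int}}$, the discount, and $\lambda$ are assumptions for illustration.

```python
import numpy as np

def lambda_returns(r_ext, r_int, values, continues, gamma=0.997, lam=0.95):
    """values has length T+1 (includes the bootstrap value after the last reward);
    gamma, lam, and the unweighted sum r_ext + r_int are assumed, not the paper's settings."""
    rewards = r_ext + r_int                       # per-step reward for behavior learning
    T = len(rewards)
    returns = np.zeros(T)
    next_ret = values[T]                          # bootstrap from the final value estimate
    for t in reversed(range(T)):
        next_ret = rewards[t] + gamma * continues[t] * ((1 - lam) * values[t + 1] + lam * next_ret)
        returns[t] = next_ret
    return returns
```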

