DLLM

Goal Generation

Similar to ELLM, DLLM obtains a language caption $o_{t}^{l}$ of the current observation $o_{t}$ through an observation captioner

The observation caption $o_{t}^{l}$ and other available language descriptions of the environment are provided to GPT to generate $K$ suggested goals, which are then encoded into vector embeddings $g_{t}^{1:K}$ through SentenceBERT
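For concreteness, here is a minimal Python sketch of this goal-generation step. The captioner and the LLM call are passed in as plain callables since no specific API is fixed here, and the prompt wording and the SentenceBERT checkpoint name are illustrative assumptions.

```python
# Minimal sketch of goal generation: caption -> LLM suggestions -> SentenceBERT embeddings.
from sentence_transformers import SentenceTransformer

sbert = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint, not specified here

def generate_goal_embeddings(obs_caption, llm_fn, K=5):
    """obs_caption: language caption o_t^l of the current observation.
    llm_fn: callable mapping a prompt string to the LLM's text reply (e.g. a GPT wrapper).
    Returns the K suggested goal strings and their embeddings g_t^{1:K}."""
    prompt = (
        f"Current observation: {obs_caption}\n"
        f"Suggest {K} short exploration goals, one per line."   # assumed prompt format
    )
    goal_texts = llm_fn(prompt).strip().splitlines()[:K]
    goal_embeddings = sbert.encode(goal_texts)                   # (K, embed_dim) numpy array
    return goal_texts, goal_embeddings
```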

Intrinsic Reward

To stimulate meaningful and effective exploration, the intrinsic reward in a model rollout is calculated as

$$r_{t}^{\mathrm{int}} = \alpha \sum_{k = 1}^{K} w(u_{t} \mid g_{k}) \cdot i_{k} \cdot \mathbb{I}(\exists t_{k} \wedge t = t_{k})$$

The semantic similarity between goal $g_{k}$ and the embedding $u_{t}$ of the transition's language description is

$$w(u_{t} \mid g_{k}) = \frac{u_{t} \cdot g_{k}}{\| u_{t} \| \| g_{k} \|} \cdot \mathbb{I} \left[ \frac{u_{t} \cdot g_{k}}{\| u_{t} \| \| g_{k} \|} > M \right]$$

and $t_{k}$ denotes the time step $t$ at which $w(u_{t} \mid g_{k})$ first exceeds $M$ within the rollout horizon. To decrease the intrinsic reward for previously encountered goals, the novelty measure $i_{k}$ is calculated by RND

$$i = \frac{e - \operatorname{mean}(e_{1:B,\ 1:L,\ 1:K})}{\operatorname{std}(e_{1:B,\ 1:L,\ 1:K})}, \qquad e(g) = \| \hat{f}_{\theta}(g) - f(g) \|^{2}$$

where $f : \mathcal{G} \mapsto \mathbb{R}$ is the target network and $\hat{f}_{\theta} : \mathcal{G} \mapsto \mathbb{R}$ is the predictor network, the latter being trained to approximate the former. A lower prediction error indicates that semantically similar goals have already been explored more frequently, so they earn less intrinsic reward
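Putting the pieces together, the sketch below computes the intrinsic rewards for one rollout in NumPy. The RND target and predictor are fixed random linear maps here (in practice the predictor is trained on visited goals), the novelty is normalized only over the current $K$ goals rather than over a full $B \times L \times K$ batch, and the dimensions, $\alpha$, and $M$ are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 384, 64                        # embedding and RND feature dims: assumed sizes
W_target = rng.normal(size=(D, H))    # fixed random target network f
W_pred   = rng.normal(size=(D, H))    # predictor f_theta (trained to match f in practice)

def rnd_error(g):
    """Prediction error e(g) = ||f_theta(g) - f(g)||^2."""
    return float(np.sum((g @ W_pred - g @ W_target) ** 2))

def intrinsic_rewards(u_traj, goals, alpha=1.0, M=0.5):
    """u_traj: (L, D) transition-caption embeddings u_t; goals: (K, D) goal embeddings g_k."""
    errors = np.array([rnd_error(g) for g in goals])
    novelty = (errors - errors.mean()) / (errors.std() + 1e-8)   # i_k, normalized over these goals only
    rewards = np.zeros(len(u_traj))
    achieved = np.zeros(len(goals), dtype=bool)
    for t, u in enumerate(u_traj):
        for k, g in enumerate(goals):
            w = u @ g / (np.linalg.norm(u) * np.linalg.norm(g) + 1e-8)
            if w > M and not achieved[k]:        # first step where similarity exceeds M, i.e. t == t_k
                rewards[t] += alpha * w * novelty[k]
                achieved[k] = True
    return rewards
```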

World Model Learning

Following Dreamer v3, the RSSM-based world model in DLLM consists of the following components

| Component | Type | Definition | Description |
| --- | --- | --- | --- |
| Sequence Model | Generation | $\hat{z}_{t},\ h_{t} = \operatorname{seq}(z_{t-1},\ h_{t-1},\ a_{t-1})$ | Recurrent + Transition (Prior) |
| Encoder | Inference | $z_{t} \sim \operatorname{enc}(o_{t},\ u_{t},\ h_{t})$ | Representation (Posterior) |
| Decoder | Generation | $\hat{o}_{t},\ \hat{u}_{t},\ \hat{r}_{t},\ \hat{c}_{t} = \operatorname{dec}(z_{t},\ h_{t})$ | Observation / Transition / Reward / Continue Predictors |
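The sketch below only shows the shape of these three interfaces in PyTorch; it collapses the stochastic categorical latent into a plain vector and uses a single GRU cell with linear heads, so all layer choices and dimensions are simplifying assumptions rather than the actual DLLM/DreamerV3 architecture.

```python
import torch
import torch.nn as nn

class TinyRSSM(nn.Module):
    """Illustrative RSSM skeleton: sizes and layers are assumptions."""
    def __init__(self, obs_dim=64, lang_dim=384, act_dim=8, z_dim=32, h_dim=256):
        super().__init__()
        self.cell = nn.GRUCell(z_dim + act_dim, h_dim)           # recurrent part of seq(.)
        self.prior = nn.Linear(h_dim, z_dim)                     # transition predictor -> z_hat_t
        self.enc = nn.Linear(obs_dim + lang_dim + h_dim, z_dim)  # posterior encoder -> z_t
        self.dec = nn.Linear(z_dim + h_dim, obs_dim + lang_dim + 1 + 1)  # o_hat, u_hat, r_hat, c_hat

    def step(self, z_prev, h_prev, a_prev, o_t, u_t):
        h_t = self.cell(torch.cat([z_prev, a_prev], -1), h_prev)
        z_hat = self.prior(h_t)                                  # prior prediction from h_t alone
        z_t = self.enc(torch.cat([o_t, u_t, h_t], -1))           # posterior from observation + caption
        recon = self.dec(torch.cat([z_t, h_t], -1))              # decoded o_hat, u_hat, r_hat, c_hat
        return z_t, z_hat, h_t, recon
```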

The entire world model is trained through the following objective in an end-to-end manner

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{o} + \mathcal{L}_{u} + \mathcal{L}_{r} + \mathcal{L}_{c} + \beta_{1} \mathcal{L}_{\mathrm{pred}} + \beta_{2} \mathcal{L}_{\mathrm{reg}}$$

| Loss | Definition | Loss | Definition |
| --- | --- | --- | --- |
| Observation | $\mathcal{L}_{o} = \Vert \hat{o}_{t} - o_{t} \Vert_{2}^{2}$ | Transition | $\mathcal{L}_{u} = \operatorname{catxent}(\hat{u}_{t},\ u_{t})$ |
| Reward | $\mathcal{L}_{r} = \operatorname{catxent}(\hat{r}_{t},\ \operatorname{twohot}(r_{t}))$ | Continue | $\mathcal{L}_{c} = \operatorname{binxent}(\hat{c}_{t},\ c_{t})$ |
| Prediction | $\mathcal{L}_{\mathrm{pred}} = \max\big[1,\ D_{\mathrm{KL}}(\operatorname{sg}[z_{t}] \,\Vert\, \hat{z}_{t})\big]$ | Regularizer | $\mathcal{L}_{\mathrm{reg}} = \max\big[1,\ D_{\mathrm{KL}}(z_{t} \,\Vert\, \operatorname{sg}[\hat{z}_{t}])\big]$ |
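As a sketch of the two KL terms with stop-gradients and the $\max[1, \cdot]$ clipping (free bits), the snippet below assumes categorical latents parameterized by logits; the $\beta$ values are placeholders rather than the paper's settings, and the four reconstruction losses are passed in precomputed.

```python
import torch
from torch.distributions import Categorical, kl_divergence

def free_bits_kl(post_logits, prior_logits, stop_post, free_nats=1.0):
    """KL between categorical posterior z_t and prior z_hat_t with a stop-gradient
    on one side, clipped from below at `free_nats` (the max[1, KL] terms)."""
    post = Categorical(logits=post_logits.detach() if stop_post else post_logits)
    prior = Categorical(logits=prior_logits if stop_post else prior_logits.detach())
    return kl_divergence(post, prior).clamp(min=free_nats).mean()

def world_model_loss(recon_losses, post_logits, prior_logits, beta1=0.5, beta2=0.1):
    """recon_losses: dict holding the observation/transition/reward/continue terms.
    beta1/beta2 are placeholder coefficients."""
    l_pred = free_bits_kl(post_logits, prior_logits, stop_post=True)   # sg on the posterior
    l_reg  = free_bits_kl(post_logits, prior_logits, stop_post=False)  # sg on the prior
    return sum(recon_losses.values()) + beta1 * l_pred + beta2 * l_reg
```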

Behavior Learning

The actor $\pi_{\theta}(a_{t} \mid z_{t},\ h_{t})$ and critic $V_{\psi}(z_{t},\ h_{t})$ are trained with the behavior learning algorithm of Dreamer v3,

where the reward for behavior learning at each step combines both the extrinsic reward $r_{t}$ and the intrinsic reward $r_{t}^{\mathrm{int}}$
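A small sketch of how the combined reward could enter DreamerV3-style bootstrapped $\lambda$-returns over an imagined rollout; the simple sum $r_{t} + r_{t}^{\mathrm{int}}$, the discount, and $\lambda$ are assumptions for illustration.

```python
import numpy as np

def lambda_returns(r_ext, r_int, values, continues, gamma=0.997, lam=0.95):
    """values has length T+1 (includes the bootstrap value after the last reward);
    gamma, lam, and the unweighted sum r_ext + r_int are assumed, not the paper's settings."""
    rewards = r_ext + r_int                       # per-step reward for behavior learning
    T = len(rewards)
    returns = np.zeros(T)
    next_ret = values[T]                          # bootstrap from the final value estimate
    for t in reversed(range(T)):
        next_ret = rewards[t] + gamma * continues[t] * ((1 - lam) * values[t + 1] + lam * next_ret)
        returns[t] = next_ret
    return returns
```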

