IRIS

World Model Learning

The world model is composed of a discrete autoencoder $(E,\ D)$ that learns a representation of observations and a GPT-like autoregressive Transformer $G$ that captures the environment dynamics.

| Component | Type | Definition | Distribution Family |
| --- | --- | --- | --- |
| Observation Encoder | Representation | $E : \mathbb{R}^{h \times w \times 3} \mapsto \{ 1,\ 2,\ \cdots,\ N \}^{K}$ | Deterministic |
| Observation Decoder | Representation | $D : \{ 1,\ 2,\ \cdots,\ N \}^{K} \mapsto \mathbb{R}^{h \times w \times 3}$ | Deterministic |
| Transition Predictor | Dynamics | $z_{t + 1} \sim p_{G}(z_{t + 1} \mid z_{\le t},\ a_{\le t})$ <br> $z_{t + 1}^{k} \sim p_{G}(z_{t + 1}^{k} \mid z_{\le t},\ a_{\le t},\ z_{t + 1}^{< k})$ | Categorical |
| Reward Predictor | Dynamics | $r_{t} \sim p_{G}(r_{t} \mid z_{\le t},\ a_{\le t})$ | Categorical / Deterministic |
| Termination Predictor | Dynamics | $d_{t} \sim p_{G}(d_{t} \mid z_{\le t},\ a_{\le t})$ | Bernoulli |
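
The table can be read as the following set of component signatures. This is a minimal PyTorch-style sketch under assumed names and shapes, not the reference IRIS API.

```python
from typing import Tuple

import torch


class WorldModel:
    """Signatures of the three components in the table above (illustrative only)."""

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        """E: map a batch of frames (B, 3, h, w) to K discrete tokens (B, K)."""
        raise NotImplementedError

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        """D: map K tokens (B, K) back to reconstructed frames (B, 3, h, w)."""
        raise NotImplementedError

    def predict(self, z_hist: torch.Tensor, a_hist: torch.Tensor
                ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """G: given token / action histories, return categorical logits over the
        next frame's tokens (B, K, N), a reward prediction (B,), and Bernoulli
        termination logits (B,)."""
        raise NotImplementedError
```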

Representation

The representation state $z_{t}$ consists of $K$ tokens from a vocabulary of size $N$. The encoder $E$ first produces a group of vectors $z_{e}(x_{t}) \in \mathbb{R}^{K \times d}$, and then obtains the output tokens through a codebook $\mathcal{E} = \{e_{i} \in \mathbb{R}^{d}\}_{i = 1}^{N}$:

$$
z_{t} = (z_{t}^{1},\ z_{t}^{2},\ \cdots,\ z_{t}^{K}) = \left[ \argmin_{i} \Big\| z_{e}^{k}(x_{t}) - e_{i} \Big\|_{2} \right]_{k = 1}^{K}
$$
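
Concretely, this lookup is a nearest-neighbour search over the codebook. Below is a minimal PyTorch sketch in which `z_e` and `codebook` stand for the encoder output $z_{e}(x_{t})$ and the embedding table $\mathcal{E}$; names and shapes are assumptions.

```python
import torch


def tokenize(z_e: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """z_e: (B, K, d) encoder outputs; codebook: (N, d) embeddings -> (B, K) token ids."""
    # Squared L2 distance between every encoder vector and every codebook entry: (B, K, N).
    dists = ((z_e.unsqueeze(2) - codebook.view(1, 1, *codebook.shape)) ** 2).sum(-1)
    # z_t^k = argmin_i || z_e^k(x_t) - e_i ||_2 (the argmin is unchanged by squaring).
    return dists.argmin(dim=-1)
```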

The discrete autoencoder $(E,\ D)$ is trained to maximize the ELBO of the log-likelihood:

$$
\ln p(x) = \ln \mathbb{E}_{z \sim q(z \mid x)} \left[ p(x \mid z) \frac{\prod_{k = 1}^{K} p(z^{k})}{\prod_{k = 1}^{K} q(z^{k} \mid x)} \right] \ge \mathbb{E}_{z \sim q(z \mid x)} \ln p(x \mid z) - \sum_{k = 1}^{K} D_{\mathrm{KL}} \Big( q(z^{k} \mid x)\ \|\ p(z^{k}) \Big)
$$

where the posterior $q(z^{k} \mid x)$ is a one-hot distribution and the prior $p(z^{k})$ is assumed to be uniform. Since each one-hot posterior has zero entropy, every KL term equals $\ln N$, so the KL penalty reduces to a constant:

$$
\ln p(x) \ge \ln p(x \mid z_{q}(x)) - \underset{\mathrm{const}}{\underbrace{K \ln N}} \qquad \mathrm{s.t.} \quad z_{q}^{k}(x) = e_{i} \quad i = \argmin_{j} \Big\| z_{e}^{k}(x) - e_{j} \Big\|_{2}
$$

where $z_{q}(x)$ is computed in practice as $z_{q}^{k}(x) = z_{e}^{k}(x) + \operatorname{sg}(e_{i} - z_{e}^{k}(x))$, so that gradients pass straight through the non-differentiable quantization. The overall objective for $(E,\ D,\ \mathcal{E})$ combines the ELBO above with several additional terms:
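
A minimal sketch of this straight-through trick, continuing the assumed tensor names from the previous snippet: the forward pass uses the selected codebook vectors, while the backward pass copies gradients to $z_{e}$ unchanged.

```python
import torch


def quantize_st(z_e: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """z_e: (B, K, d); codebook: (N, d) -> straight-through quantized vectors (B, K, d)."""
    dists = ((z_e.unsqueeze(2) - codebook.view(1, 1, *codebook.shape)) ** 2).sum(-1)
    e = codebook[dists.argmin(dim=-1)]      # nearest codebook vectors e_i, shape (B, K, d)
    # z_q^k = z_e^k + sg(e_i - z_e^k): equal to e_i in the forward pass,
    # but the gradient w.r.t. z_q flows entirely into z_e.
    return z_e + (e - z_e).detach()
```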

| Loss | Definition | Target |
| --- | --- | --- |
| Reconstruction Loss | $\mathcal{L}_{\mathrm{rec}} = -\log p(x \mid z_{q}(x)) \Rightarrow \Vert x - D(z) \Vert_{1}$ | encoder + decoder |
| Codebook Loss | $\mathcal{L}_{\mathrm{code}} = \sum_{k = 1}^{K} \Big\Vert \operatorname{sg}(z_{e}^{k}(x)) - \mathcal{E}(z^{k}) \Big\Vert_{2}^{2}$ | codebook |
| Commitment Loss | $\mathcal{L}_{\mathrm{com}} = \sum_{k = 1}^{K} \Big\Vert z_{e}^{k}(x) - \operatorname{sg}(\mathcal{E}(z^{k})) \Big\Vert_{2}^{2}$ | encoder |
| Perceptual Loss | $\mathcal{L}_{\mathrm{perceptual}}(x,\ D(z))$ | encoder + decoder |
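
The four terms could be combined as sketched below. This is only an illustration under assumed names: the perceptual term is left as a placeholder since its feature network (e.g. an LPIPS-style loss) and the relative weighting of the terms are implementation choices.

```python
import torch
import torch.nn.functional as F


def autoencoder_loss(x, x_rec, z_e, e, perceptual_fn, beta: float = 1.0):
    """x, x_rec: images; z_e: encoder outputs (B, K, d); e: selected codebook vectors (B, K, d)."""
    l_rec = (x - x_rec).abs().mean()              # L1 reconstruction loss
    l_code = F.mse_loss(e, z_e.detach())          # moves the codebook toward sg(z_e)
    l_commit = F.mse_loss(z_e, e.detach())        # commits the encoder to sg(e)
    l_perceptual = perceptual_fn(x, x_rec)        # perceptual loss (placeholder)
    return l_rec + l_code + beta * l_commit + l_perceptual
```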

Dynamics

The autoregressive Transformer $G$ is trained in a self-supervised manner on segments sampled from past experience to minimize the difference between its predictions and the ground truth. The overall objective includes:

| Loss | Target |
| --- | --- |
| Cross-Entropy Loss | Transition Predictor |
| Cross-Entropy Loss / MSE Loss | Reward Predictor |
| Cross-Entropy Loss | Termination Predictor |
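
A minimal sketch of these three terms, assuming $G$ outputs per-step logits for the next tokens, discretized reward bins, and termination, with targets taken from the sampled segments; names, shapes, and the equal weighting are assumptions.

```python
import torch
import torch.nn.functional as F


def dynamics_loss(token_logits, reward_logits, done_logits,
                  target_tokens, target_reward_bins, target_dones):
    """token_logits: (B, T, K, N); reward_logits: (B, T, R); done_logits: (B, T)."""
    l_trans = F.cross_entropy(token_logits.flatten(0, 2), target_tokens.flatten())
    l_reward = F.cross_entropy(reward_logits.flatten(0, 1), target_reward_bins.flatten())
    l_done = F.binary_cross_entropy_with_logits(done_logits, target_dones.float())
    return l_trans + l_reward + l_done
```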

Behavior Learning

Following Dreamer, the critic network $v(x_{t})$ is optimized with the $\lambda$-return, which is recursively defined as:

$$
\Lambda_{t} = \begin{cases} r_{t} + \gamma (1 - d_{t}) \Big[ (1 - \lambda) v(x_{t + 1}) + \lambda \Lambda_{t + 1} \Big] & t < H \\ v(x_{H}) & t = H \end{cases}
$$
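
A minimal sketch of this recursion, computed backward over an imagined trajectory of horizon $H$; tensor names and the default $\gamma$, $\lambda$ values are assumptions.

```python
import torch


def lambda_returns(rewards, dones, values, gamma: float = 0.995, lam: float = 0.95):
    """rewards, dones: (H,); values: (H + 1,) with values[H] = v(x_H) -> (H,) λ-returns."""
    H = rewards.shape[0]
    returns = torch.empty(H + 1)
    returns[H] = values[H]                                    # Λ_H = v(x_H)
    for t in reversed(range(H)):                              # t = H-1, ..., 0
        bootstrap = (1 - lam) * values[t + 1] + lam * returns[t + 1]
        returns[t] = rewards[t] + gamma * (1 - dones[t]) * bootstrap
    return returns[:H]                                        # Λ_0, ..., Λ_{H-1}
```

The critic is then regressed toward these targets, as in the loss below.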

$$
\mathcal{L}_{\mathrm{critic}} = \mathbb{E}_{\pi} \left[ \sum_{t = 0}^{H - 1} \Big( v(x_{t}) - \operatorname{sg}(\Lambda_{t}) \Big)^{2} \right]
$$

The actor network $\pi(a_{t} \mid x_{\le t})$ is trained to minimize the REINFORCE objective over imagined trajectories:

$$
\mathcal{L}_{\pi} = - \mathbb{E}_{\pi} \left[ \sum_{t = 0}^{H - 1} \ln \pi(a_{t} \mid x_{\le t}) \operatorname{sg}(\Lambda_{t} - v(x_{t})) + \eta \mathcal{H}(\pi \mid x_{\le t}) \right]
$$
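
Putting the two objectives together, here is a minimal sketch given precomputed λ-returns; `log_probs` and `entropies` are assumed to come from the policy's action distribution at each imagined step, and the default entropy weight $\eta$ is an assumption.

```python
import torch


def actor_critic_losses(values, lam_returns, log_probs, entropies, eta: float = 0.001):
    """All inputs are tensors of shape (H,); returns (critic_loss, actor_loss)."""
    critic_loss = (values - lam_returns.detach()).pow(2).mean()      # (v(x_t) - sg(Λ_t))^2
    advantage = (lam_returns - values).detach()                      # sg(Λ_t - v(x_t))
    actor_loss = -(log_probs * advantage + eta * entropies).mean()   # REINFORCE + entropy bonus
    return critic_loss, actor_loss
```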


IRIS
http://example.com/2024/09/04/IRIS/
Author
木辛
Posted on
September 4, 2024
Licensed under