PlaNet
Recurrent State Space Model
| Model | Type | Definition | Distribution Family |
| --- | --- | --- | --- |
| Transition | Generation | $s_t \sim p(s_t \mid s_{t-1}, a_{t-1})$ | $\mathcal{N}\big(\mu(s_{t-1}, a_{t-1}),\ \operatorname{diag}\,\sigma^2(s_{t-1}, a_{t-1})\big)$ |
| Observation | Generation | $o_t \sim p(o_t \mid s_t)$ | $\mathcal{N}\big(\mu(s_t),\ I\big)$ |
| Reward | Generation | $r_t \sim p(r_t \mid s_t)$ | $\mathcal{N}\big(\mu(s_t),\ 1\big)$ |
| Posterior | Inference | $s_t \sim q(s_t \mid s_{t-1}, a_{t-1}, o_t)$ | $\mathcal{N}\big(\mu(s_{t-1}, a_{t-1}, o_t),\ \operatorname{diag}\,\sigma^2(s_{t-1}, a_{t-1}, o_t)\big)$ |
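As a concrete reference for the table, the sketch below implements the four components as diagonal-Gaussian heads in PyTorch. The class `GaussianHead`, the layer sizes, and the dimensions are illustrative assumptions, not the architecture from the paper (which, for example, uses a convolutional decoder for image observations).

```python
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal, Independent

class GaussianHead(nn.Module):
    """MLP that maps its input to a diagonal Gaussian N(mu, diag(sigma^2))."""
    def __init__(self, in_dim, out_dim, hidden=200):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ELU(),
                                 nn.Linear(hidden, 2 * out_dim))
    def forward(self, x):
        mu, pre_sigma = self.net(x).chunk(2, dim=-1)
        sigma = F.softplus(pre_sigma) + 1e-4        # keep the std strictly positive
        return Independent(Normal(mu, sigma), 1)    # treat the last dim as one event

state_dim, action_dim, obs_dim = 30, 2, 64          # illustrative sizes

# Transition (generation):  s_t ~ p(s_t | s_{t-1}, a_{t-1})
transition = GaussianHead(state_dim + action_dim, state_dim)
# Posterior (inference):    s_t ~ q(s_t | s_{t-1}, a_{t-1}, o_t)
posterior = GaussianHead(state_dim + action_dim + obs_dim, state_dim)
# Observation model:        o_t ~ N(mu(s_t), I); only the mean is learned
observation_mean = nn.Linear(state_dim, obs_dim)
# Reward model:             r_t ~ N(mu(s_t), 1); scalar mean, unit variance
reward_mean = nn.Linear(state_dim, 1)
```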
Given the action sequence, the POMDP can be treated as a non-stationary Markov process.
All components are trained jointly to maximize a variational lower bound (ELBO) instead of the intractable log-likelihood:
$$
\begin{aligned}
\ln p(o_{1:T}, r_{1:T} \mid a_{1:T})
&= \ln \sum_{s_{1:T}} p(o_{1:T}, r_{1:T}, s_{1:T} \mid a_{1:T}) \\
&= \ln \sum_{s_{1:T}} p(o_{1:T}, r_{1:T} \mid s_{1:T}, a_{1:T})\, p(s_{1:T} \mid a_{1:T})\,
   \frac{q(s_{1:T} \mid o_{1:T}, a_{1:T})}{q(s_{1:T} \mid o_{1:T}, a_{1:T})} \\
&= \ln \mathbb{E}_{s_1 \sim q(\cdot \mid o_1)}\, \mathbb{E}_{s_2 \sim q(\cdot \mid s_1, a_1, o_2)} \cdots
   \mathbb{E}_{s_T \sim q(\cdot \mid s_{T-1}, a_{T-1}, o_T)}
   \left[ \prod_{t=1}^{T} \frac{p(o_t \mid s_t)\, p(r_t \mid s_t)\, p(s_t \mid s_{t-1}, a_{t-1})}{q(s_t \mid s_{t-1}, a_{t-1}, o_t)} \right] \\
&\ge \mathbb{E}_{s_1} \mathbb{E}_{s_2} \cdots \mathbb{E}_{s_T}
   \left[ \sum_{t=1}^{T} \ln p(o_t \mid s_t) + \ln p(r_t \mid s_t) + \ln p(s_t \mid s_{t-1}, a_{t-1}) - \ln q(s_t \mid s_{t-1}, a_{t-1}, o_t) \right] \\
&= \sum_{t=1}^{T} \Big( \mathbb{E}_{s_1} \cdots \mathbb{E}_{s_t} \big[ \ln p(o_t \mid s_t) + \ln p(r_t \mid s_t) \big]
   - \mathbb{E}_{s_1} \cdots \mathbb{E}_{s_t \sim q(\cdot \mid s_{t-1}, a_{t-1}, o_t)}
     \ln \frac{q(s_t \mid s_{t-1}, a_{t-1}, o_t)}{p(s_t \mid s_{t-1}, a_{t-1})} \Big) \\
&= \sum_{t=1}^{T} \Big( \mathbb{E}_{s_1} \cdots \mathbb{E}_{s_t} \big[ \ln p(o_t \mid s_t) + \ln p(r_t \mid s_t) \big]
   - \mathbb{E}_{s_1} \cdots \mathbb{E}_{s_{t-1}}
     D_{\mathrm{KL}}\big( q(\cdot \mid s_{t-1}, a_{t-1}, o_t) \,\big\|\, p(\cdot \mid s_{t-1}, a_{t-1}) \big) \Big)
\end{aligned}
$$
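Each summand of this bound is a reconstruction term minus a KL term between posterior and prior, which maps directly onto `torch.distributions`. The sketch below is only meant to show the shapes; `elbo_step` and the toy distributions are hypothetical stand-ins, not the paper's implementation.

```python
import torch
from torch.distributions import Normal, Independent, kl_divergence

def elbo_step(prior, post, obs_dist, rew_dist, obs, rew):
    """One summand of the bound: ln p(o_t|s_t) + ln p(r_t|s_t) - KL(q || p)."""
    recon = obs_dist.log_prob(obs) + rew_dist.log_prob(rew)   # reconstruction terms
    kl = kl_divergence(post, prior)                           # complexity term
    return recon - kl                                         # maximize, summed over t

# Toy distributions just to show the shapes (batch 4, 30-d state, 64-d observation).
B, S, O = 4, 30, 64
prior = Independent(Normal(torch.zeros(B, S), torch.ones(B, S)), 1)
post  = Independent(Normal(torch.randn(B, S), torch.ones(B, S)), 1)
s_t = post.rsample()                                          # reparameterized sample of s_t
obs_dist = Independent(Normal(s_t.new_zeros(B, O), torch.ones(B, O)), 1)  # stand-in decoder
rew_dist = Normal(s_t.mean(dim=-1), torch.ones(B))            # stand-in reward mean
bound_t = elbo_step(prior, post, obs_dist, rew_dist, torch.zeros(B, O), torch.zeros(B))
print(bound_t.shape)   # one bound value per sequence in the batch
```

Summing `bound_t` over all time steps and averaging over the batch gives the training objective to maximize.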
Replacing the posterior with one that conditions on all past observations and actions, $q(s_t \mid o_{\le t}, a_{<t})$, the objective can be rewritten as
$$
\begin{aligned}
\ln p(o_{1:T}, r_{1:T} \mid a_{1:T})
&= \ln \sum_{s_{1:T}} p(o_{1:T}, r_{1:T}, s_{1:T} \mid a_{1:T})
 = \ln \sum_{s_{1:T}} p(o_{1:T}, r_{1:T} \mid s_{1:T}, a_{1:T})\, p(s_{1:T} \mid a_{1:T})
   \prod_{t=1}^{T} \frac{q(s_t \mid o_{\le t}, a_{<t})}{q(s_t \mid o_{\le t}, a_{<t})} \\
&= \ln \mathbb{E}_{s_1 \sim q(\cdot \mid o_1)}\, \mathbb{E}_{s_2 \sim q(\cdot \mid o_1, o_2, a_1)} \cdots
   \mathbb{E}_{s_T \sim q(\cdot \mid o_{1:T}, a_{1:T-1})}
   \left[ \prod_{t=1}^{T} \frac{p(o_t \mid s_t)\, p(r_t \mid s_t)\, p(s_t \mid s_{t-1}, a_{t-1})}{q(s_t \mid o_{\le t}, a_{<t})} \right] \\
&\ge \mathbb{E}_{s_1} \mathbb{E}_{s_2} \cdots \mathbb{E}_{s_T}
   \left[ \sum_{t=1}^{T} \ln p(o_t \mid s_t) + \ln p(r_t \mid s_t) + \ln p(s_t \mid s_{t-1}, a_{t-1}) - \ln q(s_t \mid o_{\le t}, a_{<t}) \right] \\
&= \sum_{t=1}^{T} \Big( \mathbb{E}_{s_1} \cdots \mathbb{E}_{s_t} \big[ \ln p(o_t \mid s_t) + \ln p(r_t \mid s_t) \big]
   - \mathbb{E}_{s_1} \cdots \mathbb{E}_{s_{t-1}}
     D_{\mathrm{KL}}\big( q(\cdot \mid o_{\le t}, a_{<t}) \,\big\|\, p(\cdot \mid s_{t-1}, a_{t-1}) \big) \Big)
\end{aligned}
$$
The parameters of the probabilistic model can be optimized with reparameterized sampling and gradient descent.
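A minimal sketch of that reparameterization, assuming diagonal Gaussians: the sampled state is written as a deterministic function of the distribution parameters plus external noise, so gradients of the loss (a stand-in for the negative ELBO here) reach $\mu$ and $\sigma$.

```python
import torch

# Reparameterization trick: s = mu + sigma * eps with eps ~ N(0, I), so a sampled
# state is a differentiable function of (mu, sigma) and the ELBO gradient can flow
# back into the networks that produced them.
mu    = torch.randn(4, 30, requires_grad=True)
sigma = torch.full((4, 30), 0.5, requires_grad=True)
eps = torch.randn_like(mu)          # noise drawn outside the computation graph
s   = mu + sigma * eps              # same as Normal(mu, sigma).rsample()

loss = (s ** 2).mean()              # stand-in for the negative ELBO
loss.backward()                     # gradients reach both mu and sigma
print(mu.grad.norm(), sigma.grad.norm())
```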
| | RNN | SSM | RSSM |
| --- | --- | --- | --- |
| Generation | $h_t = f(h_{t-1}, a_{t-1})$<br>$o_t \sim p(o_t \mid h_t)$<br>$r_t \sim p(r_t \mid h_t)$ | $s_t \sim p(s_t \mid s_{t-1}, a_{t-1})$<br>$o_t \sim p(o_t \mid s_t)$<br>$r_t \sim p(r_t \mid s_t)$ | $h_t = f(h_{t-1}, s_{t-1}, a_{t-1})$<br>$s_t \sim p(s_t \mid h_t)$<br>$o_t \sim p(o_t \mid h_t, s_t)$<br>$r_t \sim p(r_t \mid h_t, s_t)$ |
| Inference | | $s_t \sim q(s_t \mid s_{t-1}, a_{t-1}, o_t)$<br>$s_t \sim q(s_t \mid s_{t-1}, a_{t-1}, r_t)$ | $s_t \sim q(s_t \mid h_t, o_t)$<br>$s_t \sim q(s_t \mid h_t, r_t)$ |
Transitions in the state-space model (SSM) are purely stochastic, which makes it difficult for the model to remember information over multiple time steps. The RSSM therefore combines the RNN and the SSM, splitting the state into a deterministic part $h_t$ and a stochastic part $s_t$, as sketched below.
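A possible way to write one such step in PyTorch, assuming a GRU cell for the deterministic path and linear heads for the prior and posterior; `RSSMCell` and its sizes are illustrative, not the exact architecture of the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal, Independent

class RSSMCell(nn.Module):
    """One RSSM step: a deterministic GRU path h_t plus a stochastic state s_t."""
    def __init__(self, state_dim=30, action_dim=2, hidden_dim=200, embed_dim=64):
        super().__init__()
        self.rnn = nn.GRUCell(state_dim + action_dim, hidden_dim)          # h_t = f(h_{t-1}, s_{t-1}, a_{t-1})
        self.prior_head = nn.Linear(hidden_dim, 2 * state_dim)             # p(s_t | h_t)
        self.post_head = nn.Linear(hidden_dim + embed_dim, 2 * state_dim)  # q(s_t | h_t, o_t)

    @staticmethod
    def _gaussian(params):
        mu, pre_sigma = params.chunk(2, dim=-1)
        return Independent(Normal(mu, F.softplus(pre_sigma) + 1e-4), 1)

    def forward(self, h_prev, s_prev, action, obs_embed=None):
        # Deterministic path carries information reliably across time steps.
        h = self.rnn(torch.cat([s_prev, action], dim=-1), h_prev)
        # Stochastic prior over s_t from the deterministic state alone.
        prior = self._gaussian(self.prior_head(h))
        # Filtering posterior also sees the embedded observation; it is absent
        # when the model is rolled forward in imagination (planning).
        post = None
        if obs_embed is not None:
            post = self._gaussian(self.post_head(torch.cat([h, obs_embed], dim=-1)))
        return h, prior, post
```

During training, $s_t$ is drawn from `post` with `rsample()` and $D_{\mathrm{KL}}(\text{post} \,\|\, \text{prior})$ enters the loss; during planning, `obs_embed` is omitted and $s_t$ is drawn from `prior`.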
Latent Overshooting
Because of limited model capacity and the restricted distributional family, the model obtained by training only on one-step predictions until convergence is, in general, not the model that is best at multi-step predictions.
| | Standard | Observation Overshooting | Latent Overshooting |
| --- | --- | --- | --- |
| Description | One-Step Prediction | Multi-Step Reconstruction | Multi-Step Prior Prediction |
Generalizing the standard latent variational lower bound from one-step to $d$-step prior predictions, i.e. replacing the one-step transition with the $d$-step prediction distribution $p_d$, gives
$$
\begin{aligned}
\ln p_d(o_{1:T} \mid a_{1:T})
&= \ln \sum_{s_{1:T}} p(o_{1:T} \mid s_{1:T})\, p(s_{1:T} \mid a_{1:T})
 = \ln \sum_{s_{1:T}} \prod_{t=1}^{T} p(o_t \mid s_t)\, p(s_t \mid s_{t-d}, a_{t-d:t-1}) \\
&= \ln \sum_{s_{1:T}} \prod_{t=1}^{T} p(o_t \mid s_t)
   \left( \sum_{s_{t-d+1}} \cdots \sum_{s_{t-1}} \prod_{\tau=t-d+1}^{t} p(s_\tau \mid s_{\tau-1}, a_{\tau-1}) \right)
   \frac{q(s_t \mid o_{\le t}, a_{<t})}{q(s_t \mid o_{\le t}, a_{<t})} \\
&= \ln \mathbb{E}_{s_1 \sim q(\cdot \mid o_1)}\, \mathbb{E}_{s_2 \sim q(\cdot \mid o_1, o_2, a_1)} \cdots
   \mathbb{E}_{s_T \sim q(\cdot \mid o_{\le T}, a_{<T})}
   \left[ \prod_{t=1}^{T} p(o_t \mid s_t)\, \sum_{s_{t-1}} p(s_{t-1} \mid s_{t-d}, a_{t-d:t-2})\,
   \frac{p(s_t \mid s_{t-1}, a_{t-1})}{q(s_t \mid o_{\le t}, a_{<t})} \right] \\
&\ge \mathbb{E}_{s_1} \mathbb{E}_{s_2} \cdots \mathbb{E}_{s_T}
   \left[ \sum_{t=1}^{T} \ln p(o_t \mid s_t)
   + \ln \mathbb{E}_{s_{t-1} \sim p(\cdot \mid s_{t-d}, a_{t-d:t-2})}
     \frac{p(s_t \mid s_{t-1}, a_{t-1})}{q(s_t \mid o_{\le t}, a_{<t})} \right] \\
&= \sum_{t=1}^{T} \Big( \mathbb{E}_{s_t \sim q(\cdot \mid o_{\le t}, a_{<t})} \ln p(o_t \mid s_t)
   + \mathbb{E}_{s_{t-d} \sim q(\cdot \mid o_{\le t-d}, a_{<t-d})}\,
     \mathbb{E}_{s_t \sim q(\cdot \mid o_{\le t}, a_{<t})}
     \ln \mathbb{E}_{s_{t-1} \sim p(\cdot \mid s_{t-d}, a_{t-d:t-2})}
     \frac{p(s_t \mid s_{t-1}, a_{t-1})}{q(s_t \mid o_{\le t}, a_{<t})} \Big) \\
&\ge \sum_{t=1}^{T} \Big( \mathbb{E}_{s_t \sim q(\cdot \mid o_{\le t}, a_{<t})} \ln p(o_t \mid s_t)
   - \mathbb{E}_{s_{t-d} \sim q(\cdot \mid o_{\le t-d}, a_{<t-d})}\,
     \mathbb{E}_{s_{t-1} \sim p(\cdot \mid s_{t-d}, a_{t-d:t-2})}\,
     \mathbb{E}_{s_t \sim q(\cdot \mid o_{\le t}, a_{<t})}
     \ln \frac{q(s_t \mid o_{\le t}, a_{<t})}{p(s_t \mid s_{t-1}, a_{t-1})} \Big) \\
&= \sum_{t=1}^{T} \Big( \mathbb{E}_{s_t \sim q(\cdot \mid o_{\le t}, a_{<t})} \ln p(o_t \mid s_t)
   - \mathbb{E}_{s_{t-d} \sim q(\cdot \mid o_{\le t-d}, a_{<t-d})}\,
     \mathbb{E}_{s_{t-1} \sim p(\cdot \mid s_{t-d}, a_{t-d:t-2})}
     D_{\mathrm{KL}}\big( q(\cdot \mid o_{\le t}, a_{<t}) \,\big\|\, p(\cdot \mid s_{t-1}, a_{t-1}) \big) \Big)
\end{aligned}
$$
The latent overshooting objective trains the model on multi-step predictions of all distances $1 \le d \le D$:
$$
\frac{1}{D} \sum_{d=1}^{D} \ln p_d(o_{1:T})
\ge \sum_{t=1}^{T} \left( \mathbb{E}_{s_t \sim q(\cdot \mid o_{\le t}, a_{<t})} \ln p(o_t \mid s_t)
- \frac{1}{D} \sum_{d=1}^{D} \beta_d\,
  \mathbb{E}_{s_{t-d} \sim q(\cdot \mid o_{\le t-d}, a_{<t-d})}\,
  \mathbb{E}_{s_{t-1} \sim p(\cdot \mid s_{t-d}, a_{t-d:t-2})}
  D_{\mathrm{KL}}\big( q(\cdot \mid o_{\le t}, a_{<t}) \,\big\|\, p(\cdot \mid s_{t-1}, a_{t-1}) \big) \right)
$$
where $\{\beta_d\}_{d=1}^{D}$ are weighting factors for the multi-step predictions, analogous to the $\beta$ in a $\beta$-VAE.
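One way the overshooting KL terms could be accumulated is sketched below, assuming a `transition` callable that returns the prior $p(s' \mid s, a)$ and a list of filtering posteriors; `latent_overshooting_kl` and its argument layout are hypothetical, and the reconstruction terms of the bound are omitted.

```python
from torch.distributions import kl_divergence

def latent_overshooting_kl(posteriors, actions, transition, D, betas):
    """Average the KL terms of the d-step bounds over all distances 1 <= d <= D.

    posteriors: list of T diagonal Gaussians q(s_t | o_{<=t}, a_{<t}) (0-indexed)
    actions:    tensor (T, batch, action_dim); actions[i] drives the step i -> i+1
    transition: callable (state, action) -> prior distribution p(s' | s, a)
    betas:      D weights for the multi-step KL terms (beta-VAE style)
    """
    T = len(posteriors)
    total = 0.0
    for d in range(1, D + 1):
        for t in range(d, T):
            # Start from a posterior sample d steps in the past ...
            s = posteriors[t - d].rsample()
            # ... roll the learned prior forward through the intermediate steps ...
            for tau in range(t - d, t - 1):
                s = transition(s, actions[tau]).rsample()
            # ... and penalize its mismatch with the posterior at step t.
            prior_t = transition(s, actions[t - 1])
            total = total + betas[d - 1] * kl_divergence(posteriors[t], prior_t).mean()
    return total / D
```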
Learning and Planning
PlaNet fits the model by maximizing the ELBO on the collected dataset.
With the learned generative model, a local action sequence can be optimized by model predictive control (MPC) with the cross-entropy method (CEM), using short-horizon rollouts that start from a state sampled from the current posterior (belief).
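A compact sketch of such a CEM planner inside an MPC loop, assuming a `model` callable that rolls the learned transition and reward model forward in latent space; `cem_plan` and its default hyperparameters are illustrative, not the settings from the paper.

```python
import torch

def cem_plan(belief_state, model, horizon=12, candidates=1000, top_k=100,
             iterations=10, action_dim=2):
    """Cross-entropy-method planner used inside an MPC loop.

    belief_state: latent state sampled from the current posterior, shape (state_dim,)
    model:        callable (states, actions) -> (next_states, rewards), rolling the
                  learned transition and reward model forward without observations
    Returns the first action of the refined plan; planning is repeated every step.
    """
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(iterations):
        # Sample candidate action sequences from the current plan distribution.
        actions = mean + std * torch.randn(candidates, horizon, action_dim)
        # Evaluate each candidate by a short rollout of the learned model.
        states = belief_state.unsqueeze(0).expand(candidates, -1)
        returns = torch.zeros(candidates)
        for t in range(horizon):
            states, rewards = model(states, actions[:, t])
            returns = returns + rewards
        # Refit the plan distribution to the best-performing sequences.
        elite = actions[returns.topk(top_k).indices]
        mean, std = elite.mean(dim=0), elite.std(dim=0)
    return mean[0]
```

Only the first action of the plan is executed in the environment; the belief is then updated with the new observation and planning is repeated at the next step.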