TD-MPC

TD-MPC Series

TD-MPC v1

Model Predictive Path Integral (MPPI)

TD-MPC adopts MPPI as its inference (planning) algorithm, where the action trajectory is sampled from time-dependent multivariate Gaussians with diagonal covariance over a horizon of length H

\Big\{ \mathcal{N}(\mu_{\tau},\ \sigma_{\tau}^{2}) \Big\}_{\tau = 0}^{H} \Leftarrow \mu_{\tau},\ \sigma_{\tau} \in \mathbb{R}^{|\mathcal{A}|}

Sample N trajectories independently using rollouts generated by the learned environment model, and estimate the total return of each trajectory with the learned value function

\begin{aligned} \phi(a_{t : t + H}) &= \mathcal{E}_{z_{t + 1}} \mathcal{E}_{z_{t + 2}} \cdots \mathcal{E}_{z_{t + H}} \left[ \gamma^{H} Q_{\theta}(z_{t + H},\ a_{t + H}) + \sum_{\tau = 0}^{H - 1} \gamma^{\tau} R_{\theta}(z_{t + \tau},\ a_{t + \tau}) \right] \\[7mm] &= \gamma^{H} Q_{\theta}(z_{t + H},\ a_{t + H}) + \sum_{\tau = 0}^{H - 1} \gamma^{\tau} R_{\theta}(z_{t + \tau},\ a_{t + \tau}) \Leftarrow z_{t} = h_{\theta}(s_{t});\ z_{t + 1} = d_{\theta}(z_{t},\ a_{t}) \end{aligned}

The parameters of the sampling distribution are updated using the action trajectories with the top-k returns

\mu_{\tau} \leftarrow \frac{\sum_{i = 1}^{k} \Omega_{i} a_{\tau}^{(i)}}{\sum_{i = 1}^{k} \Omega_{i}} \qquad \sigma_{\tau} \leftarrow \sqrt{\frac{\sum_{i = 1}^{k} \Omega_{i} (a_{\tau}^{(i)} - \mu_{\tau})^{2}}{\sum_{i = 1}^{k} \Omega_{i}}}

where trajectories are weighted by their corresponding returns as \Omega_{i} = \exp (\kappa \phi_{i}) = \exp \left( \kappa \phi(a_{t : t + H}^{(i)}) \right), and \kappa is a temperature parameter controlling the “sharpness” of the weighting.

After a fixed number of iterations J, the action to take at the current decision step t is sampled from \mathcal{N}(\mu_{0},\ \sigma_{0}^{2}).
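Below is a minimal NumPy sketch of one MPPI iteration under these update rules; `dynamics`, `reward`, and `value` stand in for the learned d_θ, R_θ, Q_θ, and the sample count, top-k size, and temperature are placeholder values rather than the paper's settings.

```python
import numpy as np

def mppi_iteration(mu, sigma, z0, dynamics, reward, value,
                   num_samples=512, top_k=64, kappa=0.5, gamma=0.99):
    """One MPPI update: sample N action sequences from the time-dependent
    Gaussian, score them with the learned model, and refit (mu, sigma)
    to the top-k sequences."""
    H1, A = mu.shape                     # mu/sigma parameterize actions a_t .. a_{t+H}
    H = H1 - 1
    actions = mu + sigma * np.random.randn(num_samples, H1, A)

    # Roll out the latent model and accumulate discounted rewards.
    z = np.repeat(z0[None], num_samples, axis=0)
    returns = np.zeros(num_samples)
    for t in range(H):
        returns += gamma ** t * reward(z, actions[:, t])
        z = dynamics(z, actions[:, t])
    returns += gamma ** H * value(z, actions[:, H])      # terminal value bootstrap

    # Keep the top-k trajectories, weighted by exponentiated return.
    idx = np.argsort(returns)[-top_k:]
    elite = actions[idx]
    omega = np.exp(kappa * (returns[idx] - returns[idx].max()))  # shift for numerical stability
    omega /= omega.sum()

    new_mu = (omega[:, None, None] * elite).sum(axis=0)
    new_sigma = np.sqrt((omega[:, None, None] * (elite - new_mu) ** 2).sum(axis=0))
    return new_mu, new_sigma
```

Running the planner then amounts to repeating this update for J iterations and sampling the first action from \mathcal{N}(\mu_{0},\ \sigma_{0}^{2}).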

Parameter Initialization

To reduce the number of iterations required for convergence, TD-MPC warm-starts planning with the 1-step shifted mean \mu_{\tau} obtained at the previous decision step, but always uses a large initial variance to avoid local minima.
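A sketch of this warm start, assuming a fixed `init_std` for the reset standard deviation (the actual value is an implementation hyperparameter):

```python
import numpy as np

def init_plan_params(prev_mu, horizon, action_dim, init_std=0.5):
    """Shift the previous step's mean forward by one step and reset the
    standard deviation to a large value before planning."""
    mu = np.zeros((horizon + 1, action_dim))
    if prev_mu is not None:
        mu[:-1] = prev_mu[1:]                              # reuse the 1-step shifted mean
    sigma = np.full((horizon + 1, action_dim), init_std)   # large initial std
    return mu, sigma
```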

Exploration by Planning

To promote consistent exploration, TD-MPC constrains the standard deviation update as

\sigma_{\tau} \leftarrow \max \left( \sqrt{\frac{\sum_{i = 1}^{k} \Omega_{i} (a_{\tau}^{(i)} - \mu_{\tau})^{2}}{\sum_{i = 1}^{k} \Omega_{i}}},\ \epsilon \right)

where \epsilon \in \mathbb{R}^{+} is a linearly decayed constant. Likewise, because the model is initially inaccurate, the planning horizon is increased linearly from 1 to H in the early stages of training.
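A sketch of the constrained update, where the linear decay schedule for \epsilon (endpoints and duration) is an assumption:

```python
import numpy as np

def constrain_std(sigma_new, step, decay_steps=25_000, eps_max=0.5, eps_min=0.05):
    """Lower-bound the refitted std by a linearly decayed epsilon so that
    exploration noise does not collapse early in training."""
    frac = min(step / decay_steps, 1.0)
    eps = eps_max + frac * (eps_min - eps_max)   # linearly decay from eps_max to eps_min
    return np.maximum(sigma_new, eps)
```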

Policy-guided Trajectory Optimization

TD-MPC augments the sampling procedure with N_{\pi} additional trajectories rolled out from the learned policy \pi_{\theta}, as sketched below.
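For illustration, policy-guided candidates can be generated by rolling the learned policy through the latent dynamics and concatenating them with the Gaussian samples before scoring; `policy` and `dynamics` are stand-ins for \pi_{\theta} and d_{\theta}:

```python
import numpy as np

def policy_rollouts(z0, policy, dynamics, horizon, n_pi):
    """Roll out N_pi action sequences from the learned policy in latent space."""
    z = np.repeat(z0[None], n_pi, axis=0)
    seq = []
    for _ in range(horizon + 1):
        a = policy(z)              # pi_theta(z), possibly with exploration noise
        seq.append(a)
        z = dynamics(z, a)
    return np.stack(seq, axis=1)   # (N_pi, H + 1, A), appended to the sampled candidates
```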

Task-Oriented Latent Dynamics (TOLD)

TD-MPC leverages the following components of the TOLD model during inference:

Components    Definition
representation    \hat{z}_{t} = h_{\theta}(s_{t})
latent dynamics    \hat{z}_{t}' = d_{\theta}(z_{t},\ a_{t})
reward    \hat{r}_{t} = R_{\theta}(z_{t},\ a_{t})
value    \hat{q}_{t} = Q_{\theta}(z_{t},\ a_{t})
policy    \hat{a}_{t} = \pi_{\theta}(z_{t})
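A minimal PyTorch sketch of these five components is given below; layer sizes, activations, a single Q-network, and the absence of action squashing are simplifying assumptions, not the actual architecture.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ELU(),
                         nn.Linear(hidden, hidden), nn.ELU(),
                         nn.Linear(hidden, out_dim))

class TOLD(nn.Module):
    """Representation, latent dynamics, reward, value, and policy heads."""
    def __init__(self, obs_dim, act_dim, latent_dim=50):
        super().__init__()
        self.h = mlp(obs_dim, latent_dim)               # z_t = h(s_t)
        self.d = mlp(latent_dim + act_dim, latent_dim)  # z_{t+1} = d(z_t, a_t)
        self.R = mlp(latent_dim + act_dim, 1)           # r_hat = R(z_t, a_t)
        self.Q = mlp(latent_dim + act_dim, 1)           # q_hat = Q(z_t, a_t)
        self.pi = mlp(latent_dim, act_dim)              # a_hat = pi(z_t)

    def next(self, z, a):
        za = torch.cat([z, a], dim=-1)
        return self.d(za), self.R(za).squeeze(-1)       # predicted next latent and reward
```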

The TOLD model is trained to minimize a temporally weighted objective

\min_{\theta} \mathcal{J}(\theta;\ \Gamma) = \sum_{\tau = t}^{t + H} \lambda^{\tau - t} \mathcal{L}(\theta;\ \Gamma_{\tau})

where \Gamma = \Big\{ (s_{\tau},\ a_{\tau},\ r_{\tau},\ s_{\tau + 1}) \Big\}_{\tau = t}^{t + H} \sim \mathcal{B} is a trajectory sampled from the replay buffer \mathcal{B}, which consists of interaction data collected by TD-MPC during planning. The single-step loss is made up of

\mathcal{L}(\theta;\ \Gamma_{\tau}) = c_{r} \mathcal{L}_{r}(\theta;\ \Gamma_{\tau}) + c_{v} \mathcal{L}_{v}(\theta;\ \Gamma_{\tau}) + c_{\pi} \mathcal{L}_{\pi}(\theta;\ \Gamma_{\tau}) + c_{c} \mathcal{L}_{c}(\theta;\ \Gamma_{\tau})

Error Type    Definition
reward prediction error    \mathcal{L}_{r}(\theta;\ \Gamma_{\tau}) = \Big[ R_{\theta}(z_{\tau},\ a_{\tau}) - r_{\tau} \Big]^{2}
TD error of value function    \mathcal{L}_{v}(\theta;\ \Gamma_{\tau}) = \Big[ Q_{\theta}(z_{\tau},\ a_{\tau}) - r_{\tau} - \gamma Q_{\theta^{-}} \Big( z_{\tau + 1},\ \operatorname{sg} \big[ \pi_{\theta}(z_{\tau + 1}) \big] \Big) \Big]^{2}
policy loss (frozen critic)    \mathcal{L}_{\pi}(\theta;\ \Gamma_{\tau}) = -Q_{\operatorname{sg}[\theta]}(z_{\tau},\ \pi_{\theta}(z_{\tau}))
latent state consistency loss    \mathcal{L}_{c}(\theta;\ \Gamma_{\tau}) = \Big\| d_{\theta}(z_{\tau},\ a_{\tau}) - h_{\theta^{-}}(s_{\tau + 1}) \Big\|_{2}^{2}

where \theta^{-} denotes the parameters of a target network, used to improve stability during training.
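Putting the pieces together, a sketch of the unrolled objective might look as follows (built on the `TOLD` sketch above; the loss coefficients are illustrative, `target` holds the \theta^{-} networks, and the parameter-freezing trick is one way to realize the stop-gradient on the critic in the policy term):

```python
import torch

def told_loss(model, target, s, a, r, horizon, gamma=0.99, lam=0.5,
              c_r=0.5, c_v=0.1, c_pi=1.0, c_c=2.0):
    """Temporally weighted TOLD objective over one sampled trajectory.
    s: (H+2, obs_dim), a: (H+1, act_dim), r: (H+1,) tensors."""
    z = model.h(s[0:1])
    total = 0.0
    for t in range(horizon + 1):
        za = torch.cat([z, a[t:t + 1]], dim=-1)
        q_pred = model.Q(za).squeeze(-1)
        r_pred = model.R(za).squeeze(-1)
        z_next = model.d(za)

        with torch.no_grad():                            # targets use theta^-
            z_next_tgt = target.h(s[t + 1:t + 2])
            a_next = model.pi(z_next_tgt)                # sg[pi_theta(z_{tau+1})]
            td_target = r[t] + gamma * target.Q(
                torch.cat([z_next_tgt, a_next], dim=-1)).squeeze(-1)

        loss_r = (r_pred - r[t]) ** 2                    # reward prediction error
        loss_v = (q_pred - td_target) ** 2               # TD error
        loss_c = ((z_next - z_next_tgt) ** 2).sum(-1)    # latent consistency

        # Policy term: freeze Q's parameters so -Q(z, pi(z)) only trains pi.
        for p in model.Q.parameters():
            p.requires_grad_(False)
        loss_pi = -model.Q(torch.cat([z.detach(), model.pi(z.detach())], dim=-1)).squeeze(-1)
        for p in model.Q.parameters():
            p.requires_grad_(True)

        total = total + (lam ** t) * (c_r * loss_r + c_v * loss_v
                                      + c_pi * loss_pi + c_c * loss_c)
        z = z_next                                       # continue the latent rollout
    return total.mean()
```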

TD-MPC v2

TD-MPC v2 uses a learnable task embedding e (constrained by \| e \|_{2} \le 1) to represent compact task semantics. For a new task, e can be initialized with the embedding of a semantically similar task and then fine-tuned.

Components    Definition
encoder    \hat{z}_{t} = h_{\theta}(s_{t},\ e)
latent dynamics    \hat{z}_{t}' = d_{\theta}(z_{t},\ a_{t},\ e)
reward (discrete)    \hat{r}_{t} = R_{\theta}(z_{t},\ a_{t},\ e)
terminal value (discrete)    \hat{q}_{t} = Q_{\theta}(z_{t},\ a_{t},\ e)
policy prior    \hat{a}_{t} = \pi_{\theta}(z_{t},\ e)
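A sketch of the learnable task embedding described above, with the norm constraint enforced by rescaling (the embedding dimension is a placeholder); initializing a new task then amounts to copying the row of a semantically similar task before fine-tuning:

```python
import torch
import torch.nn as nn

class TaskEmbedding(nn.Module):
    """Per-task embeddings e with ||e||_2 <= 1 enforced by rescaling."""
    def __init__(self, num_tasks, dim=96):
        super().__init__()
        self.emb = nn.Embedding(num_tasks, dim)

    def forward(self, task_id):
        e = self.emb(task_id)
        norm = e.norm(dim=-1, keepdim=True).clamp(min=1.0)
        return e / norm                      # rescale only when the norm exceeds 1

# Initializing a new task from a similar one (indices are hypothetical):
# emb.emb.weight.data[new_task_id] = emb.emb.weight.data[similar_task_id].clone()
```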

The latent representation is normalized by SimNorm, which projects z into L fixed-dimensional simplices via softmax

z^{\circ} = [g_{1},\ g_{2},\ \cdots,\ g_{L}] \quad g_{i} = \operatorname{softmax}_{\tau} (z_{i:i + V})

which can naturally bias the representation towards sparsity without enforcing hard constraints.
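A sketch of SimNorm, assuming the latent dimension is divisible by the group size V and a softmax temperature of 1:

```python
import torch
import torch.nn as nn

class SimNorm(nn.Module):
    """Split z into L groups of size V and apply a softmax within each group,
    so z becomes a concatenation of L points on (V-1)-simplices."""
    def __init__(self, group_size=8):
        super().__init__()
        self.V = group_size

    def forward(self, z):
        shape = z.shape                          # (..., L * V)
        z = z.view(*shape[:-1], -1, self.V)      # (..., L, V)
        z = torch.softmax(z, dim=-1)             # one simplex per group
        return z.view(*shape)
```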

The h, d, R, Q components are jointly optimized to minimize the model objective over trajectories sampled from the replay buffer \mathcal{B}

\mathcal{L}(\theta) = \mathcal{E}_{(s,\ a,\ r,\ s')_{0:\mathrm{T}} \sim \mathcal{B}} \left[ \sum_{t = 0}^{\mathrm{T}} \lambda^{t} \left( \Big\| \hat{z}_{t}' - \operatorname{sg}(h_{\theta}(s_{t}')) \Big\|^{2} + \operatorname{CE}(\hat{r}_{t},\ r_{t}) + \operatorname{CE} \Big( \hat{q}_{t},\ r_{t} + \gamma Q_{\theta^{-}}(\hat{z}_{t}',\ \pi_{\theta}(\hat{z}_{t}')) \Big) \right) \right]
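The CE terms treat reward and value as categorical predictions over discrete bins. One common way to realize this is soft cross-entropy against a "two-hot" encoding of the scalar target on a fixed bin grid, sketched below; the bin range and count are assumptions, and the log-style squashing of targets used in practice is omitted.

```python
import torch

def two_hot(x, vmin=-10.0, vmax=10.0, num_bins=101):
    """Encode scalar targets x as two-hot vectors over a uniform bin grid."""
    x = x.clamp(vmin, vmax)
    pos = (x - vmin) / ((vmax - vmin) / (num_bins - 1))   # fractional bin index
    lo = pos.floor().long().clamp(0, num_bins - 1)
    hi = (lo + 1).clamp(max=num_bins - 1)
    w_hi = (pos - lo.float()).clamp(0.0, 1.0)
    target = torch.zeros(*x.shape, num_bins, device=x.device)
    target.scatter_(-1, lo.unsqueeze(-1), (1.0 - w_hi).unsqueeze(-1))
    target.scatter_add_(-1, hi.unsqueeze(-1), w_hi.unsqueeze(-1))
    return target

def soft_ce(logits, scalar_target):
    """Cross-entropy between predicted bin logits and the two-hot target."""
    log_probs = torch.log_softmax(logits, dim=-1)
    return -(two_hot(scalar_target) * log_probs).sum(dim=-1)
```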

The policy prior learns to maximize a maximum-entropy objective, whose gradients are taken w.r.t. the policy parameters only

\mathcal{L}_{p}(\theta) = \mathcal{E}_{(s,\ a)_{0:\mathrm{T}} \sim \mathcal{B}} \left[ \sum_{t = 0}^{\mathrm{T}} \lambda^{t} \Big( \alpha Q_{\theta}(z_{t},\ \pi_{\theta}(z_{t})) + \beta \mathcal{H}(\pi_{\theta} \mid z_{t}) \Big) \right] \quad \mathrm{s.t.}\ z_{t + 1} = d_{\theta}(z_{t},\ a_{t}),\ z_{0} = h_{\theta}(s_{0})
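A hedged sketch of this update, written as a loss to minimize (the negated objective): `policy(z)` is assumed to return a reparameterized action and its log-probability, the entropy is estimated from that single sample, `q_fn` is treated as frozen so only \pi_{\theta} is updated, and \alpha, \beta, \lambda are placeholder coefficients.

```python
def policy_prior_loss(policy, q_fn, z_seq, alpha=1.0, beta=0.01, lam=0.5):
    """Negated maximum-entropy objective over a sequence of latents z_seq
    produced by a model rollout; gradients should update pi_theta only."""
    loss = 0.0
    for t, z in enumerate(z_seq):
        a, log_prob = policy(z.detach())          # rsample and log pi(a | z)
        q = q_fn(z.detach(), a)                   # Q is assumed frozen here
        entropy_est = -log_prob                   # single-sample entropy estimate
        loss = loss - (lam ** t) * (alpha * q + beta * entropy_est).mean()
    return loss
```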

Similar to TD-MPC v1, TD-MPC v2 leverages MPPI for local trajectory optimization with a terminal value estimate

\mu^{\star},\ \sigma^{\star} = \argmax_{\mu,\ \sigma} \mathcal{E}_{a_{t:t + \mathrm{H}} \sim \mathcal{N}(\mu,\ \sigma)} \left[ \sum_{h = t}^{\mathrm{H} - 1} \gamma^{h} R_{\theta}(z_{h},\ a_{h}) + \gamma^{\mathrm{H}} Q_{\theta}(z_{t + \mathrm{H}},\ a_{t + \mathrm{H}}) \right]

To accelerate convergence of planning, a fraction of the sampled action sequences originates from the policy prior, and the 1-step shifted parameters from the previous decision step are reused to warm-start planning.

