MBPO

Performance Bound

Complete Rollout

Suppose the expected TVD between the two dynamics $p$ and $\hat{p}$ under the data-collecting policy $\pi_{D}$ is bounded as

$$
\max_{t} \mathcal{E}_{s_{t} \sim b_{D}^{t}(\cdot)} \mathcal{E}_{a_{t} \sim \pi_{D}(\cdot \mid s_{t})} D_{\mathrm{TV}} \Big( p(\cdot \mid s_{t},\ a_{t})\ \|\ \hat{p}(\cdot \mid s_{t},\ a_{t}) \Big) \le \epsilon_{m}
$$

and the policy shift between the data-collecting policy $\pi_{D}$ and the new policy $\pi$ is bounded as

$$
\max_{s} D_{\mathrm{TV}} \Big( \pi_{D}(\cdot \mid s)\ \|\ \pi(\cdot \mid s) \Big) \le \epsilon_{\pi}
$$

Then the difference between returns can be bounded through Lemma ③

$$
\eta[\pi] - \hat{\eta}[\pi] = \underbrace{\eta[\pi] - \eta[\pi_{D}]} + \underbrace{\eta[\pi_{D}] - \hat{\eta}[\pi]} \ge -2 r_{\max} \left[ \frac{1}{1 - \gamma} \epsilon_{\pi} + \frac{\gamma}{(1 - \gamma)^{2}} \epsilon_{\pi} \right] - 2 r_{\max} \left[ \frac{1}{1 - \gamma} \epsilon_{\pi} + \frac{\gamma}{(1 - \gamma)^{2}} (\epsilon_{m} + \epsilon_{\pi}) \right]
$$

Thus

$$
\eta[\pi] \ge \hat{\eta}[\pi] - 2 r_{\max} \left[ \frac{\gamma}{(1 - \gamma)^{2}} (\epsilon_{m} + 2 \epsilon_{\pi}) + \frac{2}{1 - \gamma} \epsilon_{\pi} \right]
$$

Branched Rollout

Under the branched rollout scheme with a branch length of $k$, consider three trajectories generated through

| Returns | Pre-branch dynamics | Pre-branch policy | Post-branch dynamics | Post-branch policy |
| --- | --- | --- | --- | --- |
| $\eta[\pi]$ | $p(s_{t + 1} \mid s_{t},\ a_{t})$ | $\pi(a_{t} \mid s_{t})$ | $p(s_{t + 1} \mid s_{t},\ a_{t})$ | $\pi(a_{t} \mid s_{t})$ |
| $\eta^{\mathrm{branch}}[\pi_{D},\ \pi]$ | $p(s_{t + 1} \mid s_{t},\ a_{t})$ | $\pi_{D}(a_{t} \mid s_{t})$ | $\hat{p}(s_{t + 1} \mid s_{t},\ a_{t})$ | $\pi(a_{t} \mid s_{t})$ |
| $\eta[\pi_{D},\ \pi]$ | $p(s_{t + 1} \mid s_{t},\ a_{t})$ | $\pi_{D}(a_{t} \mid s_{t})$ | $p(s_{t + 1} \mid s_{t},\ a_{t})$ | $\pi(a_{t} \mid s_{t})$ |

Suppose the expected TVD between the two dynamics $p$ and $\hat{p}$ under the new policy $\pi$ is bounded as

$$
\max_{t} \mathcal{E}_{s_{t} \sim b^{t}(\cdot)} \mathcal{E}_{a_{t} \sim \pi(\cdot \mid s_{t})} D_{\mathrm{TV}} \Big( p(\cdot \mid s_{t},\ a_{t})\ \|\ \hat{p}(\cdot \mid s_{t},\ a_{t}) \Big) \le \epsilon_{m'}
$$

and the policy shift between the data-collecting policy $\pi_{D}$ and the new policy $\pi$ is bounded as

$$
\max_{s} D_{\mathrm{TV}} \Big( \pi_{D}(\cdot \mid s)\ \|\ \pi(\cdot \mid s) \Big) \le \epsilon_{\pi}
$$

The difference between returns derived from the occupancy measures can be bounded through Lemma ④

$$
\eta[\pi] - \eta^{\mathrm{branch}}[\pi_{D},\ \pi] = \underbrace{\eta[\pi] - \eta[\pi_{D},\ \pi]} + \underbrace{\eta[\pi_{D},\ \pi] - \eta^{\mathrm{branch}}[\pi_{D},\ \pi]} \ge -2 r_{\max} \frac{\gamma^{k + 1}}{(1 - \gamma)^{2}} \epsilon_{\pi} - 2 r_{\max} \frac{k}{1 - \gamma} \epsilon_{m'}
$$

Thus

$$
\eta[\pi] \ge \eta^{\mathrm{branch}}[\pi_{D},\ \pi] - 2 r_{\max} \left[ \frac{\gamma^{k + 1}}{(1 - \gamma)^{2}} \epsilon_{\pi} + \frac{k}{1 - \gamma} \epsilon_{m'} \right]
$$

When $\epsilon_{m'}$ is sufficiently low relative to $\epsilon_{\pi}$, the optimal rollout length satisfies

$$
k^{\star} = \operatorname*{arg\,min}_{k} \left[ \frac{\gamma^{k + 1}}{(1 - \gamma)^{2}} \epsilon_{\pi} + \frac{k}{1 - \gamma} \epsilon_{m'} \right] > 0
$$
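As a quick numerical sanity check, the penalty term can be minimized over $k$ directly. The constants below ($\gamma$, $\epsilon_{\pi}$, $\epsilon_{m'}$, $r_{\max}$) are illustrative assumptions only; the point is that once the model error is small enough relative to the policy shift, the minimizer moves above zero.

```python
import numpy as np

def branch_penalty(k, gamma=0.95, eps_pi=0.1, eps_m=1e-3, r_max=1.0):
    """Penalty term of the branched-rollout bound (illustrative constants)."""
    return 2 * r_max * (gamma ** (k + 1) / (1 - gamma) ** 2 * eps_pi
                        + k / (1 - gamma) * eps_m)

ks = np.arange(0, 500)
penalty = branch_penalty(ks)
k_star = int(ks[np.argmin(penalty)])
print(k_star, penalty[k_star] < penalty[0])  # k_star > 0: nonzero-length rollouts tighten the bound
```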

MBPO with DRL

The theoretical results suggest that a method should make use of truncated but nonzero-length model rollouts

Predictive Model

Use a bootstrap ensemble of dynamics models $\{ p_{\theta}^{1},\ p_{\theta}^{2},\ \cdots,\ p_{\theta}^{B} \}$, where

$$
p_{\theta}^{i}(s',\ r \mid s,\ a) = \mathcal{N} \Big[ \mu_{\theta}^{i}(s,\ a),\ \Sigma_{\theta}^{i}(s,\ a) \Big], \qquad \Sigma_{\theta}^{i}(s,\ a) = \mathrm{diag} \Big[ \sigma_{\theta}^{i}(s,\ a)^{2} \Big]
$$
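A minimal sketch of one ensemble member, assuming a PyTorch MLP that predicts the mean and diagonal log-variance of the next state and reward; the layer sizes, activation, and log-variance clamping range are illustrative choices rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class GaussianDynamicsModel(nn.Module):
    """One ensemble member: a diagonal Gaussian over (next state, reward)."""
    def __init__(self, obs_dim, act_dim, hidden=200):
        super().__init__()
        out_dim = obs_dim + 1  # next state + scalar reward
        self.body = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
        )
        self.mu = nn.Linear(hidden, out_dim)
        self.log_var = nn.Linear(hidden, out_dim)

    def forward(self, s, a):
        h = self.body(torch.cat([s, a], dim=-1))
        return self.mu(h), self.log_var(h).clamp(-10.0, 2.0)  # keep variances sane

def gaussian_nll(mu, log_var, target):
    """Diagonal Gaussian negative log-likelihood (up to an additive constant)."""
    return 0.5 * ((target - mu) ** 2 * torch.exp(-log_var) + log_var).sum(-1).mean()

# Each of the B members is trained with this loss on its own bootstrap
# resample of the environment buffer.
```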

Policy Optimization

Use SAC as the policy optimization algorithm, which trains an actor $\pi_{\phi}$ by minimizing the expected KL divergence

$$
\min_{\phi} J_{\pi}(\phi;\ \mathcal{D}) = \mathcal{E}_{s \sim \mathcal{D}} D_{\mathrm{KL}} \Big[ \pi\ \|\ \exp(Q^{\pi} - V^{\pi}) \Big]
$$
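In practice this KL objective reduces to the usual SAC actor loss $\mathcal{E}\big[\alpha \log \pi_{\phi}(a \mid s) - Q(s,\ a)\big]$ with reparameterized action samples. A minimal sketch, assuming a critic `q_net`, an actor exposing a hypothetical `rsample_with_logprob` method, and a fixed temperature `alpha`:

```python
def sac_actor_loss(actor, q_net, states, alpha=0.2):
    """SAC policy objective: E[alpha * log pi(a|s) - Q(s, a)] over sampled states."""
    actions, log_prob = actor.rsample_with_logprob(states)  # reparameterized sample
    return (alpha * log_prob - q_net(states, actions)).mean()
```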

Model Usage

Branching replaces a few long rollouts from the initial state distribution with many short rollouts starting from replay-buffer states, which effectively relieves the limitation caused by compounding model error
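A schematic of this model-usage step, assuming hypothetical interfaces `env_buffer.sample_states`, `model.predict`, `policy.act`, and `model_buffer.add`; one ensemble member is sampled per model step, and the synthetic transitions feed the SAC updates.

```python
import random

def branched_rollouts(models, policy, env_buffer, model_buffer, n_starts=400, k=1):
    """Roll the learned model k steps forward from states drawn off the
    environment replay buffer, storing synthetic transitions for the agent."""
    states = env_buffer.sample_states(n_starts)
    for _ in range(k):
        actions = [policy.act(s) for s in states]
        model = random.choice(models)          # pick one ensemble member per step
        next_states, rewards = model.predict(states, actions)
        for s, a, r, s2 in zip(states, actions, rewards, next_states):
            model_buffer.add(s, a, r, s2)
        states = next_states
```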

Useful Lemma

Lemma ① Joint Distribution TVD Bound

For two joint distributions $p_{1}(x,\ y)$ and $p_{2}(x,\ y)$, the total variation distance between them can be bounded as

$$
\begin{aligned}
D_{\mathrm{TV}} \Big( p_{1}(\cdot,\ \cdot)\ \|\ p_{2}(\cdot,\ \cdot) \Big) &= \frac{1}{2} \sum_{x} \sum_{y} \Big| p_{1}(x,\ y) - p_{2}(x,\ y) \Big| = \frac{1}{2} \sum_{x} \sum_{y} \Big| p_{1}(y \mid x) p_{1}(x) - p_{2}(y \mid x) p_{2}(x) \Big| \\[7mm]
&= \frac{1}{2} \sum_{x} \sum_{y} \Big| p_{1}(y \mid x) p_{1}(x) - p_{2}(y \mid x) p_{1}(x) + p_{2}(y \mid x) p_{1}(x) - p_{2}(y \mid x) p_{2}(x) \Big| \\[7mm]
&\le \frac{1}{2} \sum_{x} \sum_{y} p_{1}(x) \Big| p_{1}(y \mid x) - p_{2}(y \mid x) \Big| + \frac{1}{2} \sum_{x} \sum_{y} p_{2}(y \mid x) \Big| p_{1}(x) - p_{2}(x) \Big| \\[7mm]
&= \sum_{x} p_{1}(x) \frac{1}{2} \sum_{y} \Big| p_{1}(y \mid x) - p_{2}(y \mid x) \Big| + \frac{1}{2} \sum_{x} \Big| p_{1}(x) - p_{2}(x) \Big| \sum_{y} p_{2}(y \mid x) \\[7mm]
&= \mathcal{E}_{x \sim p_{1}(\cdot)} D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid x)\ \|\ p_{2}(\cdot \mid x) \Big) + D_{\mathrm{TV}} \Big( p_{1}(\cdot)\ \|\ p_{2}(\cdot) \Big) \\[7mm]
&\le \max_{x} D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid x)\ \|\ p_{2}(\cdot \mid x) \Big) + D_{\mathrm{TV}} \Big( p_{1}(\cdot)\ \|\ p_{2}(\cdot) \Big)
\end{aligned}
$$
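A quick numerical check of Lemma ① on a small discrete example; the two joint distributions below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p1 = rng.random((3, 4)); p1 /= p1.sum()   # joint p1(x, y)
p2 = rng.random((3, 4)); p2 /= p2.sum()   # joint p2(x, y)

tvd_joint = 0.5 * np.abs(p1 - p2).sum()

marg1, marg2 = p1.sum(axis=1), p2.sum(axis=1)   # marginals p_i(x)
cond1 = p1 / marg1[:, None]                     # conditionals p_1(y | x)
cond2 = p2 / marg2[:, None]                     # conditionals p_2(y | x)
bound = 0.5 * np.abs(cond1 - cond2).sum(axis=1).max() + 0.5 * np.abs(marg1 - marg2).sum()

assert tvd_joint <= bound + 1e-12
```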

Lemma ② Markov Chain TVD Bound

For two Markov chains $p_{1}(s_{t + 1} \mid s_{t})$ and $p_{2}(s_{t + 1} \mid s_{t})$ with the same initial distribution $p_{1}^{0}(s_{0}) = p_{2}^{0}(s_{0})$

$$
\begin{aligned}
\Big| p_{1}^{t}(s_{t}) - p_{2}^{t}(s_{t}) \Big| &= \Big| \sum_{s_{t - 1}} p_{1}^{t - 1}(s_{t - 1}) p_{1}(s_{t} \mid s_{t - 1}) - \sum_{s_{t - 1}} p_{2}^{t - 1}(s_{t - 1}) p_{2}(s_{t} \mid s_{t - 1}) \Big| \\[7mm]
&\le \sum_{s_{t - 1}} p_{1}^{t - 1}(s_{t - 1}) \Big| p_{1}(s_{t} \mid s_{t - 1}) - p_{2}(s_{t} \mid s_{t - 1}) \Big| + \sum_{s_{t - 1}} p_{2}(s_{t} \mid s_{t - 1}) \Big| p_{1}^{t - 1}(s_{t - 1}) - p_{2}^{t - 1}(s_{t - 1}) \Big| \\[7mm]
&= \mathcal{E}_{s_{t - 1} \sim p_{1}^{t - 1}(\cdot)} \Big| p_{1}(s_{t} \mid s_{t - 1}) - p_{2}(s_{t} \mid s_{t - 1}) \Big| + \sum_{s_{t - 1}} p_{2}(s_{t} \mid s_{t - 1}) \Big| p_{1}^{t - 1}(s_{t - 1}) - p_{2}^{t - 1}(s_{t - 1}) \Big|
\end{aligned}
$$

Then the total variation distance between the state marginal distributions is bounded as

$$
\begin{aligned}
\epsilon_{t} &= D_{\mathrm{TV}} \Big( p_{1}^{t}(\cdot)\ \|\ p_{2}^{t}(\cdot) \Big) = \frac{1}{2} \sum_{s_{t}} \Big| p_{1}^{t}(s_{t}) - p_{2}^{t}(s_{t}) \Big| \\[7mm]
&\le \mathcal{E}_{s_{t - 1} \sim p_{1}^{t - 1}(\cdot)} \frac{1}{2} \sum_{s_{t}} \Big| p_{1}(s_{t} \mid s_{t - 1}) - p_{2}(s_{t} \mid s_{t - 1}) \Big| + \frac{1}{2} \sum_{s_{t - 1}} \Big| p_{1}^{t - 1}(s_{t - 1}) - p_{2}^{t - 1}(s_{t - 1}) \Big| \sum_{s_{t}} p_{2}(s_{t} \mid s_{t - 1}) \\[7mm]
&= \mathcal{E}_{s_{t - 1} \sim p_{1}^{t - 1}(\cdot)} D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid s_{t - 1})\ \|\ p_{2}(\cdot \mid s_{t - 1}) \Big) + D_{\mathrm{TV}} \Big( p_{1}^{t - 1}(\cdot)\ \|\ p_{2}^{t - 1}(\cdot) \Big) \\[7mm]
&= \delta_{t} + \epsilon_{t - 1} = \delta_{t} + \delta_{t - 1} + \epsilon_{t - 2} = \cdots = \sum_{\tau = 1}^{t} \delta_{\tau} + \epsilon_{0} = \sum_{\tau = 1}^{t} \delta_{\tau} \le t \delta
\end{aligned}
$$

where $\delta_{t}$ denotes the expected one-step TVD and is assumed to be uniformly upper bounded by $\delta$

$$
\delta_{t + 1} = \mathcal{E}_{s_{t} \sim p_{1}^{t}(\cdot)} D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid s_{t})\ \|\ p_{2}(\cdot \mid s_{t}) \Big), \qquad \max_{t} \delta_{t} \le \delta
$$

Lemma ③ Model Returns Bound

For two dynamics models $p_{1}(s_{t + 1} \mid s_{t},\ a_{t})$, $p_{2}(s_{t + 1} \mid s_{t},\ a_{t})$ and their corresponding policies $\pi_{1}(a_{t} \mid s_{t})$, $\pi_{2}(a_{t} \mid s_{t})$

$$
\begin{aligned}
|\eta_{1} - \eta_{2}| &= \left| \sum_{t = 0}^{\infty} \gamma^{t} \sum_{s_{t}} \sum_{a_{t}} \Big[ b_{1}^{t}(s_{t}) \pi_{1}(a_{t} \mid s_{t}) - b_{2}^{t}(s_{t}) \pi_{2}(a_{t} \mid s_{t}) \Big] \mathcal{R}(s_{t},\ a_{t}) \right| \\[7mm]
&\le \sum_{t = 0}^{\infty} \gamma^{t} \sum_{s_{t}} \sum_{a_{t}} \Big| b_{1}^{t}(s_{t}) \pi_{1}(a_{t} \mid s_{t}) - b_{2}^{t}(s_{t}) \pi_{2}(a_{t} \mid s_{t}) \Big| \cdot \Big| \mathcal{R}(s_{t},\ a_{t}) \Big| \\[7mm]
&\le r_{\max} \sum_{t = 0}^{\infty} \gamma^{t} \sum_{s_{t}} \sum_{a_{t}} \Big| p_{1}^{t}(s_{t},\ a_{t}) - p_{2}^{t}(s_{t},\ a_{t}) \Big| = 2 r_{\max} \sum_{t = 0}^{\infty} \gamma^{t} D_{\mathrm{TV}} \Big( p_{1}^{t}(\cdot,\ \cdot)\ \|\ p_{2}^{t}(\cdot,\ \cdot) \Big) \\[7mm]
&\le 2 r_{\max} \sum_{t = 0}^{\infty} \gamma^{t} \left[ \max_{s_{t}} D_{\mathrm{TV}} \Big( \pi_{1}(\cdot \mid s_{t})\ \|\ \pi_{2}(\cdot \mid s_{t}) \Big) + D_{\mathrm{TV}} \Big( b_{1}^{t}(\cdot)\ \|\ b_{2}^{t}(\cdot) \Big) \right]
\end{aligned}
$$

Suppose the first term is bounded as $\max_{s_{t}} D_{\mathrm{TV}} \big( \pi_{1}(\cdot \mid s_{t})\ \|\ \pi_{2}(\cdot \mid s_{t}) \big) \le \epsilon_{\pi}$; the second term can then be bounded via Lemma ② as

$$
D_{\mathrm{TV}} \Big( b_{1}^{t}(\cdot)\ \|\ b_{2}^{t}(\cdot) \Big) \le t \max_{t} \mathcal{E}_{s_{t - 1} \sim b_{1}^{t - 1}(\cdot)} D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid s_{t - 1})\ \|\ p_{2}(\cdot \mid s_{t - 1}) \Big)
$$

where

$$
\begin{aligned}
&D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid s_{t - 1})\ \|\ p_{2}(\cdot \mid s_{t - 1}) \Big) \\[7mm]
=\ &\frac{1}{2} \sum_{s_{t}} \Big| p_{1}(s_{t} \mid s_{t - 1}) - p_{2}(s_{t} \mid s_{t - 1}) \Big| = \frac{1}{2} \sum_{s_{t}} \Big| \sum_{a_{t - 1}} p_{1}(s_{t},\ a_{t - 1} \mid s_{t - 1}) - \sum_{a_{t - 1}} p_{2}(s_{t},\ a_{t - 1} \mid s_{t - 1}) \Big| \\[7mm]
\le\ &\frac{1}{2} \sum_{s_{t}} \sum_{a_{t - 1}} \Big| p_{1}(s_{t},\ a_{t - 1} \mid s_{t - 1}) - p_{2}(s_{t},\ a_{t - 1} \mid s_{t - 1}) \Big| = D_{\mathrm{TV}} \Big( p_{1}(\cdot,\ \cdot \mid s_{t - 1})\ \|\ p_{2}(\cdot,\ \cdot \mid s_{t - 1}) \Big) \\[7mm]
\le\ &\mathcal{E}_{a_{t - 1} \sim \pi_{1}(\cdot \mid s_{t - 1})} D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid s_{t - 1},\ a_{t - 1})\ \|\ p_{2}(\cdot \mid s_{t - 1},\ a_{t - 1}) \Big) + D_{\mathrm{TV}} \Big( \pi_{1}(\cdot \mid s_{t - 1})\ \|\ \pi_{2}(\cdot \mid s_{t - 1}) \Big) \\[7mm]
\le\ &\mathcal{E}_{a_{t - 1} \sim \pi_{1}(\cdot \mid s_{t - 1})} D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid s_{t - 1},\ a_{t - 1})\ \|\ p_{2}(\cdot \mid s_{t - 1},\ a_{t - 1}) \Big) + \epsilon_{\pi}
\end{aligned}
$$

Suppose the expected total variation distance between the two dynamics models can be bounded as

$$
\max_{t} \mathcal{E}_{s_{t} \sim b_{1}^{t}(\cdot)} \mathcal{E}_{a_{t} \sim \pi_{1}(\cdot \mid s_{t})} D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid s_{t},\ a_{t})\ \|\ p_{2}(\cdot \mid s_{t},\ a_{t}) \Big) \le \epsilon_{m}
$$

then

$$
D_{\mathrm{TV}} \Big( b_{1}^{t}(\cdot)\ \|\ b_{2}^{t}(\cdot) \Big) \le t (\epsilon_{m} + \epsilon_{\pi})
$$

Thus, plugging this back in gives

$$
|\eta_{1} - \eta_{2}| \le 2 r_{\max} \sum_{t = 0}^{\infty} \gamma^{t} \Big[ \epsilon_{\pi} + t(\epsilon_{m} + \epsilon_{\pi}) \Big] = 2 r_{\max} \left[ \frac{1}{1 - \gamma} \epsilon_{\pi} + \frac{\gamma}{(1 - \gamma)^{2}} (\epsilon_{m} + \epsilon_{\pi}) \right]
$$
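The closed form follows from the standard geometric-series identities

$$
\sum_{t = 0}^{\infty} \gamma^{t} = \frac{1}{1 - \gamma}, \qquad \sum_{t = 0}^{\infty} t \gamma^{t} = \frac{\gamma}{(1 - \gamma)^{2}}
$$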

Lemma ④ Branched Rollout Occupancy Measurement TVD Bound

Run a branched rollout of length $k$ starting from a branch point at time $t - k$ and generate two trajectories through

| Pre-branch dynamics | Pre-branch policy | Post-branch dynamics | Post-branch policy |
| --- | --- | --- | --- |
| $p_{1}^{\mathrm{pre}}(s_{t + 1} \mid s_{t},\ a_{t})$ | $\pi_{1}^{\mathrm{pre}}(a_{t} \mid s_{t})$ | $p_{1}^{\mathrm{post}}(s_{t + 1} \mid s_{t},\ a_{t})$ | $\pi_{1}^{\mathrm{post}}(a_{t} \mid s_{t})$ |
| $p_{2}^{\mathrm{pre}}(s_{t + 1} \mid s_{t},\ a_{t})$ | $\pi_{2}^{\mathrm{pre}}(a_{t} \mid s_{t})$ | $p_{2}^{\mathrm{post}}(s_{t + 1} \mid s_{t},\ a_{t})$ | $\pi_{2}^{\mathrm{post}}(a_{t} \mid s_{t})$ |

For $t \ge k$, the total variation distance between the state-action marginals at time step $t$ is bounded as

$$
D_{\mathrm{TV}} \Big( p_{1}^{t}(\cdot,\ \cdot)\ \|\ p_{2}^{t}(\cdot,\ \cdot) \Big) \le \max_{s} D_{\mathrm{TV}} \Big( \pi_{1}^{\mathrm{post}}(\cdot \mid s)\ \|\ \pi_{2}^{\mathrm{post}}(\cdot \mid s) \Big) + D_{\mathrm{TV}} \Big( b_{1}^{t}(\cdot)\ \|\ b_{2}^{t}(\cdot) \Big)
$$

The first term is assumed to be bounded by $\epsilon_{\pi}^{\mathrm{post}}$, and the second term can be bounded by

$$
\epsilon_{t}^{\mathrm{post}} = D_{\mathrm{TV}} \Big( b_{1}^{t}(\cdot)\ \|\ b_{2}^{t}(\cdot) \Big) \le \epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}} + \epsilon_{t - 1}^{\mathrm{post}} \le \cdots \le k (\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \epsilon_{t - k}^{\mathrm{post}}
$$

where $\epsilon_{m}^{\mathrm{post}}$ bounds the expected TVD between the post-branch dynamics as

$$
\max_{t - k < \tau \le t} \mathcal{E}_{s_{\tau - 1} \sim b_{1}^{\tau - 1}(\cdot)} \mathcal{E}_{a_{\tau - 1} \sim \pi_{1}^{\mathrm{post}}(\cdot \mid s_{\tau - 1})} D_{\mathrm{TV}} \Big( p_{1}^{\mathrm{post}}(\cdot \mid s_{\tau - 1},\ a_{\tau - 1})\ \|\ p_{2}^{\mathrm{post}}(\cdot \mid s_{\tau - 1},\ a_{\tau - 1}) \Big) \le \epsilon_{m}^{\mathrm{post}}
$$

$\epsilon_{t - k}^{\mathrm{post}}$ can be further bounded analogously through the pre-branch error bounds

$$
\epsilon_{t - k}^{\mathrm{post}} = \epsilon_{t - k}^{\mathrm{pre}} = D_{\mathrm{TV}} \Big( b_{1}^{t - k}(\cdot)\ \|\ b_{2}^{t - k}(\cdot) \Big) \le \epsilon_{m}^{\mathrm{pre}} + \epsilon_{\pi}^{\mathrm{pre}} + \epsilon_{t - k - 1}^{\mathrm{pre}} \le \cdots \le (t - k)(\epsilon_{m}^{\mathrm{pre}} + \epsilon_{\pi}^{\mathrm{pre}}) + \epsilon_{0}^{\mathrm{pre}} = (t - k)(\epsilon_{m}^{\mathrm{pre}} + \epsilon_{\pi}^{\mathrm{pre}})
$$

where $\epsilon_{m}^{\mathrm{pre}}$ and $\epsilon_{\pi}^{\mathrm{pre}}$ are defined analogously to $\epsilon_{m}^{\mathrm{post}}$ and $\epsilon_{\pi}^{\mathrm{post}}$, respectively. The original inequality can be rewritten as

$$
D_{\mathrm{TV}} \Big( p_{1}^{t}(\cdot,\ \cdot)\ \|\ p_{2}^{t}(\cdot,\ \cdot) \Big) \le (t - k) (\epsilon_{m}^{\mathrm{pre}} + \epsilon_{\pi}^{\mathrm{pre}}) + k (\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \epsilon_{\pi}^{\mathrm{post}}
$$

For $t < k$, the trajectories are generated entirely by the post-branch dynamics and policy, so

$$
D_{\mathrm{TV}} \Big( p_{1}^{t}(\cdot,\ \cdot)\ \|\ p_{2}^{t}(\cdot,\ \cdot) \Big) \le t (\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \epsilon_{\pi}^{\mathrm{post}} \le k (\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \epsilon_{\pi}^{\mathrm{post}}
$$

The total variation distance between the occupancy measures derived from the above state-action marginals is bounded as

$$
\begin{aligned}
&D_{\mathrm{TV}} \Big( \rho_{1}(\cdot,\ \cdot)\ \|\ \rho_{2}(\cdot,\ \cdot) \Big) = \frac{1}{2} \sum_{s} \sum_{a} \left| (1 - \gamma) \sum_{t = 0}^{\infty} \gamma^{t} \Big[ p_{1}^{t}(s,\ a) - p_{2}^{t}(s,\ a) \Big] \right| \\[7mm]
\le\ &(1 - \gamma) \left[ \sum_{t = 0}^{k - 1} \gamma^{t} D_{\mathrm{TV}} \Big( p_{1}^{t}(\cdot,\ \cdot)\ \|\ p_{2}^{t}(\cdot,\ \cdot) \Big) + \sum_{t = k}^{\infty} \gamma^{t} D_{\mathrm{TV}} \Big( p_{1}^{t}(\cdot,\ \cdot)\ \|\ p_{2}^{t}(\cdot,\ \cdot) \Big) \right] \\[7mm]
\le\ &(1 - \gamma) \left[ \sum_{t = 0}^{k - 1} \gamma^{t} \Big[ k (\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \epsilon_{\pi}^{\mathrm{post}} \Big] + \sum_{t = k}^{\infty} \gamma^{t} \Big[ (t - k) (\epsilon_{m}^{\mathrm{pre}} + \epsilon_{\pi}^{\mathrm{pre}}) + k (\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \epsilon_{\pi}^{\mathrm{post}} \Big] \right] \\[7mm]
=\ &(1 - \gamma) \left[ \underset{t < k}{\underbrace{\frac{1 - \gamma^{k}}{1 - \gamma} k (\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \frac{1 - \gamma^{k}}{1 - \gamma} \epsilon_{\pi}^{\mathrm{post}}}} + \underset{t \ge k}{\underbrace{\frac{\gamma^{k + 1}}{(1 - \gamma)^{2}} (\epsilon_{m}^{\mathrm{pre}} + \epsilon_{\pi}^{\mathrm{pre}}) + \frac{\gamma^{k}}{1 - \gamma} k (\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \frac{\gamma^{k}}{1 - \gamma} \epsilon_{\pi}^{\mathrm{post}}}} \right]
\end{aligned}
$$
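The two sums are evaluated with the finite and shifted geometric-series identities

$$
\sum_{t = 0}^{k - 1} \gamma^{t} = \frac{1 - \gamma^{k}}{1 - \gamma}, \qquad \sum_{t = k}^{\infty} \gamma^{t} = \frac{\gamma^{k}}{1 - \gamma}, \qquad \sum_{t = k}^{\infty} (t - k) \gamma^{t} = \gamma^{k} \sum_{j = 0}^{\infty} j \gamma^{j} = \frac{\gamma^{k + 1}}{(1 - \gamma)^{2}}
$$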

Combining like terms gives the final result

$$
D_{\mathrm{TV}} \Big( \rho_{1}(\cdot,\ \cdot)\ \|\ \rho_{2}(\cdot,\ \cdot) \Big) \le k(\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \epsilon_{\pi}^{\mathrm{post}} + \frac{\gamma^{k + 1}}{1 - \gamma} (\epsilon_{m}^{\mathrm{pre}} + \epsilon_{\pi}^{\mathrm{pre}})
$$

