MBPO

Performance Bound

Complete Rollout

Suppose the expected TVD between the two dynamics $p$ and $\hat{p}$ under the data-collecting policy $\pi_{D}$ is bounded as

$$
\max_{t} \mathcal{E}_{s_{t} \sim b_{D}^{t}(\cdot)} \mathcal{E}_{a_{t} \sim \pi_{D}(\cdot \mid s_{t})} D_{\mathrm{TV}} \Big( p(\cdot \mid s_{t},\ a_{t})\ \|\ \hat{p}(\cdot \mid s_{t},\ a_{t}) \Big) \le \epsilon_{m}
$$

and the policy shift between the data-collecting policy $\pi_{D}$ and the new policy $\pi$ is bounded as

$$
\max_{s} D_{\mathrm{TV}} \Big( \pi_{D}(\cdot \mid s)\ \|\ \pi(\cdot \mid s) \Big) \le \epsilon_{\pi}
$$

Then the difference between returns can be bounded through Lemma ③

$$
\eta[\pi] - \hat{\eta}[\pi] = \underbrace{\eta[\pi] - \eta[\pi_{D}]} + \underbrace{\eta[\pi_{D}] - \hat{\eta}[\pi]} \ge -2 r_{\max} \left[ \frac{1}{1 - \gamma} \epsilon_{\pi} + \frac{\gamma}{(1 - \gamma)^{2}} \epsilon_{\pi} \right] - 2 r_{\max} \left[ \frac{1}{1 - \gamma} \epsilon_{\pi} + \frac{\gamma}{(1 - \gamma)^{2}} (\epsilon_{m} + \epsilon_{\pi}) \right]
$$

Thus

$$
\eta[\pi] \ge \hat{\eta}[\pi] - 2 r_{\max} \left[ \frac{\gamma}{(1 - \gamma)^{2}} (\epsilon_{m} + 2 \epsilon_{\pi}) + \frac{2}{1 - \gamma} \epsilon_{\pi} \right]
$$

Branched Rollout

Under the branched rollout scheme with a branch length of $k$, consider three trajectories generated through

| Returns | Pre-branch dynamics | Pre-branch policy | Post-branch dynamics | Post-branch policy |
| --- | --- | --- | --- | --- |
| $\eta[\pi]$ | $p(s_{t + 1} \mid s_{t},\ a_{t})$ | $\pi(a_{t} \mid s_{t})$ | $p(s_{t + 1} \mid s_{t},\ a_{t})$ | $\pi(a_{t} \mid s_{t})$ |
| $\eta^{\mathrm{branch}}[\pi_{D},\ \pi]$ | $p(s_{t + 1} \mid s_{t},\ a_{t})$ | $\pi_{D}(a_{t} \mid s_{t})$ | $\hat{p}(s_{t + 1} \mid s_{t},\ a_{t})$ | $\pi(a_{t} \mid s_{t})$ |
| $\eta[\pi_{D},\ \pi]$ | $p(s_{t + 1} \mid s_{t},\ a_{t})$ | $\pi_{D}(a_{t} \mid s_{t})$ | $p(s_{t + 1} \mid s_{t},\ a_{t})$ | $\pi(a_{t} \mid s_{t})$ |

Suppose the expected TVD between the two dynamics $p$ and $\hat{p}$ under the new policy $\pi$ is bounded as

$$
\max_{t} \mathcal{E}_{s_{t} \sim b^{t}(\cdot)} \mathcal{E}_{a_{t} \sim \pi(\cdot \mid s_{t})} D_{\mathrm{TV}} \Big( p(\cdot \mid s_{t},\ a_{t})\ \|\ \hat{p}(\cdot \mid s_{t},\ a_{t}) \Big) \le \epsilon_{m'}
$$

and the policy shift between the data-collecting policy $\pi_{D}$ and the new policy $\pi$ is bounded as

$$
\max_{s} D_{\mathrm{TV}} \Big( \pi_{D}(\cdot \mid s)\ \|\ \pi(\cdot \mid s) \Big) \le \epsilon_{\pi}
$$

The difference between returns derived from the occupancy measures can be bounded through Lemma ④

$$
\eta[\pi] - \eta^{\mathrm{branch}}[\pi_{D},\ \pi] = \underbrace{\eta[\pi] - \eta[\pi_{D},\ \pi]} + \underbrace{\eta[\pi_{D},\ \pi] - \eta^{\mathrm{branch}}[\pi_{D},\ \pi]} \ge -2 r_{\max} \frac{\gamma^{k + 1}}{(1 - \gamma)^{2}} \epsilon_{\pi} - 2 r_{\max} \frac{k}{1 - \gamma} \epsilon_{m'}
$$

Thus

$$
\eta[\pi] \ge \eta^{\mathrm{branch}}[\pi_{D},\ \pi] - 2 r_{\max} \left[ \frac{\gamma^{k + 1}}{(1 - \gamma)^{2}} \epsilon_{\pi} + \frac{k}{1 - \gamma} \epsilon_{m'} \right]
$$

When $\epsilon_{m'}$ is sufficiently low relative to $\epsilon_{\pi}$, the optimal rollout length satisfies

$$
k^{\star} = \operatorname*{arg\,min}_{k} \left[ \frac{\gamma^{k + 1}}{(1 - \gamma)^{2}} \epsilon_{\pi} + \frac{k}{1 - \gamma} \epsilon_{m'} \right] > 0
$$
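As a quick numerical sanity check, the penalty term can be minimized over $k$ directly. The constants below ($\gamma$, $\epsilon_{\pi}$, $\epsilon_{m'}$, $r_{\max}$) are illustrative assumptions only; the point is that once the model error is small enough relative to the policy shift, the minimizer moves above zero.

```python
import numpy as np

def branch_penalty(k, gamma=0.95, eps_pi=0.1, eps_m=1e-3, r_max=1.0):
    """Penalty term of the branched-rollout bound (illustrative constants)."""
    return 2 * r_max * (gamma ** (k + 1) / (1 - gamma) ** 2 * eps_pi
                        + k / (1 - gamma) * eps_m)

ks = np.arange(0, 500)
penalty = branch_penalty(ks)
k_star = int(ks[np.argmin(penalty)])
print(k_star, penalty[k_star] < penalty[0])  # k_star > 0: nonzero-length rollouts tighten the bound
```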

MBPO with DRL

The theoretical results suggest that a method should make use of truncated but nonzero-length model rollouts

Predictive Model

Use a bootstrap ensemble of dynamics models $\{ p_{\theta}^{1},\ p_{\theta}^{2},\ \cdots,\ p_{\theta}^{B} \}$, where

$$
p_{\theta}^{i}(s',\ r \mid s,\ a) = \mathcal{N} \Big[ \mu_{\theta}^{i}(s,\ a),\ \Sigma_{\theta}^{i}(s,\ a) \Big], \qquad \Sigma_{\theta}^{i}(s,\ a) = \mathrm{diag} \Big[ \sigma_{\theta}^{i}(s,\ a)^{2} \Big]
$$
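A minimal sketch of one ensemble member, assuming a PyTorch MLP that predicts the mean and diagonal log-variance of the next state and reward; the layer sizes, activation, and log-variance clamping range are illustrative choices rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class GaussianDynamicsModel(nn.Module):
    """One ensemble member: a diagonal Gaussian over (next state, reward)."""
    def __init__(self, obs_dim, act_dim, hidden=200):
        super().__init__()
        out_dim = obs_dim + 1  # next state + scalar reward
        self.body = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
        )
        self.mu = nn.Linear(hidden, out_dim)
        self.log_var = nn.Linear(hidden, out_dim)

    def forward(self, s, a):
        h = self.body(torch.cat([s, a], dim=-1))
        return self.mu(h), self.log_var(h).clamp(-10.0, 2.0)  # keep variances sane

def gaussian_nll(mu, log_var, target):
    """Diagonal Gaussian negative log-likelihood (up to an additive constant)."""
    return 0.5 * ((target - mu) ** 2 * torch.exp(-log_var) + log_var).sum(-1).mean()

# Each of the B members is trained with this loss on its own bootstrap
# resample of the environment buffer.
```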

Policy Optimization

Use SAC as the policy optimization algorithm, which trains an actor $\pi_{\phi}$ by minimizing the expected KL divergence

$$
\min_{\phi} J_{\pi}(\phi;\ \mathcal{D}) = \mathcal{E}_{s \sim \mathcal{D}} D_{\mathrm{KL}} \Big[ \pi\ \|\ \exp(Q^{\pi} - V^{\pi}) \Big]
$$
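In practice this KL objective reduces to the usual SAC actor loss $\mathcal{E}\big[\alpha \log \pi_{\phi}(a \mid s) - Q(s,\ a)\big]$ with reparameterized action samples. A minimal sketch, assuming a critic `q_net`, an actor exposing a hypothetical `rsample_with_logprob` method, and a fixed temperature `alpha`:

```python
def sac_actor_loss(actor, q_net, states, alpha=0.2):
    """SAC policy objective: E[alpha * log pi(a|s) - Q(s, a)] over sampled states."""
    actions, log_prob = actor.rsample_with_logprob(states)  # reparameterized sample
    return (alpha * log_prob - q_net(states, actions)).mean()
```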

Model Usage

Branching replaces a few long rollouts from the initial state distribution with many short rollouts starting from replay-buffer states, which effectively relieves the limitation caused by compounding model error
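A schematic of this model-usage step, assuming hypothetical interfaces `env_buffer.sample_states`, `model.predict`, `policy.act`, and `model_buffer.add`; one ensemble member is sampled per model step, and the synthetic transitions feed the SAC updates.

```python
import random

def branched_rollouts(models, policy, env_buffer, model_buffer, n_starts=400, k=1):
    """Roll the learned model k steps forward from states drawn off the
    environment replay buffer, storing synthetic transitions for the agent."""
    states = env_buffer.sample_states(n_starts)
    for _ in range(k):
        actions = [policy.act(s) for s in states]
        model = random.choice(models)          # pick one ensemble member per step
        next_states, rewards = model.predict(states, actions)
        for s, a, r, s2 in zip(states, actions, rewards, next_states):
            model_buffer.add(s, a, r, s2)
        states = next_states
```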

Useful Lemma

Lemma ① Joint Distribution TVD Bound

For two joint distributions $p_{1}(x,\ y)$ and $p_{2}(x,\ y)$, the total variation distance between them can be bounded as

$$
\begin{aligned}
D_{\mathrm{TV}} \Big( p_{1}(\cdot,\ \cdot)\ \|\ p_{2}(\cdot,\ \cdot) \Big) &= \frac{1}{2} \sum_{x} \sum_{y} \Big| p_{1}(x,\ y) - p_{2}(x,\ y) \Big| = \frac{1}{2} \sum_{x} \sum_{y} \Big| p_{1}(y \mid x) p_{1}(x) - p_{2}(y \mid x) p_{2}(x) \Big| \\[7mm]
&= \frac{1}{2} \sum_{x} \sum_{y} \Big| p_{1}(y \mid x) p_{1}(x) - p_{2}(y \mid x) p_{1}(x) + p_{2}(y \mid x) p_{1}(x) - p_{2}(y \mid x) p_{2}(x) \Big| \\[7mm]
&\le \frac{1}{2} \sum_{x} \sum_{y} p_{1}(x) \Big| p_{1}(y \mid x) - p_{2}(y \mid x) \Big| + \frac{1}{2} \sum_{x} \sum_{y} p_{2}(y \mid x) \Big| p_{1}(x) - p_{2}(x) \Big| \\[7mm]
&= \sum_{x} p_{1}(x) \frac{1}{2} \sum_{y} \Big| p_{1}(y \mid x) - p_{2}(y \mid x) \Big| + \frac{1}{2} \sum_{x} \Big| p_{1}(x) - p_{2}(x) \Big| \sum_{y} p_{2}(y \mid x) \\[7mm]
&= \mathcal{E}_{x \sim p_{1}(\cdot)} D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid x)\ \|\ p_{2}(\cdot \mid x) \Big) + D_{\mathrm{TV}} \Big( p_{1}(\cdot)\ \|\ p_{2}(\cdot) \Big) \\[7mm]
&\le \max_{x} D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid x)\ \|\ p_{2}(\cdot \mid x) \Big) + D_{\mathrm{TV}} \Big( p_{1}(\cdot)\ \|\ p_{2}(\cdot) \Big)
\end{aligned}
$$
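A quick numerical check of Lemma ① on a small discrete example; the two joint distributions below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p1 = rng.random((3, 4)); p1 /= p1.sum()   # joint p1(x, y)
p2 = rng.random((3, 4)); p2 /= p2.sum()   # joint p2(x, y)

tvd_joint = 0.5 * np.abs(p1 - p2).sum()

marg1, marg2 = p1.sum(axis=1), p2.sum(axis=1)   # marginals p_i(x)
cond1 = p1 / marg1[:, None]                     # conditionals p_1(y | x)
cond2 = p2 / marg2[:, None]                     # conditionals p_2(y | x)
bound = 0.5 * np.abs(cond1 - cond2).sum(axis=1).max() + 0.5 * np.abs(marg1 - marg2).sum()

assert tvd_joint <= bound + 1e-12
```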

Lemma ② Markov Chain TVD Bound

For two Markov chains $p_{1}(s_{t + 1} \mid s_{t})$ and $p_{2}(s_{t + 1} \mid s_{t})$ with the same initial distribution $p_{1}^{0}(s_{0}) = p_{2}^{0}(s_{0})$

$$
\begin{aligned}
\Big| p_{1}^{t}(s_{t}) - p_{2}^{t}(s_{t}) \Big| &= \Big| \sum_{s_{t - 1}} p_{1}^{t - 1}(s_{t - 1}) p_{1}(s_{t} \mid s_{t - 1}) - \sum_{s_{t - 1}} p_{2}^{t - 1}(s_{t - 1}) p_{2}(s_{t} \mid s_{t - 1}) \Big| \\[7mm]
&\le \sum_{s_{t - 1}} p_{1}^{t - 1}(s_{t - 1}) \Big| p_{1}(s_{t} \mid s_{t - 1}) - p_{2}(s_{t} \mid s_{t - 1}) \Big| + \sum_{s_{t - 1}} p_{2}(s_{t} \mid s_{t - 1}) \Big| p_{1}^{t - 1}(s_{t - 1}) - p_{2}^{t - 1}(s_{t - 1}) \Big| \\[7mm]
&= \mathcal{E}_{s_{t - 1} \sim p_{1}^{t - 1}(\cdot)} \Big| p_{1}(s_{t} \mid s_{t - 1}) - p_{2}(s_{t} \mid s_{t - 1}) \Big| + \sum_{s_{t - 1}} p_{2}(s_{t} \mid s_{t - 1}) \Big| p_{1}^{t - 1}(s_{t - 1}) - p_{2}^{t - 1}(s_{t - 1}) \Big|
\end{aligned}
$$

Then the total variation distance between the state marginal distributions is bounded as

$$
\begin{aligned}
\epsilon_{t} &= D_{\mathrm{TV}} \Big( p_{1}^{t}(\cdot)\ \|\ p_{2}^{t}(\cdot) \Big) = \frac{1}{2} \sum_{s_{t}} \Big| p_{1}^{t}(s_{t}) - p_{2}^{t}(s_{t}) \Big| \\[7mm]
&\le \mathcal{E}_{s_{t - 1} \sim p_{1}^{t - 1}(\cdot)} \frac{1}{2} \sum_{s_{t}} \Big| p_{1}(s_{t} \mid s_{t - 1}) - p_{2}(s_{t} \mid s_{t - 1}) \Big| + \frac{1}{2} \sum_{s_{t - 1}} \Big| p_{1}^{t - 1}(s_{t - 1}) - p_{2}^{t - 1}(s_{t - 1}) \Big| \sum_{s_{t}} p_{2}(s_{t} \mid s_{t - 1}) \\[7mm]
&= \mathcal{E}_{s_{t - 1} \sim p_{1}^{t - 1}(\cdot)} D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid s_{t - 1})\ \|\ p_{2}(\cdot \mid s_{t - 1}) \Big) + D_{\mathrm{TV}} \Big( p_{1}^{t - 1}(\cdot)\ \|\ p_{2}^{t - 1}(\cdot) \Big) \\[7mm]
&= \delta_{t} + \epsilon_{t - 1} = \delta_{t} + \delta_{t - 1} + \epsilon_{t - 2} = \cdots = \sum_{\tau = 1}^{t} \delta_{\tau} + \epsilon_{0} = \sum_{\tau = 1}^{t} \delta_{\tau} \le t \delta
\end{aligned}
$$

where $\delta_{t}$ denotes the expected one-step TVD and is assumed to be uniformly upper bounded by $\delta$

$$
\delta_{t + 1} = \mathcal{E}_{s_{t} \sim p_{1}^{t}(\cdot)} D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid s_{t})\ \|\ p_{2}(\cdot \mid s_{t}) \Big), \qquad \max_{t} \delta_{t} \le \delta
$$

Lemma ③ Model Returns Bound

For two dynamics models $p_{1}(s_{t + 1} \mid s_{t},\ a_{t})$, $p_{2}(s_{t + 1} \mid s_{t},\ a_{t})$ and their corresponding policies $\pi_{1}(a_{t} \mid s_{t})$, $\pi_{2}(a_{t} \mid s_{t})$

$$
\begin{aligned}
|\eta_{1} - \eta_{2}| &= \left| \sum_{t = 0}^{\infty} \gamma^{t} \sum_{s_{t}} \sum_{a_{t}} \Big[ b_{1}^{t}(s_{t}) \pi_{1}(a_{t} \mid s_{t}) - b_{2}^{t}(s_{t}) \pi_{2}(a_{t} \mid s_{t}) \Big] \mathcal{R}(s_{t},\ a_{t}) \right| \\[7mm]
&\le \sum_{t = 0}^{\infty} \gamma^{t} \sum_{s_{t}} \sum_{a_{t}} \Big| b_{1}^{t}(s_{t}) \pi_{1}(a_{t} \mid s_{t}) - b_{2}^{t}(s_{t}) \pi_{2}(a_{t} \mid s_{t}) \Big| \cdot \Big| \mathcal{R}(s_{t},\ a_{t}) \Big| \\[7mm]
&\le r_{\max} \sum_{t = 0}^{\infty} \gamma^{t} \sum_{s_{t}} \sum_{a_{t}} \Big| p_{1}^{t}(s_{t},\ a_{t}) - p_{2}^{t}(s_{t},\ a_{t}) \Big| = 2 r_{\max} \sum_{t = 0}^{\infty} \gamma^{t} D_{\mathrm{TV}} \Big( p_{1}^{t}(\cdot,\ \cdot)\ \|\ p_{2}^{t}(\cdot,\ \cdot) \Big) \\[7mm]
&\le 2 r_{\max} \sum_{t = 0}^{\infty} \gamma^{t} \left[ \max_{s_{t}} D_{\mathrm{TV}} \Big( \pi_{1}(\cdot \mid s_{t})\ \|\ \pi_{2}(\cdot \mid s_{t}) \Big) + D_{\mathrm{TV}} \Big( b_{1}^{t}(\cdot)\ \|\ b_{2}^{t}(\cdot) \Big) \right]
\end{aligned}
$$

Suppose the first term is bounded as $\max_{s_{t}} D_{\mathrm{TV}} \big( \pi_{1}(\cdot \mid s_{t})\ \|\ \pi_{2}(\cdot \mid s_{t}) \big) \le \epsilon_{\pi}$; the second term can then be bounded via Lemma ② as

$$
D_{\mathrm{TV}} \Big( b_{1}^{t}(\cdot)\ \|\ b_{2}^{t}(\cdot) \Big) \le t \max_{t} \mathcal{E}_{s_{t - 1} \sim b_{1}^{t - 1}(\cdot)} D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid s_{t - 1})\ \|\ p_{2}(\cdot \mid s_{t - 1}) \Big)
$$

where

$$
\begin{aligned}
&D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid s_{t - 1})\ \|\ p_{2}(\cdot \mid s_{t - 1}) \Big) \\[7mm]
=\ &\frac{1}{2} \sum_{s_{t}} \Big| p_{1}(s_{t} \mid s_{t - 1}) - p_{2}(s_{t} \mid s_{t - 1}) \Big| = \frac{1}{2} \sum_{s_{t}} \Big| \sum_{a_{t - 1}} p_{1}(s_{t},\ a_{t - 1} \mid s_{t - 1}) - \sum_{a_{t - 1}} p_{2}(s_{t},\ a_{t - 1} \mid s_{t - 1}) \Big| \\[7mm]
\le\ &\frac{1}{2} \sum_{s_{t}} \sum_{a_{t - 1}} \Big| p_{1}(s_{t},\ a_{t - 1} \mid s_{t - 1}) - p_{2}(s_{t},\ a_{t - 1} \mid s_{t - 1}) \Big| = D_{\mathrm{TV}} \Big( p_{1}(\cdot,\ \cdot \mid s_{t - 1})\ \|\ p_{2}(\cdot,\ \cdot \mid s_{t - 1}) \Big) \\[7mm]
\le\ &\mathcal{E}_{a_{t - 1} \sim \pi_{1}(\cdot \mid s_{t - 1})} D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid s_{t - 1},\ a_{t - 1})\ \|\ p_{2}(\cdot \mid s_{t - 1},\ a_{t - 1}) \Big) + D_{\mathrm{TV}} \Big( \pi_{1}(\cdot \mid s_{t - 1})\ \|\ \pi_{2}(\cdot \mid s_{t - 1}) \Big) \\[7mm]
\le\ &\mathcal{E}_{a_{t - 1} \sim \pi_{1}(\cdot \mid s_{t - 1})} D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid s_{t - 1},\ a_{t - 1})\ \|\ p_{2}(\cdot \mid s_{t - 1},\ a_{t - 1}) \Big) + \epsilon_{\pi}
\end{aligned}
$$

Suppose the expected total variation distance between the two dynamics models can be bounded as

$$
\max_{t} \mathcal{E}_{s_{t} \sim b_{1}^{t}(\cdot)} \mathcal{E}_{a_{t} \sim \pi_{1}(\cdot \mid s_{t})} D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid s_{t},\ a_{t})\ \|\ p_{2}(\cdot \mid s_{t},\ a_{t}) \Big) \le \epsilon_{m}
$$

then

$$
D_{\mathrm{TV}} \Big( b_{1}^{t}(\cdot)\ \|\ b_{2}^{t}(\cdot) \Big) \le t (\epsilon_{m} + \epsilon_{\pi})
$$

Thus, plugging this back in gives

$$
|\eta_{1} - \eta_{2}| \le 2 r_{\max} \sum_{t = 0}^{\infty} \gamma^{t} \Big[ \epsilon_{\pi} + t(\epsilon_{m} + \epsilon_{\pi}) \Big] = 2 r_{\max} \left[ \frac{1}{1 - \gamma} \epsilon_{\pi} + \frac{\gamma}{(1 - \gamma)^{2}} (\epsilon_{m} + \epsilon_{\pi}) \right]
$$
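The closed form follows from the standard geometric-series identities

$$
\sum_{t = 0}^{\infty} \gamma^{t} = \frac{1}{1 - \gamma}, \qquad \sum_{t = 0}^{\infty} t \gamma^{t} = \frac{\gamma}{(1 - \gamma)^{2}}
$$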

Lemma ④ Branched Rollout Occupancy Measurement TVD Bound

Run a branched rollout of length $k$ starting from a branch point at time $t - k$ and generate two trajectories through

| Pre-branch dynamics | Pre-branch policy | Post-branch dynamics | Post-branch policy |
| --- | --- | --- | --- |
| $p_{1}^{\mathrm{pre}}(s_{t + 1} \mid s_{t},\ a_{t})$ | $\pi_{1}^{\mathrm{pre}}(a_{t} \mid s_{t})$ | $p_{1}^{\mathrm{post}}(s_{t + 1} \mid s_{t},\ a_{t})$ | $\pi_{1}^{\mathrm{post}}(a_{t} \mid s_{t})$ |
| $p_{2}^{\mathrm{pre}}(s_{t + 1} \mid s_{t},\ a_{t})$ | $\pi_{2}^{\mathrm{pre}}(a_{t} \mid s_{t})$ | $p_{2}^{\mathrm{post}}(s_{t + 1} \mid s_{t},\ a_{t})$ | $\pi_{2}^{\mathrm{post}}(a_{t} \mid s_{t})$ |

For $t \ge k$, the total variation distance between the state-action marginals at time step $t$ is bounded as

$$
D_{\mathrm{TV}} \Big( p_{1}^{t}(\cdot,\ \cdot)\ \|\ p_{2}^{t}(\cdot,\ \cdot) \Big) \le \max_{s} D_{\mathrm{TV}} \Big( \pi_{1}^{\mathrm{post}}(\cdot \mid s)\ \|\ \pi_{2}^{\mathrm{post}}(\cdot \mid s) \Big) + D_{\mathrm{TV}} \Big( b_{1}^{t}(\cdot)\ \|\ b_{2}^{t}(\cdot) \Big)
$$

The first term is assumed to be bounded by $\epsilon_{\pi}^{\mathrm{post}}$, and the second term can be bounded by

$$
\epsilon_{t}^{\mathrm{post}} = D_{\mathrm{TV}} \Big( b_{1}^{t}(\cdot)\ \|\ b_{2}^{t}(\cdot) \Big) \le \epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}} + \epsilon_{t - 1}^{\mathrm{post}} \le \cdots \le k (\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \epsilon_{t - k}^{\mathrm{post}}
$$

where $\epsilon_{m}^{\mathrm{post}}$ bounds the expected TVD between the post-branch dynamics as

$$
\max_{t - k < \tau \le t} \mathcal{E}_{s_{\tau - 1} \sim b_{1}^{\tau - 1}(\cdot)} \mathcal{E}_{a_{\tau - 1} \sim \pi_{1}^{\mathrm{post}}(\cdot \mid s_{\tau - 1})} D_{\mathrm{TV}} \Big( p_{1}^{\mathrm{post}}(\cdot \mid s_{\tau - 1},\ a_{\tau - 1})\ \|\ p_{2}^{\mathrm{post}}(\cdot \mid s_{\tau - 1},\ a_{\tau - 1}) \Big) \le \epsilon_{m}^{\mathrm{post}}
$$

$\epsilon_{t - k}^{\mathrm{post}}$ can be further bounded analogously through the pre-branch error bounds

$$
\epsilon_{t - k}^{\mathrm{post}} = \epsilon_{t - k}^{\mathrm{pre}} = D_{\mathrm{TV}} \Big( b_{1}^{t - k}(\cdot)\ \|\ b_{2}^{t - k}(\cdot) \Big) \le \epsilon_{m}^{\mathrm{pre}} + \epsilon_{\pi}^{\mathrm{pre}} + \epsilon_{t - k - 1}^{\mathrm{pre}} \le \cdots \le (t - k)(\epsilon_{m}^{\mathrm{pre}} + \epsilon_{\pi}^{\mathrm{pre}}) + \epsilon_{0}^{\mathrm{pre}} = (t - k)(\epsilon_{m}^{\mathrm{pre}} + \epsilon_{\pi}^{\mathrm{pre}})
$$

where $\epsilon_{m}^{\mathrm{pre}}$ and $\epsilon_{\pi}^{\mathrm{pre}}$ are defined analogously to $\epsilon_{m}^{\mathrm{post}}$ and $\epsilon_{\pi}^{\mathrm{post}}$, respectively. The original inequality can be rewritten as

$$
D_{\mathrm{TV}} \Big( p_{1}^{t}(\cdot,\ \cdot)\ \|\ p_{2}^{t}(\cdot,\ \cdot) \Big) \le (t - k) (\epsilon_{m}^{\mathrm{pre}} + \epsilon_{\pi}^{\mathrm{pre}}) + k (\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \epsilon_{\pi}^{\mathrm{post}}
$$

For $t < k$, the trajectories are generated entirely by the post-branch dynamics and policy, so

$$
D_{\mathrm{TV}} \Big( p_{1}^{t}(\cdot,\ \cdot)\ \|\ p_{2}^{t}(\cdot,\ \cdot) \Big) \le t (\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \epsilon_{\pi}^{\mathrm{post}} \le k (\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \epsilon_{\pi}^{\mathrm{post}}
$$

The total variation distance between the occupancy measures derived from the above state-action marginals is bounded as

$$
\begin{aligned}
&D_{\mathrm{TV}} \Big( \rho_{1}(\cdot,\ \cdot)\ \|\ \rho_{2}(\cdot,\ \cdot) \Big) = \frac{1}{2} \sum_{s} \sum_{a} \left| (1 - \gamma) \sum_{t = 0}^{\infty} \gamma^{t} \Big[ p_{1}^{t}(s,\ a) - p_{2}^{t}(s,\ a) \Big] \right| \\[7mm]
\le\ &(1 - \gamma) \left[ \sum_{t = 0}^{k - 1} \gamma^{t} D_{\mathrm{TV}} \Big( p_{1}^{t}(\cdot,\ \cdot)\ \|\ p_{2}^{t}(\cdot,\ \cdot) \Big) + \sum_{t = k}^{\infty} \gamma^{t} D_{\mathrm{TV}} \Big( p_{1}^{t}(\cdot,\ \cdot)\ \|\ p_{2}^{t}(\cdot,\ \cdot) \Big) \right] \\[7mm]
\le\ &(1 - \gamma) \left[ \sum_{t = 0}^{k - 1} \gamma^{t} \Big[ k (\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \epsilon_{\pi}^{\mathrm{post}} \Big] + \sum_{t = k}^{\infty} \gamma^{t} \Big[ (t - k) (\epsilon_{m}^{\mathrm{pre}} + \epsilon_{\pi}^{\mathrm{pre}}) + k (\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \epsilon_{\pi}^{\mathrm{post}} \Big] \right] \\[7mm]
=\ &(1 - \gamma) \left[ \underset{t < k}{\underbrace{\frac{1 - \gamma^{k}}{1 - \gamma} k (\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \frac{1 - \gamma^{k}}{1 - \gamma} \epsilon_{\pi}^{\mathrm{post}}}} + \underset{t \ge k}{\underbrace{\frac{\gamma^{k + 1}}{(1 - \gamma)^{2}} (\epsilon_{m}^{\mathrm{pre}} + \epsilon_{\pi}^{\mathrm{pre}}) + \frac{\gamma^{k}}{1 - \gamma} k (\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \frac{\gamma^{k}}{1 - \gamma} \epsilon_{\pi}^{\mathrm{post}}}} \right]
\end{aligned}
$$
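The two sums are evaluated with the finite and shifted geometric-series identities

$$
\sum_{t = 0}^{k - 1} \gamma^{t} = \frac{1 - \gamma^{k}}{1 - \gamma}, \qquad \sum_{t = k}^{\infty} \gamma^{t} = \frac{\gamma^{k}}{1 - \gamma}, \qquad \sum_{t = k}^{\infty} (t - k) \gamma^{t} = \gamma^{k} \sum_{j = 0}^{\infty} j \gamma^{j} = \frac{\gamma^{k + 1}}{(1 - \gamma)^{2}}
$$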

Combining like terms gives the final result

$$
D_{\mathrm{TV}} \Big( \rho_{1}(\cdot,\ \cdot)\ \|\ \rho_{2}(\cdot,\ \cdot) \Big) \le k(\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \epsilon_{\pi}^{\mathrm{post}} + \frac{\gamma^{k + 1}}{1 - \gamma} (\epsilon_{m}^{\mathrm{pre}} + \epsilon_{\pi}^{\mathrm{pre}})
$$

