MBPO
Complete Rollout
Suppose the expected TVD between two dynamics p p p and p ^ \hat{p} p ^ under data-collecting policy π D \pi_{D} π D is bounded as
max t E s t ∼ b D t ( ⋅ ) E a t ∼ π D ( ⋅ ∣ s t ) D T V ( p ( ⋅ ∣ s t , a t ) ∥ p ^ ( ⋅ ∣ s t , a t ) ) ≤ ϵ m \max_{t} \mathcal{E}_{s_{t} \sim b_{D}^{t}(\cdot)} \mathcal{E}_{a_{t} \sim \pi_{D}(\cdot \mid s_{t})} D_{\mathrm{TV}} \Big( p(\cdot \mid s_{t},\ a_{t})\ \|\ \hat{p}(\cdot \mid s_{t},\ a_{t}) \Big) \le \epsilon_{m}
t max E s t ∼ b D t ( ⋅ ) E a t ∼ π D ( ⋅ ∣ s t ) D T V ( p ( ⋅ ∣ s t , a t ) ∥ p ^ ( ⋅ ∣ s t , a t ) ) ≤ ϵ m
and the policy shift between data-collecting policy π D \pi_{D} π D and new policy π \pi π is bounded as
max s D T V ( π D ( ⋅ ∣ s ) ∥ π ( ⋅ ∣ s ) ) ≤ ϵ π \max_{s} D_{\mathrm{TV}} \Big( \pi_{D}(\cdot \mid s)\ \|\ \pi(\cdot \mid s) \Big) \le \epsilon_{\pi}
s max D T V ( π D ( ⋅ ∣ s ) ∥ π ( ⋅ ∣ s ) ) ≤ ϵ π
Then the difference between returns can be bounded through Lemma ③
η [ π ] − η ^ [ π ] = η [ π ] − η [ π D ] ⏟ + η [ π D ] − η ^ [ π ] ⏟ ≥ − 2 r max [ 1 1 − γ ϵ π + γ ( 1 − γ ) 2 ϵ π ] − 2 r max [ 1 1 − γ ϵ π + γ ( 1 − γ ) 2 ( ϵ m + ϵ π ) ] \eta[\pi] - \hat{\eta}[\pi] = \underbrace{\eta[\pi] - \eta[\pi_{D}]} + \underbrace{\eta[\pi_{D}] - \hat{\eta}[\pi]} \ge -2 r_{\max} \left[ \frac{1}{1 - \gamma} \epsilon_{\pi} + \frac{\gamma}{(1 - \gamma)^{2}} \epsilon_{\pi} \right] - 2 r_{\max} \left[ \frac{1}{1 - \gamma} \epsilon_{\pi} + \frac{\gamma}{(1 - \gamma)^{2}} (\epsilon_{m} + \epsilon_{\pi}) \right]
η [ π ] − η ^ [ π ] = η [ π ] − η [ π D ] + η [ π D ] − η ^ [ π ] ≥ − 2 r m a x [ 1 − γ 1 ϵ π + ( 1 − γ ) 2 γ ϵ π ] − 2 r m a x [ 1 − γ 1 ϵ π + ( 1 − γ ) 2 γ ( ϵ m + ϵ π ) ]
Thus
η [ π ] ≥ η ^ [ π ] − 2 r max [ γ ( 1 − γ ) 2 ( ϵ m + 2 ϵ π ) − 2 1 − γ ϵ π ] \eta[\pi] \ge \hat{\eta}[\pi] - 2 r_{\max} \left[ \frac{\gamma}{(1 - \gamma)^{2}} (\epsilon_{m} + 2 \epsilon_{\pi}) - \frac{2}{1 - \gamma} \epsilon_{\pi} \right]
η [ π ] ≥ η ^ [ π ] − 2 r m a x [ ( 1 − γ ) 2 γ ( ϵ m + 2 ϵ π ) − 1 − γ 2 ϵ π ]
Branched Rollout
Under the branched rollouts scheme with a branch length of k k k , consider three trajectories generated through
Returns
Pre.Dynamics
Pre.Policy
Post.Dynamics
Post.Policy
η [ π ] \eta[\pi] η [ π ]
p ( s t + 1 ∣ s t , a t ) p(s_{t + 1} \mid s_{t},\ a_{t}) p ( s t + 1 ∣ s t , a t )
π ( a t ∣ s t ) \pi(a_{t} \mid s_{t}) π ( a t ∣ s t )
p ( s t + 1 ∣ s t , a t ) p(s_{t + 1} \mid s_{t},\ a_{t}) p ( s t + 1 ∣ s t , a t )
π ( a t ∣ s t ) \pi(a_{t} \mid s_{t}) π ( a t ∣ s t )
η b r a n c h [ π D , π ] \eta^{\mathrm{branch}}[\pi_{D},\ \pi] η b r a n c h [ π D , π ]
p ( s t + 1 ∣ s t , a t ) p(s_{t + 1} \mid s_{t},\ a_{t}) p ( s t + 1 ∣ s t , a t )
π D ( a t ∣ s t ) \pi_{D}(a_{t} \mid s_{t}) π D ( a t ∣ s t )
p ^ ( s t + 1 ∣ s t , a t ) \hat{p}(s_{t + 1} \mid s_{t},\ a_{t}) p ^ ( s t + 1 ∣ s t , a t )
π ( a t ∣ s t ) \pi(a_{t} \mid s_{t}) π ( a t ∣ s t )
η [ π D , π ] \eta[\pi_{D},\ \pi] η [ π D , π ]
p ( s t + 1 ∣ s t , a t ) p(s_{t + 1} \mid s_{t},\ a_{t}) p ( s t + 1 ∣ s t , a t )
π D ( a t ∣ s t ) \pi_{D}(a_{t} \mid s_{t}) π D ( a t ∣ s t )
p ( s t + 1 ∣ s t , a t ) p(s_{t + 1} \mid s_{t},\ a_{t}) p ( s t + 1 ∣ s t , a t )
π ( a t ∣ s t ) \pi(a_{t} \mid s_{t}) π ( a t ∣ s t )
Suppose the expected TVD between two dynamics p p p and p ^ \hat{p} p ^ under new policy π \pi π is bounded as
max t E s t ∼ b t ( ⋅ ) E a t ∼ π ( ⋅ ∣ s t ) D T V ( p ( ⋅ ∣ s t , a t ) ∥ p ^ ( ⋅ ∣ s t , a t ) ) ≤ ϵ m ′ \max_{t} \mathcal{E}_{s_{t} \sim b^{t}(\cdot)} \mathcal{E}_{a_{t} \sim \pi(\cdot \mid s_{t})} D_{\mathrm{TV}} \Big( p(\cdot \mid s_{t},\ a_{t})\ \|\ \hat{p}(\cdot \mid s_{t},\ a_{t}) \Big) \le \epsilon_{m'}
t max E s t ∼ b t ( ⋅ ) E a t ∼ π ( ⋅ ∣ s t ) D T V ( p ( ⋅ ∣ s t , a t ) ∥ p ^ ( ⋅ ∣ s t , a t ) ) ≤ ϵ m ′
and the policy shift between data-collecting policy π D \pi_{D} π D and new policy π \pi π is bounded as
max s D T V ( π D ( ⋅ ∣ s ) ∥ π ( ⋅ ∣ s ) ) ≤ ϵ π \max_{s} D_{\mathrm{TV}} \Big( \pi_{D}(\cdot \mid s)\ \|\ \pi(\cdot \mid s) \Big) \le \epsilon_{\pi}
s max D T V ( π D ( ⋅ ∣ s ) ∥ π ( ⋅ ∣ s ) ) ≤ ϵ π
The difference between returns derived from the occupancy measures can be bounded through Lemma ④
η [ π ] − η b r a n c h [ π D , π ] = η [ π ] − η [ π D , π ] ⏟ + η [ π D , π ] − η b r a n c h [ π ] ⏟ ≥ − 2 r max γ k + 1 ( 1 − γ ) 2 ϵ π − 2 r max k 1 − γ ϵ m ′ \eta[\pi] - \eta^{\mathrm{branch}}[\pi_{D},\ \pi] = \underbrace{\eta[\pi] - \eta[\pi_{D},\ \pi]} + \underbrace{\eta[\pi_{D},\ \pi] - \eta^{\mathrm{branch}}[\pi]} \ge -2 r_{\max} \frac{\gamma^{k + 1}}{(1 - \gamma)^{2}} \epsilon_{\pi} - 2 r_{\max} \frac{k}{1 - \gamma} \epsilon_{m'}
η [ π ] − η b r a n c h [ π D , π ] = η [ π ] − η [ π D , π ] + η [ π D , π ] − η b r a n c h [ π ] ≥ − 2 r m a x ( 1 − γ ) 2 γ k + 1 ϵ π − 2 r m a x 1 − γ k ϵ m ′
Thus
η [ π ] ≥ η b r a n c h [ π D , π ] − 2 r max [ γ k + 1 ( 1 − γ ) 2 ϵ π + k 1 − γ ϵ m ′ ] \eta[\pi] \ge \eta^{\mathrm{branch}}[\pi_{D},\ \pi] - 2 r_{\max} \left[ \frac{\gamma^{k + 1}}{(1 - \gamma)^{2}} \epsilon_{\pi} + \frac{k}{1 - \gamma} \epsilon_{m'} \right]
η [ π ] ≥ η b r a n c h [ π D , π ] − 2 r m a x [ ( 1 − γ ) 2 γ k + 1 ϵ π + 1 − γ k ϵ m ′ ]
When ϵ m ′ \epsilon_{m'} ϵ m ′ is sufficiently low, the optimal rollout length statisfies k ⋆ = arg min k [ γ k + 1 ( 1 − γ ) 2 ϵ π + k 1 − γ ϵ m ′ ] ≥ 0 k^{\star} = \argmin_{k} \left[ \dfrac{\gamma^{k + 1}}{(1 - \gamma)^{2}} \epsilon_{\pi} + \dfrac{k}{1 - \gamma} \epsilon_{m'} \right] \ge 0 k ⋆ = a r g m i n k [ ( 1 − γ ) 2 γ k + 1 ϵ π + 1 − γ k ϵ m ′ ] ≥ 0
MBPO with DRL
The theoretical results suggest that a method should make use of truncated but nonzero-length model rollouts
Predictive Model
Use a bootstrap ensemble of dynamics models { p θ 1 , p θ 2 , ⋯ , p θ B } \{ p_{\theta}^{1},\ p_{\theta}^{2},\ \cdots,\ p_{\theta}^{B} \} { p θ 1 , p θ 2 , ⋯ , p θ B } , where
p θ i ( s ′ , r ∣ s , a ) = N [ μ θ i ( s , a ) , Σ θ i ( s , a ) = d i a g θ i ( s , a ) ] p_{\theta}^{i}(s',\ r \mid s,\ a) = \mathcal{N} \Big[ \mu_{\theta}^{i}(s,\ a),\ \Sigma_{\theta}^{i} (s,\ a) = \mathrm{diag}_{\theta}^{i}(s,\ a) \Big]
p θ i ( s ′ , r ∣ s , a ) = N [ μ θ i ( s , a ) , Σ θ i ( s , a ) = d i a g θ i ( s , a ) ]
Policy Optimization
Use SAC as policy optimization algorithm, which trains an actor π ϕ \pi_{\phi} π ϕ by minimizing the expected KL-divergence
min ϕ J π ( ϕ ; D ) = E s ∼ D D K L [ π ∣ ∣ exp ( Q π − V π ) ] \min_{\phi} J_{\pi}(\phi;\ \mathcal{D}) = \mathcal{E}_{s \sim \mathcal{D}} D_\mathrm{KL} \Big[ \pi \mid \mid \exp(Q^{\pi} - V^{\pi}) \Big]
ϕ min J π ( ϕ ; D ) = E s ∼ D D K L [ π ∣ ∣ exp ( Q π − V π ) ]
Model Usage
Branching replaces few long rollouts from the initial state distribution with many short rollouts starting from replay buffer states, which effectively relieves the limitation caused by compounding model errors
Useful Lemma
Lemma ① Joint Distribution TVD Bound
For two joint distribution p 1 ( x , y ) p_{1}(x,\ y) p 1 ( x , y ) and p 2 ( x , y ) p_{2}(x,\ y) p 2 ( x , y ) , the total variance distance of them can be bounded as
D T V ( p 1 ( ⋅ , ⋅ ) ∥ p 2 ( ⋅ , ⋅ ) ) = 1 2 ∑ x ∑ y ∣ p 1 ( x , y ) − p 2 ( x , y ) ∣ = 1 2 ∑ x ∑ y ∣ p 1 ( y ∣ x ) p 1 ( x ) − p 2 ( y ∣ x ) p 2 ( x ) ∣ = 1 2 ∑ x ∑ y ∣ p 1 ( y ∣ x ) p 1 ( x ) − p 2 ( y ∣ x ) p 1 ( x ) + p 2 ( y ∣ x ) p 1 ( x ) − p 2 ( y ∣ x ) p 2 ( x ) ∣ ≤ 1 2 ∑ x ∑ y p 1 ( x ) ∣ p 1 ( y ∣ x ) − p 2 ( y ∣ x ) ∣ + 1 2 ∑ x ∑ y p 2 ( y ∣ x ) ∣ p 1 ( x ) − p 2 ( x ) ∣ = ∑ x p 1 ( x ) 1 2 ∑ y ∣ p 1 ( y ∣ x ) − p 2 ( y ∣ x ) ∣ + 1 2 ∑ x ∣ p 1 ( x ) − p 2 ( x ) ∣ ∑ y p 2 ( y ∣ x ) = E x ∼ p 1 ( ⋅ ) D T V ( p 1 ( ⋅ ∣ x ) ∥ p 2 ( ⋅ ∣ x ) ) + D T V ( p 1 ( ⋅ ) ∥ p 2 ( ⋅ ) ) ≤ max x D T V ( p 1 ( ⋅ ∣ x ) ∥ p 2 ( ⋅ ∣ x ) ) + D T V ( p 1 ( ⋅ ) ∥ p 2 ( ⋅ ) ) \begin{aligned}
D_{\mathrm{TV}} \Big( p_{1}(\cdot,\ \cdot)\ \|\ p_{2}(\cdot,\ \cdot) \Big) &= \frac{1}{2} \sum_{x} \sum_{y} \Big| p_{1}(x,\ y) - p_{2}(x,\ y) \Big| = \frac{1}{2} \sum_{x} \sum_{y} \Big| p_{1}(y \mid x) p_{1}(x) - p_{2}(y \mid x) p_{2}(x) \Big| \\[7mm]
&= \frac{1}{2} \sum_{x} \sum_{y} \Big| p_{1}(y \mid x) p_{1}(x) - p_{2}(y \mid x) p_{1}(x) + p_{2}(y \mid x) p_{1}(x) - p_{2}(y \mid x) p_{2}(x) \Big| \\[7mm]
&\le \frac{1}{2} \sum_{x} \sum_{y} p_{1}(x) \Big| p_{1}(y \mid x) - p_{2}(y \mid x) \Big| + \frac{1}{2} \sum_{x} \sum_{y} p_{2}(y \mid x) \Big| p_{1}(x) - p_{2}(x) \Big| \\[7mm]
&= \sum_{x} p_{1}(x) \frac{1}{2} \sum_{y} \Big| p_{1}(y \mid x) - p_{2}(y \mid x) \Big| + \frac{1}{2} \sum_{x} \Big| p_{1}(x) - p_{2}(x) \Big| \sum_{y} p_{2}(y \mid x) \\[7mm]
&= \mathcal{E}_{x \sim p_{1}(\cdot)} D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid x)\ \|\ p_{2}(\cdot \mid x) \Big) + D_{\mathrm{TV}} \Big( p_{1}(\cdot)\ \|\ p_{2}(\cdot) \Big) \\[7mm]
&\le \max_{x} D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid x)\ \|\ p_{2}(\cdot \mid x) \Big) + D_{\mathrm{TV}} \Big( p_{1}(\cdot)\ \|\ p_{2}(\cdot) \Big)
\end{aligned}
D T V ( p 1 ( ⋅ , ⋅ ) ∥ p 2 ( ⋅ , ⋅ ) ) = 2 1 x ∑ y ∑ ∣ ∣ ∣ ∣ p 1 ( x , y ) − p 2 ( x , y ) ∣ ∣ ∣ ∣ = 2 1 x ∑ y ∑ ∣ ∣ ∣ ∣ p 1 ( y ∣ x ) p 1 ( x ) − p 2 ( y ∣ x ) p 2 ( x ) ∣ ∣ ∣ ∣ = 2 1 x ∑ y ∑ ∣ ∣ ∣ ∣ p 1 ( y ∣ x ) p 1 ( x ) − p 2 ( y ∣ x ) p 1 ( x ) + p 2 ( y ∣ x ) p 1 ( x ) − p 2 ( y ∣ x ) p 2 ( x ) ∣ ∣ ∣ ∣ ≤ 2 1 x ∑ y ∑ p 1 ( x ) ∣ ∣ ∣ ∣ p 1 ( y ∣ x ) − p 2 ( y ∣ x ) ∣ ∣ ∣ ∣ + 2 1 x ∑ y ∑ p 2 ( y ∣ x ) ∣ ∣ ∣ ∣ p 1 ( x ) − p 2 ( x ) ∣ ∣ ∣ ∣ = x ∑ p 1 ( x ) 2 1 y ∑ ∣ ∣ ∣ ∣ p 1 ( y ∣ x ) − p 2 ( y ∣ x ) ∣ ∣ ∣ ∣ + 2 1 x ∑ ∣ ∣ ∣ ∣ p 1 ( x ) − p 2 ( x ) ∣ ∣ ∣ ∣ y ∑ p 2 ( y ∣ x ) = E x ∼ p 1 ( ⋅ ) D T V ( p 1 ( ⋅ ∣ x ) ∥ p 2 ( ⋅ ∣ x ) ) + D T V ( p 1 ( ⋅ ) ∥ p 2 ( ⋅ ) ) ≤ x max D T V ( p 1 ( ⋅ ∣ x ) ∥ p 2 ( ⋅ ∣ x ) ) + D T V ( p 1 ( ⋅ ) ∥ p 2 ( ⋅ ) )
Lemma ② Markov Chain TVD Bound
For two Markov Chain p 1 ( s t + 1 ∣ s t ) p_{1}(s_{t + 1} \mid s_{t}) p 1 ( s t + 1 ∣ s t ) and p 2 ( s t + 1 ∣ s t ) p_{2}(s_{t + 1} \mid s_{t}) p 2 ( s t + 1 ∣ s t ) with the same initial distribution p 1 0 ( s 0 ) = p 2 0 ( s 0 ) p_{1}^{0}(s_{0}) = p_{2}^{0}(s_{0}) p 1 0 ( s 0 ) = p 2 0 ( s 0 )
∣ p 1 t ( s t ) − p 2 t ( s t ) ∣ = ∣ ∑ s t − 1 p 1 t − 1 ( s t − 1 ) p 1 ( s t ∣ s t − 1 ) − ∑ s t − 1 p 2 t − 1 ( s t − 1 ) p 2 ( s t ∣ s t − 1 ) ∣ ≤ ∑ s t − 1 p 1 t − 1 ( s t − 1 ) ∣ p 1 ( s t ∣ s t − 1 ) − p 2 ( s t ∣ s t − 1 ) ∣ + ∑ s t − 1 p 2 ( s t ∣ s t − 1 ) ∣ p 1 t − 1 ( s t − 1 ) − p 2 t − 1 ( s t − 1 ) ∣ = E s t − 1 ∼ p 1 t − 1 ( ⋅ ) ∣ p 1 ( s t ∣ s t − 1 ) − p 2 ( s t ∣ s t − 1 ) ∣ + ∑ s t − 1 p 2 ( s t ∣ s t − 1 ) ∣ p 1 t − 1 ( s t − 1 ) − p 2 t − 1 ( s t − 1 ) ∣ \begin{aligned}
\Big| p_{1}^{t}(s_{t}) - p_{2}^{t}(s_{t}) \Big| &= \Big| \sum_{s_{t - 1}} p_{1}^{t - 1}(s_{t - 1}) p_{1}(s_{t} \mid s_{t - 1}) - \sum_{s_{t - 1}} p_{2}^{t - 1}(s_{t - 1}) p_{2}(s_{t} \mid s_{t - 1}) \Big| \\[7mm]
&\le \sum_{s_{t - 1}} p_{1}^{t - 1}(s_{t - 1}) \Big| p_{1}(s_{t} \mid s_{t - 1}) - p_{2}(s_{t} \mid s_{t - 1}) \Big| + \sum_{s_{t - 1}} p_{2}(s_{t} \mid s_{t - 1}) \Big| p_{1}^{t - 1}(s_{t - 1}) - p_{2}^{t - 1}(s_{t - 1}) \Big| \\[7mm]
&= \mathcal{E}_{s_{t - 1} \sim p_{1}^{t - 1}(\cdot)} \Big| p_{1}(s_{t} \mid s_{t - 1}) - p_{2}(s_{t} \mid s_{t - 1}) \Big| + \sum_{s_{t - 1}} p_{2}(s_{t} \mid s_{t - 1}) \Big| p_{1}^{t - 1}(s_{t - 1}) - p_{2}^{t - 1}(s_{t - 1}) \Big|
\end{aligned}
∣ ∣ ∣ ∣ p 1 t ( s t ) − p 2 t ( s t ) ∣ ∣ ∣ ∣ = ∣ ∣ ∣ ∣ s t − 1 ∑ p 1 t − 1 ( s t − 1 ) p 1 ( s t ∣ s t − 1 ) − s t − 1 ∑ p 2 t − 1 ( s t − 1 ) p 2 ( s t ∣ s t − 1 ) ∣ ∣ ∣ ∣ ≤ s t − 1 ∑ p 1 t − 1 ( s t − 1 ) ∣ ∣ ∣ ∣ p 1 ( s t ∣ s t − 1 ) − p 2 ( s t ∣ s t − 1 ) ∣ ∣ ∣ ∣ + s t − 1 ∑ p 2 ( s t ∣ s t − 1 ) ∣ ∣ ∣ ∣ p 1 t − 1 ( s t − 1 ) − p 2 t − 1 ( s t − 1 ) ∣ ∣ ∣ ∣ = E s t − 1 ∼ p 1 t − 1 ( ⋅ ) ∣ ∣ ∣ ∣ p 1 ( s t ∣ s t − 1 ) − p 2 ( s t ∣ s t − 1 ) ∣ ∣ ∣ ∣ + s t − 1 ∑ p 2 ( s t ∣ s t − 1 ) ∣ ∣ ∣ ∣ p 1 t − 1 ( s t − 1 ) − p 2 t − 1 ( s t − 1 ) ∣ ∣ ∣ ∣
Then the total variance distance of the state marginal distribution is bounded as
ϵ t = D T V ( p 1 t ( ⋅ ) ∥ p 2 t ( ⋅ ) ) = 1 2 ∑ s t ∣ p 1 t ( s t ) − p 2 t ( s t ) ∣ ≤ E s t − 1 ∼ p 1 t − 1 ( ⋅ ) 1 2 ∑ s t ∣ p 1 ( s t ∣ s t − 1 ) − p 2 ( s t ∣ s t − 1 ) ∣ + 1 2 ∑ s t − 1 ∣ p 1 t − 1 ( s t − 1 ) − p 2 t − 1 ( s t − 1 ) ∣ ∑ s t p 2 ( s t ∣ s t − 1 ) = E s t − 1 ∼ p 1 t − 1 ( ⋅ ) D T V ( p 1 t ( ⋅ ∣ s t − 1 ) ∥ p 2 t ( ⋅ ∣ s t − 1 ) ) + D T V ( p 1 t − 1 ( ⋅ ) ∥ p 2 t − 1 ( ⋅ ) ) = δ t + ϵ t − 1 = δ t + δ t − 1 + ϵ t − 2 = ⋯ = ∑ τ = 1 t δ τ + ϵ 0 = ∑ τ = 1 t δ τ ≤ t δ \begin{aligned}
\epsilon_{t} &= D_{\mathrm{TV}} \Big( p_{1}^{t}(\cdot)\ \|\ p_{2}^{t}(\cdot) \Big) = \frac{1}{2} \sum_{s_{t}} \Big| p_{1}^{t}(s_{t}) - p_{2}^{t}(s_{t}) \Big| \\[7mm]
&\le \mathcal{E}_{s_{t - 1} \sim p_{1}^{t - 1}(\cdot)} \frac{1}{2} \sum_{s_{t}} \Big| p_{1}(s_{t} \mid s_{t - 1}) - p_{2}(s_{t} \mid s_{t - 1}) \Big| + \frac{1}{2} \sum_{s_{t - 1}} \Big| p_{1}^{t - 1}(s_{t - 1}) - p_{2}^{t - 1}(s_{t - 1}) \Big| \sum_{s_{t}} p_{2}(s_{t} \mid s_{t - 1}) \\[7mm]
&= \mathcal{E}_{s_{t - 1} \sim p_{1}^{t - 1}(\cdot)} D_{\mathrm{TV}} \Big( p_{1}^{t}(\cdot \mid s_{t - 1})\ \|\ p_{2}^{t}(\cdot \mid s_{t - 1}) \Big) + D_{\mathrm{TV}} \Big( p_{1}^{t - 1}(\cdot)\ \|\ p_{2}^{t - 1}(\cdot) \Big) \\[7mm]
&= \delta_{t} + \epsilon_{t - 1} = \delta_{t} + \delta_{t - 1} + \epsilon_{t - 2} = \cdots = \sum_{\tau = 1}^{t} \delta_{\tau} + \epsilon_{0} = \sum_{\tau = 1}^{t} \delta_{\tau} \le t \delta
\end{aligned}
ϵ t = D T V ( p 1 t ( ⋅ ) ∥ p 2 t ( ⋅ ) ) = 2 1 s t ∑ ∣ ∣ ∣ ∣ p 1 t ( s t ) − p 2 t ( s t ) ∣ ∣ ∣ ∣ ≤ E s t − 1 ∼ p 1 t − 1 ( ⋅ ) 2 1 s t ∑ ∣ ∣ ∣ ∣ p 1 ( s t ∣ s t − 1 ) − p 2 ( s t ∣ s t − 1 ) ∣ ∣ ∣ ∣ + 2 1 s t − 1 ∑ ∣ ∣ ∣ ∣ p 1 t − 1 ( s t − 1 ) − p 2 t − 1 ( s t − 1 ) ∣ ∣ ∣ ∣ s t ∑ p 2 ( s t ∣ s t − 1 ) = E s t − 1 ∼ p 1 t − 1 ( ⋅ ) D T V ( p 1 t ( ⋅ ∣ s t − 1 ) ∥ p 2 t ( ⋅ ∣ s t − 1 ) ) + D T V ( p 1 t − 1 ( ⋅ ) ∥ p 2 t − 1 ( ⋅ ) ) = δ t + ϵ t − 1 = δ t + δ t − 1 + ϵ t − 2 = ⋯ = τ = 1 ∑ t δ τ + ϵ 0 = τ = 1 ∑ t δ τ ≤ t δ
where δ t \delta_{t} δ t is assumed to be upper bounded by δ \delta δ
max t δ t = max t δ t + 1 = max t E s t ∼ p 1 t ( ⋅ ) D T V ( p 1 t ( ⋅ ∣ s t ) ∥ p 2 t ( ⋅ ∣ s t ) ) ≤ δ \max_{t} \delta_{t} = \max_{t} \delta_{t + 1} = \max_{t} \mathcal{E}_{s_{t} \sim p_{1}^{t}(\cdot)} D_{\mathrm{TV}} \Big( p_{1}^{t}(\cdot \mid s_{t})\ \|\ p_{2}^{t}(\cdot \mid s_{t}) \Big) \le \delta
t max δ t = t max δ t + 1 = t max E s t ∼ p 1 t ( ⋅ ) D T V ( p 1 t ( ⋅ ∣ s t ) ∥ p 2 t ( ⋅ ∣ s t ) ) ≤ δ
Lemma ③ Model Returns Bound
For two dynamics model p 1 ( s t + 1 ∣ s t , a t ) p_{1}(s_{t + 1} \mid s_{t},\ a_{t}) p 1 ( s t + 1 ∣ s t , a t ) , p 2 ( s t + 1 ∣ s t , a t ) p_{2}(s_{t + 1} \mid s_{t},\ a_{t}) p 2 ( s t + 1 ∣ s t , a t ) and their corresponding policy π 1 ( a t ∣ s t ) \pi_{1}(a_{t} \mid s_{t}) π 1 ( a t ∣ s t ) , π 2 ( a t ∣ s t ) \pi_{2}(a_{t} \mid s_{t}) π 2 ( a t ∣ s t )
∣ η 1 − η 2 ∣ = ∣ ∑ t = 0 ∞ γ t ∑ s t ∑ a t [ b 1 t ( s t ) π 1 ( a t ∣ s t ) − b 2 t ( s t ) π 2 ( a t ∣ s t ) ] R ( s t , a t ) ∣ ≤ ∑ t = 0 ∞ γ t ∑ s t ∑ a t ∣ b 1 t ( s t ) π 1 ( a t ∣ s t ) − b 2 t ( s t ) π 2 ( a t ∣ s t ) ∣ ⋅ ∣ R ( s t , a t ) ∣ ≤ r max ∑ t = 0 ∞ γ t ∑ s t ∑ a t ∣ p 1 t ( s t , a t ) − p 2 t ( s t , a t ) ∣ = 2 r max ∑ t = 0 ∞ γ t D T V ( p 1 t ( ⋅ , ⋅ ) ∥ p 2 t ( ⋅ , ⋅ ) ) ≤ 2 r max ∑ t = 0 ∞ γ t [ max s t D T V ( π 1 ( ⋅ ∣ s t ) ∥ π 2 ( ⋅ ∣ s t ) ) + D T V ( b 1 t ( ⋅ ) ∥ b 2 t ( ⋅ ) ) ] \begin{aligned}
|\eta_{1} - \eta_{2}| &= \left| \sum_{t = 0}^{\infty} \gamma^{t} \sum_{s_{t}} \sum_{a_{t}} \Big[ b_{1}^{t}(s_{t}) \pi_{1}(a_{t} \mid s_{t}) - b_{2}^{t}(s_{t}) \pi_{2}(a_{t} \mid s_{t}) \Big] \mathcal{R}(s_{t},\ a_{t}) \right| \\[7mm]
&\le \sum_{t = 0}^{\infty} \gamma^{t} \sum_{s_{t}} \sum_{a_{t}} \Big| b_{1}^{t}(s_{t}) \pi_{1}(a_{t} \mid s_{t}) - b_{2}^{t}(s_{t}) \pi_{2}(a_{t} \mid s_{t}) \Big| \cdot \Big| \mathcal{R}(s_{t},\ a_{t}) \Big| \\[7mm]
&\le r_{\max} \sum_{t = 0}^{\infty} \gamma^{t} \sum_{s_{t}} \sum_{a_{t}} \Big| p_{1}^{t}(s_{t},\ a_{t}) - p_{2}^{t}(s_{t},\ a_{t}) \Big| = 2 r_{\max} \sum_{t = 0}^{\infty} \gamma^{t} D_{\mathrm{TV}} \Big( p_{1}^{t}(\cdot,\ \cdot)\ \|\ p_{2}^{t}(\cdot,\ \cdot) \Big) \\[7mm]
&\le 2 r_{\max} \sum_{t = 0}^{\infty} \gamma^{t} \left[ \max_{s_{t}} D_{\mathrm{TV}} \Big( \pi_{1}(\cdot \mid s_{t})\ \|\ \pi_{2}(\cdot \mid s_{t}) \Big) + D_{\mathrm{TV}} \Big( b_{1}^{t}(\cdot)\ \|\ b_{2}^{t}(\cdot) \Big) \right]
\end{aligned}
∣ η 1 − η 2 ∣ = ∣ ∣ ∣ ∣ ∣ ∣ t = 0 ∑ ∞ γ t s t ∑ a t ∑ [ b 1 t ( s t ) π 1 ( a t ∣ s t ) − b 2 t ( s t ) π 2 ( a t ∣ s t ) ] R ( s t , a t ) ∣ ∣ ∣ ∣ ∣ ∣ ≤ t = 0 ∑ ∞ γ t s t ∑ a t ∑ ∣ ∣ ∣ ∣ b 1 t ( s t ) π 1 ( a t ∣ s t ) − b 2 t ( s t ) π 2 ( a t ∣ s t ) ∣ ∣ ∣ ∣ ⋅ ∣ ∣ ∣ ∣ R ( s t , a t ) ∣ ∣ ∣ ∣ ≤ r m a x t = 0 ∑ ∞ γ t s t ∑ a t ∑ ∣ ∣ ∣ ∣ p 1 t ( s t , a t ) − p 2 t ( s t , a t ) ∣ ∣ ∣ ∣ = 2 r m a x t = 0 ∑ ∞ γ t D T V ( p 1 t ( ⋅ , ⋅ ) ∥ p 2 t ( ⋅ , ⋅ ) ) ≤ 2 r m a x t = 0 ∑ ∞ γ t [ s t max D T V ( π 1 ( ⋅ ∣ s t ) ∥ π 2 ( ⋅ ∣ s t ) ) + D T V ( b 1 t ( ⋅ ) ∥ b 2 t ( ⋅ ) ) ]
Suppose the first item is bounded as max s t D T V ( π 1 ( ⋅ ∣ s t ) ∥ π 2 ( ⋅ ∣ s t ) ) ≤ ϵ π \max_{s_{t}} D_{\mathrm{TV}} \Big( \pi_{1}(\cdot \mid s_{t})\ \|\ \pi_{2}(\cdot \mid s_{t}) \Big) \le \epsilon_{\pi} max s t D T V ( π 1 ( ⋅ ∣ s t ) ∥ π 2 ( ⋅ ∣ s t ) ) ≤ ϵ π , the second item can be bounded as
D T V ( b 1 t ( ⋅ ) ∥ b 2 t ( ⋅ ) ) ≤ t max t E s t − 1 ∼ b 1 t − 1 ( ⋅ ) D T V ( p 1 ( ⋅ ∣ s t − 1 ) ∥ p 2 ( ⋅ ∣ s t − 1 ) ) D_{\mathrm{TV}} \Big( b_{1}^{t}(\cdot)\ \|\ b_{2}^{t}(\cdot) \Big) \le t \max_{t} \mathcal{E}_{s_{t - 1} \sim b_{1}^{t - 1}(\cdot)} D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid s_{t - 1})\ \|\ p_{2}(\cdot \mid s_{t - 1}) \Big)
D T V ( b 1 t ( ⋅ ) ∥ b 2 t ( ⋅ ) ) ≤ t t max E s t − 1 ∼ b 1 t − 1 ( ⋅ ) D T V ( p 1 ( ⋅ ∣ s t − 1 ) ∥ p 2 ( ⋅ ∣ s t − 1 ) )
where
D T V ( p 1 ( ⋅ ∣ s t − 1 ) ∥ p 2 ( ⋅ ∣ s t − 1 ) ) = 1 2 ∑ s t ∣ p 1 ( s t ∣ s t − 1 ) − p 2 ( s t ∣ s t − 1 ) ∣ = 1 2 ∑ s t ∣ ∑ a t − 1 p 1 ( s t , a t − 1 ∣ s t − 1 ) − ∑ a t − 1 p 2 ( s t , a t − 1 ∣ s t − 1 ) ∣ ≤ 1 2 ∑ s t ∑ a t − 1 ∣ p 1 ( s t , a t − 1 ∣ s t − 1 ) − p 2 ( s t , a t − 1 ∣ s t − 1 ) ∣ = D T V ( p 1 ( ⋅ , ⋅ ∣ s t − 1 ) ∥ p 2 ( ⋅ , ⋅ ∣ s t − 1 ) ) ≤ E a t − 1 ∼ π 1 ( ⋅ ∣ s t − 1 ) D T V ( p 1 ( ⋅ ∣ s t − 1 , a t − 1 ) ∥ p 2 ( ⋅ ∣ s t − 1 , a t − 1 ) ) + D T V ( π 1 ( ⋅ ∣ s t − 1 ) ∥ π 2 ( ⋅ ∣ s t − 1 ) ) ≤ E a t − 1 ∼ π 1 ( ⋅ ∣ s t − 1 ) D T V ( p 1 ( ⋅ ∣ s t − 1 , a t − 1 ) ∥ p 2 ( ⋅ ∣ s t − 1 , a t − 1 ) ) + ϵ π \begin{aligned}
&D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid s_{t - 1})\ \|\ p_{2}(\cdot \mid s_{t - 1}) \Big) \\[7mm]
=\ &\frac{1}{2} \sum_{s_{t}} \Big| p_{1}(s_{t} \mid s_{t - 1}) - p_{2}(s_{t} \mid s_{t - 1}) \Big| = \frac{1}{2} \sum_{s_{t}} \Big| \sum_{a_{t - 1}} p_{1}(s_{t},\ a_{t - 1} \mid s_{t - 1}) - \sum_{a_{t - 1}} p_{2}(s_{t},\ a_{t - 1} \mid s_{t - 1}) \Big| \\[7mm]
\le\ &\frac{1}{2} \sum_{s_{t}} \sum_{a_{t - 1}} \Big| p_{1}(s_{t},\ a_{t - 1} \mid s_{t - 1}) - p_{2}(s_{t},\ a_{t - 1} \mid s_{t - 1}) \Big| = D_{\mathrm{TV}} \Big( p_{1}(\cdot,\ \cdot \mid s_{t - 1})\ \|\ p_{2}(\cdot,\ \cdot \mid s_{t - 1}) \Big) \\[7mm]
\le\ &\mathcal{E}_{a_{t - 1} \sim \pi_{1}(\cdot \mid s_{t - 1})} D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid s_{t - 1},\ a_{t - 1})\ \|\ p_{2}(\cdot \mid s_{t - 1},\ a_{t - 1}) \Big) + D_{\mathrm{TV}} \Big( \pi_{1}(\cdot \mid s_{t - 1})\ \|\ \pi_{2}(\cdot \mid s_{t - 1}) \Big) \\[7mm]
\le\ &\mathcal{E}_{a_{t - 1} \sim \pi_{1}(\cdot \mid s_{t - 1})} D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid s_{t - 1},\ a_{t - 1})\ \|\ p_{2}(\cdot \mid s_{t - 1},\ a_{t - 1}) \Big) + \epsilon_{\pi}
\end{aligned}
= ≤ ≤ ≤ D T V ( p 1 ( ⋅ ∣ s t − 1 ) ∥ p 2 ( ⋅ ∣ s t − 1 ) ) 2 1 s t ∑ ∣ ∣ ∣ ∣ p 1 ( s t ∣ s t − 1 ) − p 2 ( s t ∣ s t − 1 ) ∣ ∣ ∣ ∣ = 2 1 s t ∑ ∣ ∣ ∣ ∣ a t − 1 ∑ p 1 ( s t , a t − 1 ∣ s t − 1 ) − a t − 1 ∑ p 2 ( s t , a t − 1 ∣ s t − 1 ) ∣ ∣ ∣ ∣ 2 1 s t ∑ a t − 1 ∑ ∣ ∣ ∣ ∣ p 1 ( s t , a t − 1 ∣ s t − 1 ) − p 2 ( s t , a t − 1 ∣ s t − 1 ) ∣ ∣ ∣ ∣ = D T V ( p 1 ( ⋅ , ⋅ ∣ s t − 1 ) ∥ p 2 ( ⋅ , ⋅ ∣ s t − 1 ) ) E a t − 1 ∼ π 1 ( ⋅ ∣ s t − 1 ) D T V ( p 1 ( ⋅ ∣ s t − 1 , a t − 1 ) ∥ p 2 ( ⋅ ∣ s t − 1 , a t − 1 ) ) + D T V ( π 1 ( ⋅ ∣ s t − 1 ) ∥ π 2 ( ⋅ ∣ s t − 1 ) ) E a t − 1 ∼ π 1 ( ⋅ ∣ s t − 1 ) D T V ( p 1 ( ⋅ ∣ s t − 1 , a t − 1 ) ∥ p 2 ( ⋅ ∣ s t − 1 , a t − 1 ) ) + ϵ π
Suppose the expected total variance distance between two dynamics model can be bounded as
max t E s t ∼ b 1 t ( ⋅ ) E a t ∼ π 1 ( ⋅ ∣ s t ) D T V ( p 1 ( ⋅ ∣ s t , a t ) ∥ p 2 ( ⋅ ∣ s t , a t ) ) ≤ ϵ m \max_{t} \mathcal{E}_{s_{t} \sim b_{1}^{t}(\cdot)} \mathcal{E}_{a_{t} \sim \pi_{1}(\cdot \mid s_{t})} D_{\mathrm{TV}} \Big( p_{1}(\cdot \mid s_{t},\ a_{t})\ \|\ p_{2}(\cdot \mid s_{t},\ a_{t}) \Big) \le \epsilon_{m}
t max E s t ∼ b 1 t ( ⋅ ) E a t ∼ π 1 ( ⋅ ∣ s t ) D T V ( p 1 ( ⋅ ∣ s t , a t ) ∥ p 2 ( ⋅ ∣ s t , a t ) ) ≤ ϵ m
then
D T V ( b 1 t ( ⋅ ) ∥ b 2 t ( ⋅ ) ) ≤ t ( ϵ m + ϵ π ) D_{\mathrm{TV}} \Big( b_{1}^{t}(\cdot)\ \|\ b_{2}^{t}(\cdot) \Big) \le t (\epsilon_{m} + \epsilon_{\pi})
D T V ( b 1 t ( ⋅ ) ∥ b 2 t ( ⋅ ) ) ≤ t ( ϵ m + ϵ π )
Thus, plugging this back and get
∣ η 1 − η 2 ∣ ≤ 2 r max ∑ t = 0 ∞ γ t [ ϵ π + t ( ϵ m + ϵ π ) ] = 2 r max [ 1 1 − γ ϵ π + γ ( 1 − γ ) 2 ( ϵ m + ϵ π ) ] |\eta_{1} - \eta_{2}| \le 2 r_{\max} \sum_{t = 0}^{\infty} \gamma^{t} \Big[ \epsilon_{\pi} + t(\epsilon_{m} + \epsilon_{\pi}) \Big] = 2 r_{\max} \left[ \frac{1}{1 - \gamma} \epsilon_{\pi} + \frac{\gamma}{(1 - \gamma)^{2}} (\epsilon_{m} + \epsilon_{\pi}) \right]
∣ η 1 − η 2 ∣ ≤ 2 r m a x t = 0 ∑ ∞ γ t [ ϵ π + t ( ϵ m + ϵ π ) ] = 2 r m a x [ 1 − γ 1 ϵ π + ( 1 − γ ) 2 γ ( ϵ m + ϵ π ) ]
Lemma ④ Branched Rollout Occupancy Measurement TVD Bound
Run a branched k k k step rollout started from a branch point t − k t - k t − k and generate two trajectories through
Pre.Dynamics
Pre.Policy
Post.Dynamics
Post.Policy
p 1 p r e ( s t + 1 ∣ s t , a t ) p_{1}^{\mathrm{pre}}(s_{t + 1} \mid s_{t},\ a_{t}) p 1 p r e ( s t + 1 ∣ s t , a t )
π 1 p r e ( a t ∣ s t ) \pi_{1}^{\mathrm{pre}}(a_{t} \mid s_{t}) π 1 p r e ( a t ∣ s t )
p 1 p o s t ( s t + 1 ∣ s t , a t ) p_{1}^{\mathrm{post}}(s_{t + 1} \mid s_{t},\ a_{t}) p 1 p o s t ( s t + 1 ∣ s t , a t )
π 1 p o s t ( a t ∣ s t ) \pi_{1}^{\mathrm{post}}(a_{t} \mid s_{t}) π 1 p o s t ( a t ∣ s t )
p 2 p r e ( s t + 1 ∣ s t , a t ) p_{2}^{\mathrm{pre}}(s_{t + 1} \mid s_{t},\ a_{t}) p 2 p r e ( s t + 1 ∣ s t , a t )
π 2 p r e ( a t ∣ s t ) \pi_{2}^{\mathrm{pre}}(a_{t} \mid s_{t}) π 2 p r e ( a t ∣ s t )
p 2 p o s t ( s t + 1 ∣ s t , a t ) p_{2}^{\mathrm{post}}(s_{t + 1} \mid s_{t},\ a_{t}) p 2 p o s t ( s t + 1 ∣ s t , a t )
π 2 p o s t ( a t ∣ s t ) \pi_{2}^{\mathrm{post}}(a_{t} \mid s_{t}) π 2 p o s t ( a t ∣ s t )
For t ≥ k t \ge k t ≥ k , the total variance distance between state-action marginals at time step t t t is bounded as
D T V ( p 1 t ( ⋅ , ⋅ ) ∥ p 2 t ( ⋅ , ⋅ ) ) ≤ max s D T V ( π 1 p o s t ( ⋅ ∣ s ) ∥ π 2 p o s t ( ⋅ ∣ s ) ) + D T V ( b 1 t ( ⋅ ) ∥ b 2 t ( ⋅ ) ) D_{\mathrm{TV}} \Big( p_{1}^{t}(\cdot,\ \cdot)\ \|\ p_{2}^{t}(\cdot,\ \cdot) \Big) \le \max_{s} D_{\mathrm{TV}} \Big( \pi_{1}^{\mathrm{post}}(\cdot \mid s)\ \|\ \pi_{2}^{\mathrm{post}}(\cdot \mid s) \Big) + D_{\mathrm{TV}} \Big( b_{1}^{t}(\cdot)\ \|\ b_{2}^{t}(\cdot) \Big)
D T V ( p 1 t ( ⋅ , ⋅ ) ∥ p 2 t ( ⋅ , ⋅ ) ) ≤ s max D T V ( π 1 p o s t ( ⋅ ∣ s ) ∥ π 2 p o s t ( ⋅ ∣ s ) ) + D T V ( b 1 t ( ⋅ ) ∥ b 2 t ( ⋅ ) )
The first item is supposed to be bounded by ϵ π p o s t \epsilon_{\pi}^{\mathrm{post}} ϵ π p o s t , and the second item can be bounded by
ϵ t p o s t = D T V ( b 1 t ( ⋅ ) ∥ b 2 t ( ⋅ ) ) ≤ ϵ m p o s t + ϵ π p o s t + ϵ t − 1 p o s t ≤ ⋯ ≤ k ( ϵ m p o s t + ϵ π p o s t ) + ϵ t − k p o s t \epsilon_{t}^{\mathrm{post}} = D_{\mathrm{TV}} \Big( b_{1}^{t}(\cdot)\ \|\ b_{2}^{t}(\cdot) \Big) \le \epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}} + \epsilon_{t - 1}^{\mathrm{post}} \le \cdots \le k (\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \epsilon_{t - k}^{\mathrm{post}}
ϵ t p o s t = D T V ( b 1 t ( ⋅ ) ∥ b 2 t ( ⋅ ) ) ≤ ϵ m p o s t + ϵ π p o s t + ϵ t − 1 p o s t ≤ ⋯ ≤ k ( ϵ m p o s t + ϵ π p o s t ) + ϵ t − k p o s t
where ϵ m p o s t \epsilon_{m}^{\mathrm{post}} ϵ m p o s t bounds the expectation of post-branch dynamics distribution as
max t − k < τ ≤ t E s τ − 1 ∼ b 1 τ − 1 ( ⋅ ) E a τ − 1 ∼ π 1 ( ⋅ ∣ s τ − 1 ) D T V ( p 1 p o s t ( ⋅ ∣ s τ − 1 , a τ − 1 ) ∥ p 2 p o s t ( ⋅ ∣ s τ − 1 , a τ − 1 ) ) ≤ ϵ m p o s t \max_{t - k < \tau \le t} \mathcal{E}_{s_{\tau - 1} \sim b_{1}^{\tau - 1}(\cdot)} \mathcal{E}_{a_{\tau - 1} \sim \pi_{1}(\cdot \mid s_{\tau - 1})} D_{\mathrm{TV}} \Big( p_{1}^{\mathrm{post}}(\cdot \mid s_{\tau - 1},\ a_{\tau - 1})\ \|\ p_{2}^{\mathrm{post}}(\cdot \mid s_{\tau - 1},\ a_{\tau - 1}) \Big) \le \epsilon_{m}^{\mathrm{post}}
t − k < τ ≤ t max E s τ − 1 ∼ b 1 τ − 1 ( ⋅ ) E a τ − 1 ∼ π 1 ( ⋅ ∣ s τ − 1 ) D T V ( p 1 p o s t ( ⋅ ∣ s τ − 1 , a τ − 1 ) ∥ p 2 p o s t ( ⋅ ∣ s τ − 1 , a τ − 1 ) ) ≤ ϵ m p o s t
The ϵ t − k p o s t \epsilon_{t - k}^{\mathrm{post}} ϵ t − k p o s t can be further bounded through pre-branch error bounds analogously
ϵ t − k p o s t = ϵ t − k p r e = D T V ( b 1 t − k ( ⋅ ) ∥ b 2 t − k ( ⋅ ) ) ≤ ϵ m p r e + ϵ π p r e + ϵ t − k − 1 p r e ≤ ⋯ ≤ ( t − k ) ( ϵ m p r e + ϵ π p r e ) + ϵ 0 p r e = ( t − k ) ( ϵ m p r e + ϵ π p r e ) \epsilon_{t - k}^{\mathrm{post}} = \epsilon_{t - k}^{\mathrm{pre}} = D_{\mathrm{TV}} \Big( b_{1}^{t - k}(\cdot)\ \|\ b_{2}^{t - k}(\cdot) \Big) \le \epsilon_{m}^{\mathrm{pre}} + \epsilon_{\pi}^{\mathrm{pre}} + \epsilon_{t - k - 1}^{\mathrm{pre}} \le \cdots \le (t - k)(\epsilon_{m}^{\mathrm{pre}} + \epsilon_{\pi}^{\mathrm{pre}}) + \epsilon_{0}^{\mathrm{pre}} = (t - k)(\epsilon_{m}^{\mathrm{pre}} + \epsilon_{\pi}^{\mathrm{pre}})
ϵ t − k p o s t = ϵ t − k p r e = D T V ( b 1 t − k ( ⋅ ) ∥ b 2 t − k ( ⋅ ) ) ≤ ϵ m p r e + ϵ π p r e + ϵ t − k − 1 p r e ≤ ⋯ ≤ ( t − k ) ( ϵ m p r e + ϵ π p r e ) + ϵ 0 p r e = ( t − k ) ( ϵ m p r e + ϵ π p r e )
where ϵ m p r e \epsilon_{m}^{\mathrm{pre}} ϵ m p r e and ϵ π p r e \epsilon_{\pi}^{\mathrm{pre}} ϵ π p r e is defined similarly as ϵ m p o s t \epsilon_{m}^{\mathrm{post}} ϵ m p o s t and ϵ π p r e \epsilon_{\pi}^{\mathrm{pre}} ϵ π p r e respectively. The origin inequaion can be rewritten as
D T V ( p 1 t ( ⋅ , ⋅ ) ∥ p 2 t ( ⋅ , ⋅ ) ) ≤ ( t − k ) ( ϵ m p r e + ϵ π p r e ) + k ( ϵ m p o s t + ϵ π p o s t ) + ϵ π p o s t D_{\mathrm{TV}} \Big( p_{1}^{t}(\cdot,\ \cdot)\ \|\ p_{2}^{t}(\cdot,\ \cdot) \Big) \le (t - k) (\epsilon_{m}^{\mathrm{pre}} + \epsilon_{\pi}^{\mathrm{pre}}) + k (\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \epsilon_{\pi}^{\mathrm{post}}
D T V ( p 1 t ( ⋅ , ⋅ ) ∥ p 2 t ( ⋅ , ⋅ ) ) ≤ ( t − k ) ( ϵ m p r e + ϵ π p r e ) + k ( ϵ m p o s t + ϵ π p o s t ) + ϵ π p o s t
For t < k t < k t < k , the trajectories are generated completely by post-branch dynamics and policy, then
D T V ( p 1 t ( ⋅ , ⋅ ) ∥ p 2 t ( ⋅ , ⋅ ) ) ≤ t ( ϵ m p o s t + ϵ π p o s t ) + ϵ π p o s t ≤ k ( ϵ m p o s t + ϵ π p o s t ) + ϵ π p o s t D_{\mathrm{TV}} \Big( p_{1}^{t}(\cdot,\ \cdot)\ \|\ p_{2}^{t}(\cdot,\ \cdot) \Big) \le t (\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \epsilon_{\pi}^{\mathrm{post}} \le k (\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \epsilon_{\pi}^{\mathrm{post}}
D T V ( p 1 t ( ⋅ , ⋅ ) ∥ p 2 t ( ⋅ , ⋅ ) ) ≤ t ( ϵ m p o s t + ϵ π p o s t ) + ϵ π p o s t ≤ k ( ϵ m p o s t + ϵ π p o s t ) + ϵ π p o s t
The difference of occupancy measures derived from the aforementioned state-action marginals is bounded as
D T V ( ρ 1 ( ⋅ , ⋅ ) ∥ ρ 2 ( ⋅ , ⋅ ) ) = 1 2 ( 1 − γ ) ∣ ∑ t = 0 ∞ γ t ∑ s t ∑ a t [ p 1 t ( s t , a t ) − p 2 t ( s t , a t ) ] ∣ ≤ ( 1 − γ ) [ ∑ t = 0 k − 1 γ t D T V ( p 1 t ( ⋅ , ⋅ ) ∥ p 2 t ( ⋅ , ⋅ ) ) + ∑ t = k ∞ γ t D T V ( p 1 t ( ⋅ , ⋅ ) ∥ p 2 t ( ⋅ , ⋅ ) ) ] ≤ ( 1 − γ ) [ ∑ t = 0 k − 1 γ t [ k ( ϵ m p o s t + ϵ π p o s t ) + ϵ π p o s t ] + ∑ t = k ∞ γ t [ ( t − k ) ( ϵ m p r e + ϵ π p r e ) + k ( ϵ m p o s t + ϵ π p o s t ) + ϵ π p o s t ] ] = ( 1 − γ ) [ 1 − γ k 1 − γ k ( ϵ m p o s t + ϵ π p o s t ) + 1 − γ k 1 − γ ϵ π p o s t ⏟ t < k + γ k + 1 ( 1 − γ ) 2 ( ϵ m p r e + ϵ π p r e ) + γ k 1 − γ k ( ϵ m p o s t + ϵ π p o s t ) + γ k 1 − γ ϵ π p o s t ⏟ t ≥ k ] \begin{aligned}
&D_{\mathrm{TV}} \Big( \rho_{1}(\cdot,\ \cdot)\ \|\ \rho_{2}(\cdot,\ \cdot) \Big) = \frac{1}{2} (1 - \gamma) \left| \sum_{t = 0}^{\infty} \gamma^{t} \sum_{s_{t}} \sum_{a_{t}} \Big[ p_{1}^{t}(s_{t},\ a_{t}) - p_{2}^{t}(s_{t},\ a_{t}) \Big] \right| \\[7mm]
\le\ &(1 - \gamma) \left[ \sum_{t = 0}^{k - 1} \gamma^{t} D_{\mathrm{TV}} \Big( p_{1}^{t}(\cdot,\ \cdot)\ \|\ p_{2}^{t}(\cdot,\ \cdot) \Big) + \sum_{t = k}^{\infty} \gamma^{t} D_{\mathrm{TV}} \Big( p_{1}^{t}(\cdot,\ \cdot)\ \|\ p_{2}^{t}(\cdot,\ \cdot) \Big) \right] \\[7mm]
\le\ &(1 - \gamma) \left[ \sum_{t = 0}^{k - 1} \gamma^{t} \Big[ k (\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \epsilon_{\pi}^{\mathrm{post}} \Big] + \sum_{t = k}^{\infty} \gamma_{t} \Big[ (t - k) (\epsilon_{m}^{\mathrm{pre}} + \epsilon_{\pi}^{\mathrm{pre}}) + k (\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \epsilon_{\pi}^{\mathrm{post}} \Big] \right] \\[7mm]
=\ &(1 - \gamma) \left[ \underset{t < k}{\underbrace{\frac{1 - \gamma^{k}}{1 - \gamma} k (\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \frac{1 - \gamma^{k}}{1 - \gamma} \epsilon_{\pi}^{\mathrm{post}}}} + \underset{t \ge k}{\underbrace{\frac{\gamma^{k + 1}}{(1 - \gamma)^{2}} (\epsilon_{m}^{\mathrm{pre}} + \epsilon_{\pi}^{\mathrm{pre}}) + \frac{\gamma^{k}}{1 - \gamma} k (\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \frac{\gamma^{k}}{1 - \gamma} \epsilon_{\pi}^{\mathrm{post}}}} \right]
\end{aligned}
≤ ≤ = D T V ( ρ 1 ( ⋅ , ⋅ ) ∥ ρ 2 ( ⋅ , ⋅ ) ) = 2 1 ( 1 − γ ) ∣ ∣ ∣ ∣ ∣ ∣ t = 0 ∑ ∞ γ t s t ∑ a t ∑ [ p 1 t ( s t , a t ) − p 2 t ( s t , a t ) ] ∣ ∣ ∣ ∣ ∣ ∣ ( 1 − γ ) [ t = 0 ∑ k − 1 γ t D T V ( p 1 t ( ⋅ , ⋅ ) ∥ p 2 t ( ⋅ , ⋅ ) ) + t = k ∑ ∞ γ t D T V ( p 1 t ( ⋅ , ⋅ ) ∥ p 2 t ( ⋅ , ⋅ ) ) ] ( 1 − γ ) [ t = 0 ∑ k − 1 γ t [ k ( ϵ m p o s t + ϵ π p o s t ) + ϵ π p o s t ] + t = k ∑ ∞ γ t [ ( t − k ) ( ϵ m p r e + ϵ π p r e ) + k ( ϵ m p o s t + ϵ π p o s t ) + ϵ π p o s t ] ] ( 1 − γ ) ⎣ ⎢ ⎢ ⎢ ⎢ ⎡ t < k 1 − γ 1 − γ k k ( ϵ m p o s t + ϵ π p o s t ) + 1 − γ 1 − γ k ϵ π p o s t + t ≥ k ( 1 − γ ) 2 γ k + 1 ( ϵ m p r e + ϵ π p r e ) + 1 − γ γ k k ( ϵ m p o s t + ϵ π p o s t ) + 1 − γ γ k ϵ π p o s t ⎦ ⎥ ⎥ ⎥ ⎥ ⎤
Unite the like terms and get the final result
D T V ( ρ 1 ( ⋅ , ⋅ ) ∥ ρ 2 ( ⋅ , ⋅ ) ) ≤ k ( ϵ m p o s t + ϵ π p o s t ) + ϵ π p o s t + γ k + 1 1 − γ ( ϵ m p r e + ϵ π p r e ) D_{\mathrm{TV}} \Big( \rho_{1}(\cdot,\ \cdot)\ \|\ \rho_{2}(\cdot,\ \cdot) \Big) \le k(\epsilon_{m}^{\mathrm{post}} + \epsilon_{\pi}^{\mathrm{post}}) + \epsilon_{\pi}^{\mathrm{post}} + \frac{\gamma^{k + 1}}{1 - \gamma} (\epsilon_{m}^{\mathrm{pre}} + \epsilon_{\pi}^{\mathrm{pre}})
D T V ( ρ 1 ( ⋅ , ⋅ ) ∥ ρ 2 ( ⋅ , ⋅ ) ) ≤ k ( ϵ m p o s t + ϵ π p o s t ) + ϵ π p o s t + 1 − γ γ k + 1 ( ϵ m p r e + ϵ π p r e )