Markov Decision Process


Basic Definitions

An MDP extends an MRP with actions. An MDP model $\mathscr{M}$ can be formally defined as the tuple $\langle \mathcal{S},\ \mathcal{A},\ \mathcal{P},\ \mathcal{R},\ \gamma \rangle$, whose components (sketched as arrays after this list) are:

  1. a finite state set $\mathcal{S} = \{ s_{1},\ s_{2},\ \cdots,\ s_{n} \}$
  2. a finite action set $\mathcal{A} = \{ a_{1},\ a_{2},\ \cdots,\ a_{m} \}$
  3. a state transition probability tensor $\mathcal{P} = \{ p_{ij}^{(a)} = p(S_{t + 1} = s_{i} \mid S_{t} = s_{j},\ A_{t} = a) \}_{m \times n \times n}$
  4. a reward function $\mathcal{R} : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R},\ \mathcal{R}(s,\ a) = \mathcal{E}(R_{t + 1} \mid S_{t} = s,\ A_{t} = a)$
  5. a discount factor $\gamma \in [0,\ 1)$
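
To make the tuple concrete, here is a minimal NumPy sketch of such a model; the array layout (`P[a, s, s']`, `R[s, a]`) and the toy numbers are illustrative assumptions, not part of the formal definition.

```python
import numpy as np

# A toy MDP with n = 2 states and m = 2 actions, stored as plain arrays.
# Assumed layout: P[a, s, s'] = p(S_{t+1} = s' | S_t = s, A_t = a),
#                 R[s, a]     = E(R_{t+1}   | S_t = s, A_t = a).
n_states, n_actions = 2, 2

P = np.array([
    [[0.7, 0.3],    # action 0, from state 0
     [0.4, 0.6]],   # action 0, from state 1
    [[0.2, 0.8],    # action 1, from state 0
     [0.9, 0.1]],   # action 1, from state 1
])
R = np.array([
    [1.0, 0.0],     # expected rewards in state 0 for actions 0 / 1
    [0.5, 2.0],     # expected rewards in state 1 for actions 0 / 1
])
gamma = 0.9         # discount factor in [0, 1)

# Every P[a, s, :] must be a probability distribution over next states.
assert np.allclose(P.sum(axis=2), 1.0)
```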

Decision making is essentially the process of selecting actions from historical information, and the decision policy $\pi_{t}$ can be modeled in several different classes:

| Policy | Deterministic | Stochastic |
| :-: | :-: | :-: |
| Markovian | $\pi_{t} : s_{t} \in \mathcal{S} \mapsto a_{t} \in \mathcal{A}$ | $\pi_{t} : \langle a_{t} \mid s_{t} \rangle \in \mathcal{S} \times \mathcal{A} \mapsto p(a_{t} \mid s_{t}) \in [0,\ 1]$ |
| History-based | $\pi_{t} : h_{t} \in \mathcal{H}(t) = \mathcal{S}^{t} \times \mathcal{A}^{t - 1} \mapsto a_{t} \in \mathcal{A}$ | $\pi_{t} : \langle a_{t} \mid h_{t} \rangle \in \mathcal{H}(t) \times \mathcal{A} \mapsto p(a_{t} \mid h_{t}) \in [0,\ 1]$ |

The policy spaces of these classes are nested: a deterministic policy is a special case of a stochastic one, and a Markovian policy is a special case of a history-based one, so in particular $\Pi^{\mathrm{MA}} \subseteq \Pi^{\mathrm{HA}}$.

A stationary policy is one that does not vary with time, i.e. $\forall\ t,\ \pi_{t} = \pi$. Among stationary policies, the stationary Markovian deterministic policies are the most widely studied:

$$\pi \in \mathcal{D} : s \in \mathcal{S} \mapsto a \in \mathcal{A}$$

Additionally, a stationary Markovian stochastic policy is defined as:

$$\pi \in \mathcal{D}^{\mathrm{A}} : \langle a \mid s \rangle \in \mathcal{S} \times \mathcal{A} \mapsto p(a \mid s) \in [0,\ 1]$$

As in the MRP case, the state-value function under a policy $\pi$ is defined as the expected return obtained from being in state $s$ at time $t$:

$$v_{\pi}(s) = \mathcal{E}(G_{t} \mid S_{t} = s)$$

The action-value function under a policy $\pi$ is defined as the expected return obtained from being in state $s$ at time $t$ and taking action $a$:

$$q(s,\ a \mid v_{\pi}) = q_{\pi}(s,\ a) = \mathcal{E}(G_{t} \mid S_{t} = s,\ A_{t} = a)$$
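
Both definitions can be checked empirically by averaging sampled returns. The sketch below is a minimal Monte Carlo estimate of $v_{\pi}(s)$ and $q_{\pi}(s,\ a)$ for a hypothetical random MDP and stationary stochastic policy; the array layouts `P[a, s, s']`, `R[s, a]`, `pi[s, a]` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9

# Hypothetical MDP and stationary stochastic policy pi(a | s) = pi[s, a].
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)

def sample_return(s, a=None, horizon=100):
    """One truncated discounted return G_t starting from s (optionally forcing a_t = a)."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        if a is None:
            a = rng.choice(nA, p=pi[s])         # a_t ~ pi(. | s_t)
        g += discount * R[s, a]                 # accumulate gamma^tau * R(s, a)
        discount *= gamma
        s = rng.choice(nS, p=P[a, s])           # s_{t+1} ~ p(. | s_t, a_t)
        a = None                                # only the first action is forced
    return g

# Noisy Monte Carlo estimates of v_pi(s=0) and q_pi(s=0, a=1).
v_hat = np.mean([sample_return(0) for _ in range(1000)])
q_hat = np.mean([sample_return(0, 1) for _ in range(1000)])
print(v_hat, q_hat)
```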

Policy Equivalence

Since the state transitions of an MDP are Markovian, for any $\pi \in \Pi^{\mathrm{HA}}$ one can construct a $\pi' \in \Pi^{\mathrm{MA}}$ that attains the same value function:

$$\pi'(a_{t} \mid s_{t}) = p_{\pi'}(a_{t} \mid s_{t}) = p_{\pi}(a_{t} \mid s_{t},\ s_{0})$$

The Markovian stochastic policy $\pi'$ constructed in this way satisfies $p_{\pi'}(s_{t},\ a_{t} \mid s_{0}) = p_{\pi}(s_{t},\ a_{t} \mid s_{0})$, which can be proved by induction:

$$p_{\pi'}(s_{0},\ a_{0} \mid s_{0}) = p_{\pi'}(a_{0} \mid s_{0}) = p_{\pi}(a_{0} \mid s_{0},\ s_{0}) = p_{\pi}(s_{0},\ a_{0} \mid s_{0})$$

t=0t = 0 时等式成立的基础上,假设等式在 t1t - 1 时成立,可得:

$$
\begin{aligned}
p_{\pi}(s_{t} \mid s_{0}) &= \sum_{s_{t - 1}} \sum_{a_{t - 1}} p_{\pi}(s_{t - 1},\ a_{t - 1} \mid s_{0}) p_{\pi}(s_{t} \mid s_{t - 1},\ a_{t - 1},\ s_{0}) \\[7mm]
&= \sum_{s_{t - 1}} \sum_{a_{t - 1}} p_{\pi'}(s_{t - 1},\ a_{t - 1} \mid s_{0}) p(s_{t} \mid s_{t - 1},\ a_{t - 1}) \\[7mm]
&= \sum_{s_{t - 1}} \sum_{a_{t - 1}} p_{\pi'}(s_{t - 1},\ a_{t - 1} \mid s_{0}) p_{\pi'}(s_{t} \mid s_{t - 1},\ a_{t - 1},\ s_{0}) = p_{\pi'}(s_{t} \mid s_{0})
\end{aligned}
$$

which in turn shows that the equality also holds at $t$:

$$p_{\pi'}(s_{t},\ a_{t} \mid s_{0}) = \underset{p_{\pi'}(a_{t} \mid s_{t})}{\underbrace{p_{\pi'}(a_{t} \mid s_{t},\ s_{0})}} \underset{p_{\pi}(s_{t} \mid s_{0})}{\underbrace{p_{\pi'}(s_{t} \mid s_{0})}} = p_{\pi}(a_{t} \mid s_{t},\ s_{0}) p_{\pi}(s_{t} \mid s_{0}) = p_{\pi}(s_{t},\ a_{t} \mid s_{0})$$

Note that although the value functions under different policies take different forms, each is a linear combination of terms of the form $\mathcal{E}(R_{t + 1 + \Delta t} \mid S_{t} = s_{t})$. Fixing the conditioning time at $0$, each such term satisfies:

$$
\begin{aligned}
\mathcal{E}_{\pi}(R_{t + 1} \mid S_{0} = s_{0}) &= \sum_{s_{t}} \sum_{a_{t}} \mathcal{E}_{\pi}(R_{t + 1} \mid S_{t} = s_{t},\ A_{t} = a_{t},\ \xcancel{S_{0} = s_{0}}) p_{\pi}(s_{t},\ a_{t} \mid s_{0}) \\[7mm]
&= \sum_{s_{t}} \sum_{a_{t}} \mathcal{R}(s_{t},\ a_{t}) p_{\pi}(s_{t},\ a_{t} \mid s_{0})
\end{aligned}
$$

That is, given the initial state (or an initial state distribution), any policies sharing the same $p(s_{t},\ a_{t} \mid s_{0})$ have the same value function. In other words, $\pi \in \Pi^{\mathrm{HA}}$ and $\pi' \in \Pi^{\mathrm{MA}}$ are equivalent in terms of the state values they attain, so only Markovian stochastic policies are considered from here on.
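
This equivalence can be checked numerically. The sketch below is illustrative only: it assumes a small random MDP with layout `P[a, s, s']` and a contrived history-dependent policy `pi_hist`, enumerates all short trajectories to obtain $p_{\pi}(s_{t},\ a_{t} \mid s_{0})$, builds the Markovian policy $\pi'_{t}(a \mid s) = p_{\pi}(a_{t} \mid s_{t},\ s_{0})$, and verifies that it induces the same joint probabilities.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
nS, nA, T, s0 = 3, 2, 4, 0

# Hypothetical dynamics with layout P[a, s, s'] = p(s' | s, a).
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)

def pi_hist(history):
    """A deliberately history-dependent stochastic policy pi(a_t | h_t), nA = 2."""
    h = abs(hash(history)) % 97
    w = np.array([1.0 + h % 7, 1.0 + h % 5])
    return w / w.sum()

# p_pi(s_t, a_t | s_0) by exhaustive enumeration of length-T trajectories.
joint_pi = np.zeros((T, nS, nA))
for acts in itertools.product(range(nA), repeat=T):
    for rest in itertools.product(range(nS), repeat=T - 1):
        states = (s0,) + rest
        hist, prob = (s0,), 1.0
        for t in range(T):
            prob *= pi_hist(hist)[acts[t]]                   # pi(a_t | h_t)
            if t < T - 1:
                prob *= P[acts[t], states[t], states[t + 1]] # p(s_{t+1} | s_t, a_t)
                hist += (acts[t], states[t + 1])
        for t in range(T):
            joint_pi[t, states[t], acts[t]] += prob          # marginalize the rest

# Construct the (non-stationary) Markov policy pi'_t(a | s) = p_pi(a_t | s_t, s_0).
marg_s = joint_pi.sum(axis=2, keepdims=True)                 # p_pi(s_t | s_0)
pi_markov = np.where(marg_s > 0, joint_pi / np.maximum(marg_s, 1e-300), 1.0 / nA)

# p_pi'(s_t, a_t | s_0) by forward recursion under pi'.
joint_markov = np.zeros_like(joint_pi)
ps = np.zeros(nS)
ps[s0] = 1.0
for t in range(T):
    joint_markov[t] = ps[:, None] * pi_markov[t]
    ps = np.einsum('sa,asd->d', joint_markov[t], P)          # p(s_{t+1} | s_0)

print(np.abs(joint_pi - joint_markov).max())                 # ~0: same occupancies
```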

Bellman Expectation Equation

Starting from a history-based policy $a_{t} \sim \pi(\cdot \mid h_{t})$, expand the action-value function into nested expectations:

$$
\begin{aligned}
q_{\pi}(s_{t},\ a_{t}) &= \mathcal{E}(G_{t} \mid S_{t} = s_{t},\ A_{t} = a_{t}) = \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{r_{t + 1} \sim r(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid h_{t + 1})} \mathcal{E}_{r_{t + 2} \sim r(\cdot \mid s_{t + 1},\ a_{t + 1})} \cdots \left[ \sum_{\tau = 0}^{\infty} \gamma^{\tau} r_{t + \tau + 1} \right] \\[7mm]
&= \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{r_{t + 1} \sim r(\cdot \mid s_{t},\ a_{t})} \Big[ r_{t + 1} \Big] + \gamma \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid h_{t + 1})} \mathcal{E}_{r_{t + 2} \sim r(\cdot \mid s_{t + 1},\ a_{t + 1})} \Big[ r_{t + 2} \Big] + \cdots \\[7mm]
&= \mathcal{R}(s_{t},\ a_{t}) + \gamma \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid h_{t + 1})} \Big[ \mathcal{R}(s_{t + 1},\ a_{t + 1}) \Big] + \cdots \\[7mm]
&= \sum_{\tau = 0}^{\infty} \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid h_{t + 1})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1},\ a_{t + 1})} \mathcal{E}_{a_{t + 2} \sim \pi(\cdot \mid h_{t + 2})} \cdots \Big[ \gamma^{\tau} \mathcal{R}(s_{t + \tau},\ a_{t + \tau}) \Big] \\[7mm]
&= \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid h_{t + 1})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1},\ a_{t + 1})} \mathcal{E}_{a_{t + 2} \sim \pi(\cdot \mid h_{t + 2})} \cdots \left[ \sum_{\tau = 0}^{\infty} \gamma^{\tau} \mathcal{R}(s_{t + \tau},\ a_{t + \tau}) \right]
\end{aligned}
$$

Similarly, the state-value function expands as:

$$
\begin{aligned}
v_{\pi}(s_{t}) &= \mathcal{E}(G_{t} \mid S_{t} = s_{t}) = \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{a_{t} \sim \pi(\cdot \mid h_{t})} \mathcal{E}_{r_{t + 1} \sim r(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid h_{t + 1})} \mathcal{E}_{r_{t + 2} \sim r(\cdot \mid s_{t + 1},\ a_{t + 1})} \cdots \left[ \sum_{\tau = 0}^{\infty} \gamma^{\tau} r_{t + \tau + 1} \right] \\[7mm]
&= \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{a_{t} \sim \pi(\cdot \mid h_{t})} \mathcal{E}_{r_{t + 1} \sim r(\cdot \mid s_{t},\ a_{t})} \Big[ r_{t + 1} \Big] + \gamma \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{a_{t} \sim \pi(\cdot \mid h_{t})} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid h_{t + 1})} \mathcal{E}_{r_{t + 2} \sim r(\cdot \mid s_{t + 1},\ a_{t + 1})} \Big[ r_{t + 2} \Big] + \cdots \\[7mm]
&= \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{a_{t} \sim \pi(\cdot \mid h_{t})} \Big[ \mathcal{R}(s_{t},\ a_{t}) \Big] + \gamma \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{a_{t} \sim \pi(\cdot \mid h_{t})} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid h_{t + 1})} \Big[ \mathcal{R}(s_{t + 1},\ a_{t + 1}) \Big] + \cdots \\[7mm]
&= \sum_{\tau = 0}^{\infty} \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{a_{t} \sim \pi(\cdot \mid h_{t})} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid h_{t + 1})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1},\ a_{t + 1})} \mathcal{E}_{a_{t + 2} \sim \pi(\cdot \mid h_{t + 2})} \cdots \Big[ \gamma^{\tau} \mathcal{R}(s_{t + \tau},\ a_{t + \tau}) \Big] \\[7mm]
&= \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{a_{t} \sim \pi(\cdot \mid h_{t})} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid h_{t + 1})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1},\ a_{t + 1})} \mathcal{E}_{a_{t + 2} \sim \pi(\cdot \mid h_{t + 2})} \cdots \Big[ \sum_{\tau = 0}^{\infty} \gamma^{\tau} \mathcal{R}(s_{t + \tau},\ a_{t + \tau}) \Big]
\end{aligned}
$$

Under a stationary Markovian stochastic policy $a_{t} \sim \pi(\cdot \mid s_{t})$ with $\pi \in \mathcal{D}^{\mathrm{A}}$, the action- and state-value functions simplify to:

$$
\begin{gathered}
q_{\pi}(s_{t},\ a_{t}) = \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid s_{t + 1})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1},\ a_{t + 1})} \mathcal{E}_{a_{t + 2} \sim \pi(\cdot \mid s_{t + 2})} \cdots \left[ \sum_{\tau = 0}^{\infty} \gamma^{\tau} \mathcal{R}(s_{t + \tau},\ a_{t + \tau}) \right] \\[7mm]
v_{\pi}(s_{t}) = \mathcal{E}_{a_{t} \sim \pi(\cdot \mid s_{t})} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid s_{t + 1})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1},\ a_{t + 1})} \mathcal{E}_{a_{t + 2} \sim \pi(\cdot \mid s_{t + 2})} \cdots \Big[ \sum_{\tau = 0}^{\infty} \gamma^{\tau} \mathcal{R}(s_{t + \tau},\ a_{t + \tau}) \Big]
\end{gathered}
$$

Combining the two definitions, the action-value function can be expanded recursively as:

$$
\begin{aligned}
q_{\pi}(s_{t},\ a_{t}) =\ &\mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid s_{t + 1})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1},\ a_{t + 1})} \mathcal{E}_{a_{t + 2} \sim \pi(\cdot \mid s_{t + 2})} \cdots \Big[ \mathcal{R}(s_{t},\ a_{t}) \Big]\ + \\[7mm]
&\gamma \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \underset{v_{\pi}(s_{t + 1})}{\underbrace{\mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid s_{t + 1})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1},\ a_{t + 1})} \mathcal{E}_{a_{t + 2} \sim \pi(\cdot \mid s_{t + 2})} \cdots \left[ \sum_{\tau = 0}^{\infty} \gamma^{\tau} \mathcal{R}(s_{(t + 1) + \tau},\ a_{(t + 1) + \tau}) \right]}} \\[10mm]
=\ &\mathcal{R}(s_{t},\ a_{t}) + \gamma \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \Big[ v_{\pi}(s_{t + 1}) \Big] = \mathcal{R}(s_{t},\ a_{t}) + \gamma \sum_{s_{t + 1}} v_{\pi}(s_{t + 1}) p(s_{t + 1} \mid s_{t},\ a_{t})
\end{aligned}
$$

The state-value function can likewise be expanded recursively as:

$$
\begin{aligned}
v_{\pi}(s_{t}) &= \mathcal{E}_{a_{t} \sim \pi(\cdot \mid s_{t})} \underset{q_{\pi}(s_{t},\ a_{t})}{\underbrace{\mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid s_{t + 1})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1},\ a_{t + 1})} \mathcal{E}_{a_{t + 2} \sim \pi(\cdot \mid s_{t + 2})} \cdots \Big[ \sum_{\tau = 0}^{\infty} \gamma^{\tau} \mathcal{R}(s_{t + \tau},\ a_{t + \tau}) \Big]}} \\[10mm]
&= \mathcal{E}_{a_{t} \sim \pi(\cdot \mid s_{t})} q_{\pi}(s_{t},\ a_{t}) = \sum_{a_{t}} q_{\pi}(s_{t},\ a_{t}) \pi(a_{t} \mid s_{t})
\end{aligned}
$$

These equations define the Bellman expectation operator $\mathscr{L}_{\pi} : v \in \mathcal{V} = \mathbb{R}^{n} \mapsto \mathcal{V}$:

$$v = \mathscr{L}_{\pi} \{ v \} = \sum_{a} q(s,\ a \mid v) \pi(a \mid s) = \sum_{a} \mathcal{R}(s,\ a) \pi(a \mid s) + \gamma \sum_{s'} v(s') \sum_{a} p(s' \mid s,\ a) \pi(a \mid s) = \mathcal{R}_{\pi} + \gamma \mathcal{P}_{\pi} v$$
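
In this matrix form, $v_{\pi}$ is the fixed point of $\mathscr{L}_{\pi}$ and can be computed either by iterating the operator or by solving the linear system $(I - \gamma \mathcal{P}_{\pi}) v = \mathcal{R}_{\pi}$. A minimal sketch, assuming array layouts `P[a, s, s']`, `R[s, a]` and a stationary stochastic policy `pi[s, a]`:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# Hypothetical MDP and policy (layouts assumed: P[a, s, s'], R[s, a], pi[s, a]).
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)

# R_pi(s) = sum_a R(s, a) pi(a | s),  P_pi(s, s') = sum_a p(s' | s, a) pi(a | s)
R_pi = (R * pi).sum(axis=1)
P_pi = np.einsum('sa,asd->sd', pi, P)

# 1) Fixed-point iteration: v <- L_pi{v} = R_pi + gamma * P_pi v
v = np.zeros(nS)
for _ in range(10_000):
    v, v_old = R_pi + gamma * P_pi @ v, v
    if np.abs(v - v_old).max() < 1e-12:
        break

# 2) Direct solve of (I - gamma * P_pi) v = R_pi
v_direct = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)
print(np.abs(v - v_direct).max())   # the two agree up to the stopping tolerance
```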

Note that the operator $\mathscr{L}_{\pi}$ preserves the partial order on value functions, i.e. whenever $v_{1} \le v_{2}$:

$$\mathscr{L}_{\pi} \{ v_{1} \} = \mathcal{R}_{\pi} + \gamma \mathcal{P}_{\pi} v_{1} \le \mathcal{R}_{\pi} + \gamma \mathcal{P}_{\pi} v_{2} = \mathscr{L}_{\pi} \{ v_{2} \}$$

Bellman Optimality Equation

By the equivalence between $\Pi^{\mathrm{HA}}$ and $\Pi^{\mathrm{MA}}$, the optimal state-value function $v^{\star}$ is defined as:

$$\forall\ s \in \mathcal{S} : v^{\star}(s) = \max_{\pi \in \Pi^{\mathrm{HA}}} v_{\pi}(s) = \max_{\pi \in \Pi^{\mathrm{MA}}} v_{\pi}(s)$$

A policy $\pi^{\star}$ is optimal if and only if $v_{\pi^{\star}} = v^{\star}$. Before solving for the optimal state-value function $v^{\star}$ and an optimal policy $\pi^{\star}$, first define the operator $\mathscr{L}$:

$$\mathscr{L} \{ v \} : \mathcal{V} \mapsto \mathcal{V} = \max_{a \in \mathcal{A}} q(s,\ a \mid v) = \max_{a \in \mathcal{A}} \left\{ \mathcal{R}(s,\ a) + \gamma \sum_{s' \in \mathcal{S}} v(s') p(s' \mid s,\ a) \right\}$$

The Bellman optimality equation $v = \mathscr{L}(v)$ has a unique solution, which is the optimal state-value function $v^{\star}$; the proof, together with the construction of $\pi^{\star}$, proceeds as follows:

  1. Show that $\mathscr{L} \{ v \} = \max_{\pi \in \mathcal{D}} \mathscr{L}_{\pi} \{ v \} = \max_{\delta \in \mathcal{D}^{\mathrm{A}}} \mathscr{L}_{\delta} \{ v \}$

Consider first the space $\mathcal{D}$ of stationary Markovian deterministic policies:

$$\max_{\pi \in \mathcal{D}} \mathscr{L}_{\pi} \{ v \} = \max_{\pi \in \mathcal{D}} \sum_{a \in \mathcal{A}} q(s,\ a \mid v) \pi(a \mid s) = \max_{a \in \mathcal{A}} q(s,\ a \mid v) = \mathscr{L} \{ v \}$$

This claim is easily proved constructively, for example by taking:

$$\pi(a \mid s) = \left\{ \begin{matrix} 1 & a = \argmax_{a'} q(s,\ a' \mid v) \\[3mm] 0 & a \ne \argmax_{a'} q(s,\ a' \mid v) \end{matrix} \right.$$

For any stationary Markovian stochastic policy $\delta \in \mathcal{D}^{\mathrm{A}}$:

$$
\begin{aligned}
\mathscr{L}_{\delta} \{ v \} &= \sum_{a \in \mathcal{A}} q(s,\ a \mid v) \delta(a \mid s) \le \sum_{a \in \mathcal{A}} \delta(a \mid s) \max_{a' \in \mathcal{A}} q(s,\ a' \mid v) \\[7mm]
&= \sum_{a \in \mathcal{A}} \delta(a \mid s) \mathscr{L} \{ v \} = \mathscr{L} \{ v \} = \max_{\pi \in \mathcal{D}} \mathscr{L}_{\pi} \{ v \}
\end{aligned}
$$

Hence $\max_{\delta \in \mathcal{D}^{\mathrm{A}}} \mathscr{L}_{\delta} \{ v \} \le \max_{\pi \in \mathcal{D}} \mathscr{L}_{\pi} \{ v \}$; at the same time, $\mathcal{D} \subset \mathcal{D}^{\mathrm{A}}$ gives $\max_{\delta \in \mathcal{D}^{\mathrm{A}}} \mathscr{L}_{\delta} \{ v \} \ge \max_{\pi \in \mathcal{D}} \mathscr{L}_{\pi} \{ v \}$, and therefore:

$$\mathscr{L} \{ v \} = \max_{\pi \in \mathcal{D}} \mathscr{L}_{\pi} \{ v \} = \max_{\delta \in \mathcal{D}^{\mathrm{A}}} \mathscr{L}_{\delta} \{ v \}$$

  2. Show that $v = \mathscr{L} \{ v \} \Rightarrow v = v^{\star}$

Consider any $v \in \mathcal{V}$ with $v \ge \mathscr{L} \{ v \}$ and any $\pi = (\pi_{0},\ \pi_{1},\ \cdots) \in \Pi^{\mathrm{MA}}$, where at each time step $\pi_{t} \in \mathcal{D}^{\mathrm{A}}$; then:

$$v \ge \mathscr{L} \{ v \} = \max_{\delta \in \mathcal{D}^{\mathrm{A}}} \mathscr{L}_{\delta} \{ v \} \ge \mathscr{L}_{\pi_{t}} \{ v \}$$

Since the operator $\mathscr{L}_{\pi}$ preserves the partial order between value functions, the inequalities can be nested layer by layer:

$$
\begin{aligned}
v &\ge \mathscr{L}_{\pi_{0}} \{ v \} \ge \mathscr{L}_{\pi_{0}} \{ \mathscr{L}_{\pi_{1}} \{ v \} \} \ge \cdots \ge \mathscr{L}_{\pi_{0}} \mathscr{L}_{\pi_{1}} \cdots \mathscr{L}_{\pi_{\mathrm{T}}} \{ v \} \\[5mm]
&= \mathcal{R}_{\pi_{0}} + \gamma \mathcal{P}_{\pi_{0}} \mathcal{R}_{\pi_{1}} + \cdots + \gamma^{\mathrm{T}} \prod_{t = 0}^{\mathrm{T} - 1} \mathcal{P}_{\pi_{t}} \mathcal{R}_{\pi_{\mathrm{T}}} + \gamma^{\mathrm{T} + 1} \prod_{t = 0}^{\mathrm{T}} \mathcal{P}_{\pi_{t}} v
\end{aligned}
$$

Therefore, for every $\mathrm{T}$:

$$
\begin{aligned}
v - v_{\pi} &= v - \left[ \mathcal{R}_{\pi_{0}} + \gamma \mathcal{P}_{\pi_{0}} \mathcal{R}_{\pi_{1}} + \cdots + \gamma^{\mathrm{T}} \prod_{t = 0}^{\mathrm{T} - 1} \mathcal{P}_{\pi_{t}} \mathcal{R}_{\pi_{\mathrm{T}}} + \cdots \right] \\[7mm]
&\ge \gamma^{\mathrm{T} + 1} \prod_{t = 0}^{\mathrm{T}} \mathcal{P}_{\pi_{t}} v - \sum_{\tau = \mathrm{T}}^{\infty} \gamma^{\tau + 1} \prod_{t = 0}^{\tau} \mathcal{P}_{\pi_{t}} \mathcal{R}_{\pi_{\tau + 1}}
\end{aligned}
$$

Bounding the norms of the two terms on the right-hand side of this inequality:

$$
\begin{gathered}
\left\| \gamma^{\mathrm{T} + 1} \prod_{t = 0}^{\mathrm{T}} \mathcal{P}_{\pi_{t}} v \right\| \le \gamma^{\mathrm{T} + 1} \prod_{t = 0}^{\mathrm{T}} \| \mathcal{P}_{\pi_{t}} \| \cdot \| v \| \le \gamma^{\mathrm{T} + 1} \| v \| \\[7mm]
\left\| \sum_{\tau = \mathrm{T}}^{\infty} \gamma^{\tau + 1} \prod_{t = 0}^{\tau} \mathcal{P}_{\pi_{t}} \mathcal{R}_{\pi_{\tau + 1}} \right\| \le \sum_{\tau = \mathrm{T}}^{\infty} \gamma^{\tau + 1} \left\| \prod_{t = 0}^{\tau} \mathcal{P}_{\pi_{t}} \mathcal{R}_{\pi_{\tau + 1}} \right\| \le \sum_{\tau = \mathrm{T}}^{\infty} \gamma^{\tau + 1} \prod_{t = 0}^{\tau} \| \mathcal{P}_{\pi_{t}} \| \cdot \| \mathcal{R}_{\pi_{\tau + 1}} \| \le \sum_{\tau = \mathrm{T}}^{\infty} \gamma^{\tau + 1} \max_{s,\ a} | \mathcal{R}(s,\ a) |
\end{gathered}
$$

As $\mathrm{T} \to \infty$ both bounds tend to 0, so $\forall\ \pi \in \Pi^{\mathrm{MA}} : v \ge v_{\pi}$, and hence:

$$v \ge \max_{\pi \in \Pi^{\mathrm{MA}}} v_{\pi} = \max_{\pi \in \Pi^{\mathrm{HA}}} v_{\pi} = v^{\star}$$

A similar argument shows that $v \le \mathscr{L} \{ v \} \Rightarrow v \le v^{\star}$; combining the two statements yields $v = \mathscr{L} \{ v \} \Rightarrow v = v^{\star}$.

  3. Show that the equation $v = \mathscr{L}(v)$ has a unique solution

Define the metric $\rho : \mathcal{V} \times \mathcal{V} \mapsto \mathbb{R},\ \rho(v_{1},\ v_{2}) = \max_{s \in \mathcal{S}} | v_{1}(s) - v_{2}(s) |$; then $\langle \mathcal{V},\ \rho \rangle$ is a complete metric space. It therefore suffices to show that $\mathscr{L}$ is a contraction mapping on this space (Banach fixed-point theorem). For $v,\ u \in \mathcal{V}$ and $s \in \mathcal{S}$, let:

$$\hat{a} \in \argmax_{a \in \mathcal{A}} q(s,\ a \mid v) = \argmax_{a \in \mathcal{A}} \left\{ \mathcal{R}(s,\ a) + \gamma \sum_{s' \in \mathcal{S}} v(s') p(s' \mid s,\ a) \right\}$$

Without loss of generality, assume $\mathscr{L} \{ v \}(s) \ge \mathscr{L} \{ u \}(s)$; then:

$$
\begin{aligned}
| \mathscr{L} \{ v \}(s) - \mathscr{L} \{ u \}(s) | &= \mathscr{L} \{ v \}(s) - \mathscr{L} \{ u \}(s) \le q(s,\ \hat{a} \mid v) - q(s,\ \hat{a} \mid u) \\[5mm]
&= \gamma \sum_{s' \in \mathcal{S}} \bigg[ v(s') - u(s') \bigg] p(s' \mid s,\ \hat{a}) \le \gamma \sum_{s' \in \mathcal{S}} \bigg| v(s') - u(s') \bigg| p(s' \mid s,\ \hat{a}) \\[7mm]
&\le \gamma \sum_{s' \in \mathcal{S}} \rho(v,\ u) p(s' \mid s,\ \hat{a}) = \gamma \rho(v,\ u) \\[7mm]
\rho(\mathscr{L} \{ v \},\ \mathscr{L} \{ u \}) &= \max_{s \in \mathcal{S}} | \mathscr{L} \{ v \}(s) - \mathscr{L} \{ u \}(s) | \le \gamma \rho(v,\ u)
\end{aligned}
$$

Hence the operator $\mathscr{L}$ is a contraction mapping on $\langle \mathcal{V},\ \rho \rangle$, so the equation $v = \mathscr{L}(v)$ has a unique solution. Moreover, the constructive proof of the Banach fixed-point theorem also yields that solution, namely $v^{\star} = \lim_{k \to \infty} \mathscr{L}^{k} \{ v_{0} \}$.
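
This constructive form is exactly value iteration: start from any $v_{0}$ and apply $\mathscr{L}$ until the change falls below a tolerance; the contraction property guarantees geometric convergence. A minimal sketch under the same assumed array layouts (`P[a, s, s']`, `R[s, a]`), starting from $v_{0} = 0$:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# Hypothetical MDP (layouts assumed: P[a, s, s'], R[s, a]).
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))

def bellman_optimality(v):
    """L{v}(s) = max_a [ R(s, a) + gamma * sum_{s'} v(s') p(s' | s, a) ]."""
    q = R + gamma * np.einsum('asd,d->sa', P, v)   # q(s, a | v)
    return q.max(axis=1)

# v* = lim_k L^k{v0}; stop once successive iterates are within tolerance.
v = np.zeros(nS)
for k in range(10_000):
    v_next = bellman_optimality(v)
    if np.abs(v_next - v).max() < 1e-12:
        break
    v = v_next
v_star = v_next
print(k, v_star)
```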

  4. Show that there always exists a stationary Markovian deterministic policy $\pi \in \mathcal{D}$ that is optimal

Consider $\pi^{\star} \in \argmax_{\pi \in \mathcal{D}} \mathscr{L}_{\pi} \{ v^{\star} \}$; by the results above:

$$\mathscr{L}_{\pi^{\star}} \{ v^{\star} \} = \max_{\pi \in \mathcal{D}} \mathscr{L}_{\pi} \{ v^{\star} \} = \mathscr{L} \{ v^{\star} \} = v^{\star}$$

Since the Bellman expectation equation $v = \mathscr{L}_{\pi^{\star}} \{ v \}$ has a unique solution, $v_{\pi^{\star}} = v^{\star}$, i.e. $\pi^{\star}$ is an optimal policy, and it can be constructed as:

$$\pi^{\star}(a \mid s) = \left\{ \begin{matrix} 1 & a = \argmax_{a'} q(s,\ a' \mid v^{\star}) \\[3mm] 0 & a \ne \argmax_{a'} q(s,\ a' \mid v^{\star}) \end{matrix} \right.$$
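
Given $v^{\star}$, this greedy construction can be carried out directly. The sketch below recomputes $v^{\star}$ by value iteration on a hypothetical random MDP (same assumed layouts as above), extracts the greedy deterministic policy, and checks that its policy-evaluation solution indeed equals $v^{\star}$:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# Hypothetical MDP (layouts assumed: P[a, s, s'], R[s, a]).
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))

# Value iteration for v* (as in the previous sketch).
v = np.zeros(nS)
for _ in range(10_000):
    q = R + gamma * np.einsum('asd,d->sa', P, v)   # q(s, a | v)
    v_next = q.max(axis=1)
    if np.abs(v_next - v).max() < 1e-12:
        break
    v = v_next

# Greedy (deterministic) optimal policy: pi*(s) = argmax_a q(s, a | v*).
pi_star = q.argmax(axis=1)                          # one action index per state

# Evaluate pi* exactly and compare with v*: v_{pi*} should equal v*.
P_pi = P[pi_star, np.arange(nS)]                    # P_pi[s, s'] = p(s' | s, pi*(s))
R_pi = R[np.arange(nS), pi_star]
v_pi = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)
print(np.abs(v_pi - v_next).max())                  # ~0
```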

