Markov Decision Process


Basic Definitions

An MDP extends an MRP with actions. An MDP model $\mathscr{M}$ can be formally defined as the tuple $\langle \mathcal{S},\ \mathcal{A},\ \mathcal{P},\ \mathcal{R},\ \gamma \rangle$, whose components (sketched as arrays after this list) are:

  1. a finite state set $\mathcal{S} = \{ s_{1},\ s_{2},\ \cdots,\ s_{n} \}$
  2. a finite action set $\mathcal{A} = \{ a_{1},\ a_{2},\ \cdots,\ a_{m} \}$
  3. a state transition probability tensor $\mathcal{P} = \{ p_{ij}^{(a)} = p(S_{t + 1} = s_{i} \mid S_{t} = s_{j},\ A_{t} = a) \}_{m \times n \times n}$
  4. a reward function $\mathcal{R} : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R},\ \mathcal{R}(s,\ a) = \mathcal{E}(R_{t + 1} \mid S_{t} = s,\ A_{t} = a)$
  5. a discount factor $\gamma \in [0,\ 1)$
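
To make the tuple concrete, here is a minimal NumPy sketch of such a model; the array layout (`P[a, s, s']`, `R[s, a]`) and the toy numbers are illustrative assumptions, not part of the formal definition.

```python
import numpy as np

# A toy MDP with n = 2 states and m = 2 actions, stored as plain arrays.
# Assumed layout: P[a, s, s'] = p(S_{t+1} = s' | S_t = s, A_t = a),
#                 R[s, a]     = E(R_{t+1}   | S_t = s, A_t = a).
n_states, n_actions = 2, 2

P = np.array([
    [[0.7, 0.3],    # action 0, from state 0
     [0.4, 0.6]],   # action 0, from state 1
    [[0.2, 0.8],    # action 1, from state 0
     [0.9, 0.1]],   # action 1, from state 1
])
R = np.array([
    [1.0, 0.0],     # expected rewards in state 0 for actions 0 / 1
    [0.5, 2.0],     # expected rewards in state 1 for actions 0 / 1
])
gamma = 0.9         # discount factor in [0, 1)

# Every P[a, s, :] must be a probability distribution over next states.
assert np.allclose(P.sum(axis=2), 1.0)
```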

Decision making is essentially the process of selecting actions from historical information, and the decision policy $\pi_{t}$ can be modeled in several different classes:

| Policy | Deterministic | Stochastic |
| :-: | :-: | :-: |
| Markovian | $\pi_{t} : s_{t} \in \mathcal{S} \mapsto a_{t} \in \mathcal{A}$ | $\pi_{t} : \langle a_{t} \mid s_{t} \rangle \in \mathcal{S} \times \mathcal{A} \mapsto p(a_{t} \mid s_{t}) \in [0,\ 1]$ |
| History-based | $\pi_{t} : h_{t} \in \mathcal{H}(t) = \mathcal{S}^{t} \times \mathcal{A}^{t - 1} \mapsto a_{t} \in \mathcal{A}$ | $\pi_{t} : \langle a_{t} \mid h_{t} \rangle \in \mathcal{H}(t) \times \mathcal{A} \mapsto p(a_{t} \mid h_{t}) \in [0,\ 1]$ |

The policy spaces of these classes are nested: a deterministic policy is a special case of a stochastic one, and a Markovian policy is a special case of a history-based one, so in particular $\Pi^{\mathrm{MA}} \subseteq \Pi^{\mathrm{HA}}$.

A stationary policy is one that does not vary with time, i.e. $\forall\ t,\ \pi_{t} = \pi$. Among stationary policies, the stationary Markovian deterministic policies are the most widely studied:

$$\pi \in \mathcal{D} : s \in \mathcal{S} \mapsto a \in \mathcal{A}$$

Additionally, a stationary Markovian stochastic policy is defined as:

$$\pi \in \mathcal{D}^{\mathrm{A}} : \langle a \mid s \rangle \in \mathcal{S} \times \mathcal{A} \mapsto p(a \mid s) \in [0,\ 1]$$

As in the MRP case, the state-value function under a policy $\pi$ is defined as the expected return obtained from being in state $s$ at time $t$:

$$v_{\pi}(s) = \mathcal{E}(G_{t} \mid S_{t} = s)$$

The action-value function under a policy $\pi$ is defined as the expected return obtained from being in state $s$ at time $t$ and taking action $a$:

$$q(s,\ a \mid v_{\pi}) = q_{\pi}(s,\ a) = \mathcal{E}(G_{t} \mid S_{t} = s,\ A_{t} = a)$$
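
Both definitions can be checked empirically by averaging sampled returns. The sketch below is a minimal Monte Carlo estimate of $v_{\pi}(s)$ and $q_{\pi}(s,\ a)$ for a hypothetical random MDP and stationary stochastic policy; the array layouts `P[a, s, s']`, `R[s, a]`, `pi[s, a]` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9

# Hypothetical MDP and stationary stochastic policy pi(a | s) = pi[s, a].
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)

def sample_return(s, a=None, horizon=100):
    """One truncated discounted return G_t starting from s (optionally forcing a_t = a)."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        if a is None:
            a = rng.choice(nA, p=pi[s])         # a_t ~ pi(. | s_t)
        g += discount * R[s, a]                 # accumulate gamma^tau * R(s, a)
        discount *= gamma
        s = rng.choice(nS, p=P[a, s])           # s_{t+1} ~ p(. | s_t, a_t)
        a = None                                # only the first action is forced
    return g

# Noisy Monte Carlo estimates of v_pi(s=0) and q_pi(s=0, a=1).
v_hat = np.mean([sample_return(0) for _ in range(1000)])
q_hat = np.mean([sample_return(0, 1) for _ in range(1000)])
print(v_hat, q_hat)
```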

Policy Equivalence

Since the state transitions of an MDP are Markovian, for any $\pi \in \Pi^{\mathrm{HA}}$ one can construct a $\pi' \in \Pi^{\mathrm{MA}}$ that attains the same value function:

$$\pi'(a_{t} \mid s_{t}) = p_{\pi'}(a_{t} \mid s_{t}) = p_{\pi}(a_{t} \mid s_{t},\ s_{0})$$

The Markovian stochastic policy $\pi'$ constructed in this way satisfies $p_{\pi'}(s_{t},\ a_{t} \mid s_{0}) = p_{\pi}(s_{t},\ a_{t} \mid s_{0})$, which can be proved by induction:

$$p_{\pi'}(s_{0},\ a_{0} \mid s_{0}) = p_{\pi'}(a_{0} \mid s_{0}) = p_{\pi}(a_{0} \mid s_{0},\ s_{0}) = p_{\pi}(s_{0},\ a_{0} \mid s_{0})$$

t=0t = 0 时等式成立的基础上,假设等式在 t1t - 1 时成立,可得:

$$
\begin{aligned}
p_{\pi}(s_{t} \mid s_{0}) &= \sum_{s_{t - 1}} \sum_{a_{t - 1}} p_{\pi}(s_{t - 1},\ a_{t - 1} \mid s_{0}) p_{\pi}(s_{t} \mid s_{t - 1},\ a_{t - 1},\ s_{0}) \\[7mm]
&= \sum_{s_{t - 1}} \sum_{a_{t - 1}} p_{\pi'}(s_{t - 1},\ a_{t - 1} \mid s_{0}) p(s_{t} \mid s_{t - 1},\ a_{t - 1}) \\[7mm]
&= \sum_{s_{t - 1}} \sum_{a_{t - 1}} p_{\pi'}(s_{t - 1},\ a_{t - 1} \mid s_{0}) p_{\pi'}(s_{t} \mid s_{t - 1},\ a_{t - 1},\ s_{0}) = p_{\pi'}(s_{t} \mid s_{0})
\end{aligned}
$$

which in turn shows that the equality also holds at $t$:

$$p_{\pi'}(s_{t},\ a_{t} \mid s_{0}) = \underset{p_{\pi'}(a_{t} \mid s_{t})}{\underbrace{p_{\pi'}(a_{t} \mid s_{t},\ s_{0})}} \underset{p_{\pi}(s_{t} \mid s_{0})}{\underbrace{p_{\pi'}(s_{t} \mid s_{0})}} = p_{\pi}(a_{t} \mid s_{t},\ s_{0}) p_{\pi}(s_{t} \mid s_{0}) = p_{\pi}(s_{t},\ a_{t} \mid s_{0})$$

Note that although the value functions under different policies take different forms, each is a linear combination of terms of the form $\mathcal{E}(R_{t + 1 + \Delta t} \mid S_{t} = s_{t})$. Fixing the conditioning time at $0$, each such term satisfies:

$$
\begin{aligned}
\mathcal{E}_{\pi}(R_{t + 1} \mid S_{0} = s_{0}) &= \sum_{s_{t}} \sum_{a_{t}} \mathcal{E}_{\pi}(R_{t + 1} \mid S_{t} = s_{t},\ A_{t} = a_{t},\ \xcancel{S_{0} = s_{0}}) p_{\pi}(s_{t},\ a_{t} \mid s_{0}) \\[7mm]
&= \sum_{s_{t}} \sum_{a_{t}} \mathcal{R}(s_{t},\ a_{t}) p_{\pi}(s_{t},\ a_{t} \mid s_{0})
\end{aligned}
$$

That is, given the initial state (or an initial state distribution), any policies sharing the same $p(s_{t},\ a_{t} \mid s_{0})$ have the same value function. In other words, $\pi \in \Pi^{\mathrm{HA}}$ and $\pi' \in \Pi^{\mathrm{MA}}$ are equivalent in terms of the state values they attain, so only Markovian stochastic policies are considered from here on.
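
This equivalence can be checked numerically. The sketch below is illustrative only: it assumes a small random MDP with layout `P[a, s, s']` and a contrived history-dependent policy `pi_hist`, enumerates all short trajectories to obtain $p_{\pi}(s_{t},\ a_{t} \mid s_{0})$, builds the Markovian policy $\pi'_{t}(a \mid s) = p_{\pi}(a_{t} \mid s_{t},\ s_{0})$, and verifies that it induces the same joint probabilities.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
nS, nA, T, s0 = 3, 2, 4, 0

# Hypothetical dynamics with layout P[a, s, s'] = p(s' | s, a).
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)

def pi_hist(history):
    """A deliberately history-dependent stochastic policy pi(a_t | h_t), nA = 2."""
    h = abs(hash(history)) % 97
    w = np.array([1.0 + h % 7, 1.0 + h % 5])
    return w / w.sum()

# p_pi(s_t, a_t | s_0) by exhaustive enumeration of length-T trajectories.
joint_pi = np.zeros((T, nS, nA))
for acts in itertools.product(range(nA), repeat=T):
    for rest in itertools.product(range(nS), repeat=T - 1):
        states = (s0,) + rest
        hist, prob = (s0,), 1.0
        for t in range(T):
            prob *= pi_hist(hist)[acts[t]]                   # pi(a_t | h_t)
            if t < T - 1:
                prob *= P[acts[t], states[t], states[t + 1]] # p(s_{t+1} | s_t, a_t)
                hist += (acts[t], states[t + 1])
        for t in range(T):
            joint_pi[t, states[t], acts[t]] += prob          # marginalize the rest

# Construct the (non-stationary) Markov policy pi'_t(a | s) = p_pi(a_t | s_t, s_0).
marg_s = joint_pi.sum(axis=2, keepdims=True)                 # p_pi(s_t | s_0)
pi_markov = np.where(marg_s > 0, joint_pi / np.maximum(marg_s, 1e-300), 1.0 / nA)

# p_pi'(s_t, a_t | s_0) by forward recursion under pi'.
joint_markov = np.zeros_like(joint_pi)
ps = np.zeros(nS)
ps[s0] = 1.0
for t in range(T):
    joint_markov[t] = ps[:, None] * pi_markov[t]
    ps = np.einsum('sa,asd->d', joint_markov[t], P)          # p(s_{t+1} | s_0)

print(np.abs(joint_pi - joint_markov).max())                 # ~0: same occupancies
```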

Bellman Expectation Equation

Starting from a history-based policy $a_{t} \sim \pi(\cdot \mid h_{t})$, expand the action-value function into nested expectations:

$$
\begin{aligned}
q_{\pi}(s_{t},\ a_{t}) &= \mathcal{E}(G_{t} \mid S_{t} = s_{t},\ A_{t} = a_{t}) = \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{r_{t + 1} \sim r(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid h_{t + 1})} \mathcal{E}_{r_{t + 2} \sim r(\cdot \mid s_{t + 1},\ a_{t + 1})} \cdots \left[ \sum_{\tau = 0}^{\infty} \gamma^{\tau} r_{t + \tau + 1} \right] \\[7mm]
&= \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{r_{t + 1} \sim r(\cdot \mid s_{t},\ a_{t})} \Big[ r_{t + 1} \Big] + \gamma \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid h_{t + 1})} \mathcal{E}_{r_{t + 2} \sim r(\cdot \mid s_{t + 1},\ a_{t + 1})} \Big[ r_{t + 2} \Big] + \cdots \\[7mm]
&= \mathcal{R}(s_{t},\ a_{t}) + \gamma \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid h_{t + 1})} \Big[ \mathcal{R}(s_{t + 1},\ a_{t + 1}) \Big] + \cdots \\[7mm]
&= \sum_{\tau = 0}^{\infty} \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid h_{t + 1})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1},\ a_{t + 1})} \mathcal{E}_{a_{t + 2} \sim \pi(\cdot \mid h_{t + 2})} \cdots \Big[ \gamma^{\tau} \mathcal{R}(s_{t + \tau},\ a_{t + \tau}) \Big] \\[7mm]
&= \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid h_{t + 1})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1},\ a_{t + 1})} \mathcal{E}_{a_{t + 2} \sim \pi(\cdot \mid h_{t + 2})} \cdots \left[ \sum_{\tau = 0}^{\infty} \gamma^{\tau} \mathcal{R}(s_{t + \tau},\ a_{t + \tau}) \right]
\end{aligned}
$$

Similarly, the state-value function expands as:

$$
\begin{aligned}
v_{\pi}(s_{t}) &= \mathcal{E}(G_{t} \mid S_{t} = s_{t}) = \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{a_{t} \sim \pi(\cdot \mid h_{t})} \mathcal{E}_{r_{t + 1} \sim r(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid h_{t + 1})} \mathcal{E}_{r_{t + 2} \sim r(\cdot \mid s_{t + 1},\ a_{t + 1})} \cdots \left[ \sum_{\tau = 0}^{\infty} \gamma^{\tau} r_{t + \tau + 1} \right] \\[7mm]
&= \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{a_{t} \sim \pi(\cdot \mid h_{t})} \mathcal{E}_{r_{t + 1} \sim r(\cdot \mid s_{t},\ a_{t})} \Big[ r_{t + 1} \Big] + \gamma \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{a_{t} \sim \pi(\cdot \mid h_{t})} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid h_{t + 1})} \mathcal{E}_{r_{t + 2} \sim r(\cdot \mid s_{t + 1},\ a_{t + 1})} \Big[ r_{t + 2} \Big] + \cdots \\[7mm]
&= \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{a_{t} \sim \pi(\cdot \mid h_{t})} \Big[ \mathcal{R}(s_{t},\ a_{t}) \Big] + \gamma \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{a_{t} \sim \pi(\cdot \mid h_{t})} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid h_{t + 1})} \Big[ \mathcal{R}(s_{t + 1},\ a_{t + 1}) \Big] + \cdots \\[7mm]
&= \sum_{\tau = 0}^{\infty} \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{a_{t} \sim \pi(\cdot \mid h_{t})} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid h_{t + 1})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1},\ a_{t + 1})} \mathcal{E}_{a_{t + 2} \sim \pi(\cdot \mid h_{t + 2})} \cdots \Big[ \gamma^{\tau} \mathcal{R}(s_{t + \tau},\ a_{t + \tau}) \Big] \\[7mm]
&= \mathcal{E}_{h_{t} \setminus s_{t}} \mathcal{E}_{a_{t} \sim \pi(\cdot \mid h_{t})} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid h_{t + 1})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1},\ a_{t + 1})} \mathcal{E}_{a_{t + 2} \sim \pi(\cdot \mid h_{t + 2})} \cdots \Big[ \sum_{\tau = 0}^{\infty} \gamma^{\tau} \mathcal{R}(s_{t + \tau},\ a_{t + \tau}) \Big]
\end{aligned}
$$

Under a stationary Markovian stochastic policy $a_{t} \sim \pi(\cdot \mid s_{t})$ with $\pi \in \mathcal{D}^{\mathrm{A}}$, the action- and state-value functions simplify to:

$$
\begin{gathered}
q_{\pi}(s_{t},\ a_{t}) = \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid s_{t + 1})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1},\ a_{t + 1})} \mathcal{E}_{a_{t + 2} \sim \pi(\cdot \mid s_{t + 2})} \cdots \left[ \sum_{\tau = 0}^{\infty} \gamma^{\tau} \mathcal{R}(s_{t + \tau},\ a_{t + \tau}) \right] \\[7mm]
v_{\pi}(s_{t}) = \mathcal{E}_{a_{t} \sim \pi(\cdot \mid s_{t})} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid s_{t + 1})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1},\ a_{t + 1})} \mathcal{E}_{a_{t + 2} \sim \pi(\cdot \mid s_{t + 2})} \cdots \Big[ \sum_{\tau = 0}^{\infty} \gamma^{\tau} \mathcal{R}(s_{t + \tau},\ a_{t + \tau}) \Big]
\end{gathered}
$$

Combining the two definitions, the action-value function can be expanded recursively as:

$$
\begin{aligned}
q_{\pi}(s_{t},\ a_{t}) =\ &\mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid s_{t + 1})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1},\ a_{t + 1})} \mathcal{E}_{a_{t + 2} \sim \pi(\cdot \mid s_{t + 2})} \cdots \Big[ \mathcal{R}(s_{t},\ a_{t}) \Big]\ + \\[7mm]
&\gamma \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \underset{v_{\pi}(s_{t + 1})}{\underbrace{\mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid s_{t + 1})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1},\ a_{t + 1})} \mathcal{E}_{a_{t + 2} \sim \pi(\cdot \mid s_{t + 2})} \cdots \left[ \sum_{\tau = 0}^{\infty} \gamma^{\tau} \mathcal{R}(s_{(t + 1) + \tau},\ a_{(t + 1) + \tau}) \right]}} \\[10mm]
=\ &\mathcal{R}(s_{t},\ a_{t}) + \gamma \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \Big[ v_{\pi}(s_{t + 1}) \Big] = \mathcal{R}(s_{t},\ a_{t}) + \gamma \sum_{s_{t + 1}} v_{\pi}(s_{t + 1}) p(s_{t + 1} \mid s_{t},\ a_{t})
\end{aligned}
$$

The state-value function can likewise be expanded recursively as:

$$
\begin{aligned}
v_{\pi}(s_{t}) &= \mathcal{E}_{a_{t} \sim \pi(\cdot \mid s_{t})} \underset{q_{\pi}(s_{t},\ a_{t})}{\underbrace{\mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \mathcal{E}_{a_{t + 1} \sim \pi(\cdot \mid s_{t + 1})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1},\ a_{t + 1})} \mathcal{E}_{a_{t + 2} \sim \pi(\cdot \mid s_{t + 2})} \cdots \Big[ \sum_{\tau = 0}^{\infty} \gamma^{\tau} \mathcal{R}(s_{t + \tau},\ a_{t + \tau}) \Big]}} \\[10mm]
&= \mathcal{E}_{a_{t} \sim \pi(\cdot \mid s_{t})} q_{\pi}(s_{t},\ a_{t}) = \sum_{a_{t}} q_{\pi}(s_{t},\ a_{t}) \pi(a_{t} \mid s_{t})
\end{aligned}
$$

These equations define the Bellman expectation operator $\mathscr{L}_{\pi} : v \in \mathcal{V} = \mathbb{R}^{n} \mapsto \mathcal{V}$:

$$v = \mathscr{L}_{\pi} \{ v \} = \sum_{a} q(s,\ a \mid v) \pi(a \mid s) = \sum_{a} \mathcal{R}(s,\ a) \pi(a \mid s) + \gamma \sum_{s'} v(s') \sum_{a} p(s' \mid s,\ a) \pi(a \mid s) = \mathcal{R}_{\pi} + \gamma \mathcal{P}_{\pi} v$$
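
In this matrix form, $v_{\pi}$ is the fixed point of $\mathscr{L}_{\pi}$ and can be computed either by iterating the operator or by solving the linear system $(I - \gamma \mathcal{P}_{\pi}) v = \mathcal{R}_{\pi}$. A minimal sketch, assuming array layouts `P[a, s, s']`, `R[s, a]` and a stationary stochastic policy `pi[s, a]`:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# Hypothetical MDP and policy (layouts assumed: P[a, s, s'], R[s, a], pi[s, a]).
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)

# R_pi(s) = sum_a R(s, a) pi(a | s),  P_pi(s, s') = sum_a p(s' | s, a) pi(a | s)
R_pi = (R * pi).sum(axis=1)
P_pi = np.einsum('sa,asd->sd', pi, P)

# 1) Fixed-point iteration: v <- L_pi{v} = R_pi + gamma * P_pi v
v = np.zeros(nS)
for _ in range(10_000):
    v, v_old = R_pi + gamma * P_pi @ v, v
    if np.abs(v - v_old).max() < 1e-12:
        break

# 2) Direct solve of (I - gamma * P_pi) v = R_pi
v_direct = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)
print(np.abs(v - v_direct).max())   # the two agree up to the stopping tolerance
```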

Note that the operator $\mathscr{L}_{\pi}$ preserves the partial order on value functions, i.e. whenever $v_{1} \le v_{2}$:

$$\mathscr{L}_{\pi} \{ v_{1} \} = \mathcal{R}_{\pi} + \gamma \mathcal{P}_{\pi} v_{1} \le \mathcal{R}_{\pi} + \gamma \mathcal{P}_{\pi} v_{2} = \mathscr{L}_{\pi} \{ v_{2} \}$$

Bellman Optimality Equation

By the equivalence between $\Pi^{\mathrm{HA}}$ and $\Pi^{\mathrm{MA}}$, the optimal state-value function $v^{\star}$ is defined as:

$$\forall\ s \in \mathcal{S} : v^{\star}(s) = \max_{\pi \in \Pi^{\mathrm{HA}}} v_{\pi}(s) = \max_{\pi \in \Pi^{\mathrm{MA}}} v_{\pi}(s)$$

A policy $\pi^{\star}$ is optimal if and only if $v_{\pi^{\star}} = v^{\star}$. Before solving for the optimal state-value function $v^{\star}$ and an optimal policy $\pi^{\star}$, first define the operator $\mathscr{L}$:

$$\mathscr{L} \{ v \} : \mathcal{V} \mapsto \mathcal{V} = \max_{a \in \mathcal{A}} q(s,\ a \mid v) = \max_{a \in \mathcal{A}} \left\{ \mathcal{R}(s,\ a) + \gamma \sum_{s' \in \mathcal{S}} v(s') p(s' \mid s,\ a) \right\}$$

The Bellman optimality equation $v = \mathscr{L}(v)$ has a unique solution, which is the optimal state-value function $v^{\star}$; the proof, together with the construction of $\pi^{\star}$, proceeds as follows:

  1. Show that $\mathscr{L} \{ v \} = \max_{\pi \in \mathcal{D}} \mathscr{L}_{\pi} \{ v \} = \max_{\delta \in \mathcal{D}^{\mathrm{A}}} \mathscr{L}_{\delta} \{ v \}$

Consider first the space $\mathcal{D}$ of stationary Markovian deterministic policies:

$$\max_{\pi \in \mathcal{D}} \mathscr{L}_{\pi} \{ v \} = \max_{\pi \in \mathcal{D}} \sum_{a \in \mathcal{A}} q(s,\ a \mid v) \pi(a \mid s) = \max_{a \in \mathcal{A}} q(s,\ a \mid v) = \mathscr{L} \{ v \}$$

This claim is easily proved constructively, for example by taking:

$$\pi(a \mid s) = \left\{ \begin{matrix} 1 & a = \argmax_{a'} q(s,\ a' \mid v) \\[3mm] 0 & a \ne \argmax_{a'} q(s,\ a' \mid v) \end{matrix} \right.$$

For any stationary Markovian stochastic policy $\delta \in \mathcal{D}^{\mathrm{A}}$:

$$
\begin{aligned}
\mathscr{L}_{\delta} \{ v \} &= \sum_{a \in \mathcal{A}} q(s,\ a \mid v) \delta(a \mid s) \le \sum_{a \in \mathcal{A}} \delta(a \mid s) \max_{a' \in \mathcal{A}} q(s,\ a' \mid v) \\[7mm]
&= \sum_{a \in \mathcal{A}} \delta(a \mid s) \mathscr{L} \{ v \} = \mathscr{L} \{ v \} = \max_{\pi \in \mathcal{D}} \mathscr{L}_{\pi} \{ v \}
\end{aligned}
$$

Hence $\max_{\delta \in \mathcal{D}^{\mathrm{A}}} \mathscr{L}_{\delta} \{ v \} \le \max_{\pi \in \mathcal{D}} \mathscr{L}_{\pi} \{ v \}$; at the same time, $\mathcal{D} \subset \mathcal{D}^{\mathrm{A}}$ gives $\max_{\delta \in \mathcal{D}^{\mathrm{A}}} \mathscr{L}_{\delta} \{ v \} \ge \max_{\pi \in \mathcal{D}} \mathscr{L}_{\pi} \{ v \}$, and therefore:

$$\mathscr{L} \{ v \} = \max_{\pi \in \mathcal{D}} \mathscr{L}_{\pi} \{ v \} = \max_{\delta \in \mathcal{D}^{\mathrm{A}}} \mathscr{L}_{\delta} \{ v \}$$

  2. Show that $v = \mathscr{L} \{ v \} \Rightarrow v = v^{\star}$

Consider any $v \in \mathcal{V}$ with $v \ge \mathscr{L} \{ v \}$ and any $\pi = (\pi_{0},\ \pi_{1},\ \cdots) \in \Pi^{\mathrm{MA}}$, where at each time step $\pi_{t} \in \mathcal{D}^{\mathrm{A}}$; then:

$$v \ge \mathscr{L} \{ v \} = \max_{\delta \in \mathcal{D}^{\mathrm{A}}} \mathscr{L}_{\delta} \{ v \} \ge \mathscr{L}_{\pi_{t}} \{ v \}$$

Since the operator $\mathscr{L}_{\pi}$ preserves the partial order between value functions, the inequalities can be nested layer by layer:

$$
\begin{aligned}
v &\ge \mathscr{L}_{\pi_{0}} \{ v \} \ge \mathscr{L}_{\pi_{0}} \{ \mathscr{L}_{\pi_{1}} \{ v \} \} \ge \cdots \ge \mathscr{L}_{\pi_{0}} \mathscr{L}_{\pi_{1}} \cdots \mathscr{L}_{\pi_{\mathrm{T}}} \{ v \} \\[5mm]
&= \mathcal{R}_{\pi_{0}} + \gamma \mathcal{P}_{\pi_{0}} \mathcal{R}_{\pi_{1}} + \cdots + \gamma^{\mathrm{T}} \prod_{t = 0}^{\mathrm{T} - 1} \mathcal{P}_{\pi_{t}} \mathcal{R}_{\pi_{\mathrm{T}}} + \gamma^{\mathrm{T} + 1} \prod_{t = 0}^{\mathrm{T}} \mathcal{P}_{\pi_{t}} v
\end{aligned}
$$

Therefore, for every $\mathrm{T}$:

$$
\begin{aligned}
v - v_{\pi} &= v - \left[ \mathcal{R}_{\pi_{0}} + \gamma \mathcal{P}_{\pi_{0}} \mathcal{R}_{\pi_{1}} + \cdots + \gamma^{\mathrm{T}} \prod_{t = 0}^{\mathrm{T} - 1} \mathcal{P}_{\pi_{t}} \mathcal{R}_{\pi_{\mathrm{T}}} + \cdots \right] \\[7mm]
&\ge \gamma^{\mathrm{T} + 1} \prod_{t = 0}^{\mathrm{T}} \mathcal{P}_{\pi_{t}} v - \sum_{\tau = \mathrm{T}}^{\infty} \gamma^{\tau + 1} \prod_{t = 0}^{\tau} \mathcal{P}_{\pi_{t}} \mathcal{R}_{\pi_{\tau + 1}}
\end{aligned}
$$

Bounding the norms of the two terms on the right-hand side of this inequality:

$$
\begin{gathered}
\left\| \gamma^{\mathrm{T} + 1} \prod_{t = 0}^{\mathrm{T}} \mathcal{P}_{\pi_{t}} v \right\| \le \gamma^{\mathrm{T} + 1} \prod_{t = 0}^{\mathrm{T}} \| \mathcal{P}_{\pi_{t}} \| \cdot \| v \| \le \gamma^{\mathrm{T} + 1} \| v \| \\[7mm]
\left\| \sum_{\tau = \mathrm{T}}^{\infty} \gamma^{\tau + 1} \prod_{t = 0}^{\tau} \mathcal{P}_{\pi_{t}} \mathcal{R}_{\pi_{\tau + 1}} \right\| \le \sum_{\tau = \mathrm{T}}^{\infty} \gamma^{\tau + 1} \left\| \prod_{t = 0}^{\tau} \mathcal{P}_{\pi_{t}} \mathcal{R}_{\pi_{\tau + 1}} \right\| \le \sum_{\tau = \mathrm{T}}^{\infty} \gamma^{\tau + 1} \prod_{t = 0}^{\tau} \| \mathcal{P}_{\pi_{t}} \| \cdot \| \mathcal{R}_{\pi_{\tau + 1}} \| \le \sum_{\tau = \mathrm{T}}^{\infty} \gamma^{\tau + 1} \max_{s,\ a} | \mathcal{R}(s,\ a) |
\end{gathered}
$$

As $\mathrm{T} \to \infty$ both bounds tend to 0, so $\forall\ \pi \in \Pi^{\mathrm{MA}} : v \ge v_{\pi}$, and hence:

$$v \ge \max_{\pi \in \Pi^{\mathrm{MA}}} v_{\pi} = \max_{\pi \in \Pi^{\mathrm{HA}}} v_{\pi} = v^{\star}$$

A similar argument shows that $v \le \mathscr{L} \{ v \} \Rightarrow v \le v^{\star}$; combining the two statements yields $v = \mathscr{L} \{ v \} \Rightarrow v = v^{\star}$.

  3. Show that the equation $v = \mathscr{L}(v)$ has a unique solution

Define the metric $\rho : \mathcal{V} \times \mathcal{V} \mapsto \mathbb{R},\ \rho(v_{1},\ v_{2}) = \max_{s \in \mathcal{S}} | v_{1}(s) - v_{2}(s) |$; then $\langle \mathcal{V},\ \rho \rangle$ is a complete metric space. It therefore suffices to show that $\mathscr{L}$ is a contraction mapping on this space (Banach fixed-point theorem). For $v,\ u \in \mathcal{V}$ and $s \in \mathcal{S}$, let:

$$\hat{a} \in \argmax_{a \in \mathcal{A}} q(s,\ a \mid v) = \argmax_{a \in \mathcal{A}} \left\{ \mathcal{R}(s,\ a) + \gamma \sum_{s' \in \mathcal{S}} v(s') p(s' \mid s,\ a) \right\}$$

Without loss of generality, assume $\mathscr{L} \{ v \}(s) \ge \mathscr{L} \{ u \}(s)$; then:

$$
\begin{aligned}
| \mathscr{L} \{ v \}(s) - \mathscr{L} \{ u \}(s) | &= \mathscr{L} \{ v \}(s) - \mathscr{L} \{ u \}(s) \le q(s,\ \hat{a} \mid v) - q(s,\ \hat{a} \mid u) \\[5mm]
&= \gamma \sum_{s' \in \mathcal{S}} \bigg[ v(s') - u(s') \bigg] p(s' \mid s,\ \hat{a}) \le \gamma \sum_{s' \in \mathcal{S}} \bigg| v(s') - u(s') \bigg| p(s' \mid s,\ \hat{a}) \\[7mm]
&\le \gamma \sum_{s' \in \mathcal{S}} \rho(v,\ u) p(s' \mid s,\ \hat{a}) = \gamma \rho(v,\ u) \\[7mm]
\rho(\mathscr{L} \{ v \},\ \mathscr{L} \{ u \}) &= \max_{s \in \mathcal{S}} | \mathscr{L} \{ v \}(s) - \mathscr{L} \{ u \}(s) | \le \gamma \rho(v,\ u)
\end{aligned}
$$

Hence the operator $\mathscr{L}$ is a contraction mapping on $\langle \mathcal{V},\ \rho \rangle$, so the equation $v = \mathscr{L}(v)$ has a unique solution. Moreover, the constructive proof of the Banach fixed-point theorem also yields that solution, namely $v^{\star} = \lim_{k \to \infty} \mathscr{L}^{k} \{ v_{0} \}$.
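
This constructive form is exactly value iteration: start from any $v_{0}$ and apply $\mathscr{L}$ until the change falls below a tolerance; the contraction property guarantees geometric convergence. A minimal sketch under the same assumed array layouts (`P[a, s, s']`, `R[s, a]`), starting from $v_{0} = 0$:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# Hypothetical MDP (layouts assumed: P[a, s, s'], R[s, a]).
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))

def bellman_optimality(v):
    """L{v}(s) = max_a [ R(s, a) + gamma * sum_{s'} v(s') p(s' | s, a) ]."""
    q = R + gamma * np.einsum('asd,d->sa', P, v)   # q(s, a | v)
    return q.max(axis=1)

# v* = lim_k L^k{v0}; stop once successive iterates are within tolerance.
v = np.zeros(nS)
for k in range(10_000):
    v_next = bellman_optimality(v)
    if np.abs(v_next - v).max() < 1e-12:
        break
    v = v_next
v_star = v_next
print(k, v_star)
```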

  4. Show that there always exists a stationary Markovian deterministic policy $\pi \in \mathcal{D}$ that is optimal

Consider $\pi^{\star} \in \argmax_{\pi \in \mathcal{D}} \mathscr{L}_{\pi} \{ v^{\star} \}$; by the results above:

$$\mathscr{L}_{\pi^{\star}} \{ v^{\star} \} = \max_{\pi \in \mathcal{D}} \mathscr{L}_{\pi} \{ v^{\star} \} = \mathscr{L} \{ v^{\star} \} = v^{\star}$$

Since the Bellman expectation equation $v = \mathscr{L}_{\pi^{\star}} \{ v \}$ has a unique solution, $v_{\pi^{\star}} = v^{\star}$, i.e. $\pi^{\star}$ is an optimal policy, and it can be constructed as:

$$\pi^{\star}(a \mid s) = \left\{ \begin{matrix} 1 & a = \argmax_{a'} q(s,\ a' \mid v^{\star}) \\[3mm] 0 & a \ne \argmax_{a'} q(s,\ a' \mid v^{\star}) \end{matrix} \right.$$
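
Given $v^{\star}$, this greedy construction can be carried out directly. The sketch below recomputes $v^{\star}$ by value iteration on a hypothetical random MDP (same assumed layouts as above), extracts the greedy deterministic policy, and checks that its policy-evaluation solution indeed equals $v^{\star}$:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# Hypothetical MDP (layouts assumed: P[a, s, s'], R[s, a]).
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))

# Value iteration for v* (as in the previous sketch).
v = np.zeros(nS)
for _ in range(10_000):
    q = R + gamma * np.einsum('asd,d->sa', P, v)   # q(s, a | v)
    v_next = q.max(axis=1)
    if np.abs(v_next - v).max() < 1e-12:
        break
    v = v_next

# Greedy (deterministic) optimal policy: pi*(s) = argmax_a q(s, a | v*).
pi_star = q.argmax(axis=1)                          # one action index per state

# Evaluate pi* exactly and compare with v*: v_{pi*} should equal v*.
P_pi = P[pi_star, np.arange(nS)]                    # P_pi[s, s'] = p(s' | s, pi*(s))
R_pi = R[np.arange(nS), pi_star]
v_pi = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)
print(np.abs(v_pi - v_next).max())                  # ~0
```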

