Markov Reward Process

Markov Reward Process (MRP)

Basic Definitions

An MRP model $\mathscr{M}$ can be formally defined as the tuple $\langle \mathcal{S},\ \mathcal{P},\ \mathcal{R},\ \gamma \rangle$:

  1. Finite state set $\mathcal{S} = \{ s_{1},\ s_{2},\ \cdots,\ s_{n} \}$
  2. State transition probability matrix $\mathcal{P} = \{ p_{ij} = p(S_{t + 1} = s_{i} \mid S_{t} = s_{j}) \}_{n \times n}$
  3. Reward function $\mathcal{R} : \mathcal{S} \mapsto \mathbb{R}$, with $\mathcal{R}(s) = \mathcal{E}(R_{t + 1} \mid S_{t} = s)$
  4. Discount factor $\gamma \in [0,\ 1)$

Here the state set $\mathcal{S}$ and the state transition probability matrix $\mathcal{P}$ are defined exactly as in an ordinary Markov process. The reward function $\mathcal{R}$ gives the expected reward for being in state $s$ at time $t$ (received at the next time step). Consider the return accumulated over a future horizon $\mathrm{T}$ when in state $s$ at time $t$:

$$G_{t}(\mathrm{T},\ 1) = R_{t + 1} + R_{t + 2} + \cdots + R_{t + \mathrm{T}}$$

Because near-term and long-term rewards are not equally valuable in practice, the discount factor $\gamma$ is used to weight rewards received at different future times:

$$G_{t}(\mathrm{T},\ \gamma) = R_{t + 1} + \gamma R_{t + 2} + \cdots + \gamma^{\mathrm{T} - 1} R_{t + \mathrm{T}}$$
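As a concrete illustration, here is a minimal Python sketch of such an MRP; the 3-state chain, transition matrix `P`, reward vector `R`, and `gamma` below are made-up example values, not taken from the text. It follows the column convention above, where `P[:, j]` stores $p(\cdot \mid s_{j})$, and computes the finite-horizon discounted return $G_{t}(\mathrm{T},\ \gamma)$ along one sampled trajectory.

```python
import numpy as np

# Hypothetical 3-state MRP used only for illustration.
# Column j of P is the distribution p(. | s_j), matching p_ij = p(S_{t+1}=s_i | S_t=s_j).
P = np.array([[0.7, 0.2, 0.0],
              [0.2, 0.5, 0.1],
              [0.1, 0.3, 0.9]])
R = np.array([1.0, 0.0, -0.5])   # R(s_j) = E[R_{t+1} | S_t = s_j]
gamma = 0.9

def sample_trajectory(s0, T, rng):
    """Roll the chain forward T steps from state s0.
    For simplicity the realized reward is taken to be its expectation R(s)."""
    states, rewards = [s0], []
    s = s0
    for _ in range(T):
        rewards.append(R[s])                   # reward received on leaving s
        s = rng.choice(len(R), p=P[:, s])      # next state ~ p(. | s)
        states.append(s)
    return states, rewards

def discounted_return(rewards, gamma):
    """G_t(T, gamma) = R_{t+1} + gamma*R_{t+2} + ... + gamma^(T-1)*R_{t+T}."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rng = np.random.default_rng(0)
_, rewards = sample_trajectory(s0=0, T=20, rng=rng)
print(discounted_return(rewards, gamma))
```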

The value function $v : \mathcal{S} \mapsto \mathbb{R}$ of a state $s$ is the mathematical expectation of the return obtained when in state $s$ at any time $t$; it can be defined in several forms:

| Return form | Mathematical definition |
| --- | --- |
| Finite-horizon sum | $G_{t}(\mathrm{T},\ 1) = R_{t + 1} + R_{t + 2} + \cdots + R_{t + \mathrm{T}}$ |
| Finite-horizon discounted sum | $G_{t}(\mathrm{T},\ \gamma) = R_{t + 1} + \gamma R_{t + 2} + \cdots + \gamma^{\mathrm{T} - 1} R_{t + \mathrm{T}}$ |
| Infinite-horizon discounted sum | $G_{t}(\infty,\ \gamma) = R_{t + 1} + \gamma R_{t + 2} + \gamma^{2} R_{t + 3} + \cdots$ |
| Indefinite-horizon sum (pseudo infinite-horizon) | $G_{t}(\mathrm{T},\ \gamma) = R_{t + 1} + \gamma R_{t + 2} + \cdots + \gamma^{\mathrm{T} - 1} R_{t + \mathrm{T}}$ |
| Infinite-horizon average | $\bar{G}_{t}(\infty,\ 1) = \lim_{\mathrm{T} \to \infty} \dfrac{1}{\mathrm{T}} G_{t}(\mathrm{T},\ 1)$ |

Unless stated otherwise, the return is taken to be $G_{t} = G_{t}(\infty,\ \gamma)$. Under this definition, the state value function is:

$$
\begin{aligned}
v(s_{t}) &= \mathcal{E}(G_{t} \mid S_{t} = s_{t}) = \mathcal{E}_{r_{t + 1} \sim r(\cdot \mid s_{t})} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t})} \mathcal{E}_{r_{t + 2} \sim r(\cdot \mid s_{t + 1})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1})} \cdots \left[ \sum_{\tau = 0}^{\infty} \gamma^{\tau} r_{t + \tau + 1} \right] \\[7mm]
&= \mathcal{E}_{r_{t + 1} \sim r(\cdot \mid s_{t})} \Big[ r_{t + 1} \Big] + \gamma \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t})} \mathcal{E}_{r_{t + 2} \sim r(\cdot \mid s_{t + 1})} \Big[ r_{t + 2} \Big] + \gamma^{2} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1})} \mathcal{E}_{r_{t + 3} \sim r(\cdot \mid s_{t + 2})} \Big[ r_{t + 3} \Big] + \cdots \\[7mm]
&= \mathcal{R}(s_{t}) + \gamma \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t})} \Big[ \mathcal{R}(s_{t + 1}) \Big] + \gamma^{2} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1})} \Big[ \mathcal{R}(s_{t + 2}) \Big] + \cdots \\[7mm]
&= \sum_{\tau = 0}^{\infty} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1})} \cdots \Big[ \gamma^{\tau} \mathcal{R}(s_{t + \tau}) \Big] = \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1})} \cdots \left[ \sum_{\tau = 0}^{\infty} \gamma^{\tau} \mathcal{R}(s_{t + \tau}) \right]
\end{aligned}
$$
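Since $v(s) = \mathcal{E}(G_{t} \mid S_{t} = s)$, the definition can be sanity-checked numerically by averaging many sampled discounted returns. The sketch below reuses the hypothetical MRP and the `sample_trajectory` / `discounted_return` helpers from the previous snippet; it truncates the infinite sum at a horizon `T`, which is harmless because the neglected tail is bounded by $\gamma^{\mathrm{T}} \max_{s} |\mathcal{R}(s)| / (1 - \gamma)$.

```python
def mc_value_estimate(s, n_episodes=5000, T=200, seed=0):
    """Estimate v(s) = E[G_t | S_t = s] by averaging sampled, truncated discounted returns."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_episodes):
        _, rewards = sample_trajectory(s, T, rng)
        total += discounted_return(rewards, gamma)
    return total / n_episodes

v_mc = np.array([mc_value_estimate(s) for s in range(len(R))])
print(v_mc)   # Monte Carlo approximation of v(s_1), v(s_2), v(s_3)
```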

Bellman Expectation Equation

From this definition, the state value function can be recursively decomposed into the Bellman expectation equation:

$$
\begin{aligned}
v(s_{t}) &= \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1})} \cdots \Big[ \mathcal{R}(s_{t}) \Big] + \gamma \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t})} \underset{v(s_{t + 1})}{\underbrace{\mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1})} \cdots \left[ \sum_{\tau = 0}^{\infty} \gamma^{\tau} \mathcal{R}(s_{(t + 1) + \tau}) \right]}} \\[10mm]
&= \mathcal{R}(s_{t}) + \gamma \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t})} \Big[ v(s_{t + 1}) \Big] = \mathcal{R}(s_{t}) + \gamma \sum_{s_{t + 1}} v(s_{t + 1})\, p(s_{t + 1} \mid s_{t})
\end{aligned}
$$
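Stacking $v$ and $\mathcal{R}$ into vectors and keeping the column convention $p_{ij} = p(s_{i} \mid s_{j})$ from the definition above, the sum over successor states becomes $\mathcal{P}^{\top} v$, so the Bellman expectation equation reads $v = \mathcal{R} + \gamma \mathcal{P}^{\top} v$. Because it is linear, it can be solved directly as $v = (I - \gamma \mathcal{P}^{\top})^{-1} \mathcal{R}$ or by fixed-point iteration; a sketch on the same hypothetical MRP as the earlier snippets:

```python
n = len(R)
I = np.eye(n)

# Direct solution of v = R + gamma * P^T v  (column convention: P[:, j] = p(. | s_j)).
v_exact = np.linalg.solve(I - gamma * P.T, R)

# Equivalent fixed-point iteration: repeatedly apply the Bellman backup.
v_iter = np.zeros(n)
for _ in range(1000):
    v_iter = R + gamma * P.T @ v_iter

print(v_exact)   # should closely match v_iter
```

Both solutions should agree with each other and, up to sampling error and truncation, with the Monte Carlo estimate `v_mc` computed above.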

