Markov Reward Process

Markov Reward Process (MRP)

Basic Definitions

An MRP model $\mathscr{M}$ can be formally defined as the tuple $\langle \mathcal{S},\ \mathcal{P},\ \mathcal{R},\ \gamma \rangle$:

  1. Finite state set $\mathcal{S} = \{ s_{1},\ s_{2},\ \cdots,\ s_{n} \}$
  2. State transition probability matrix $\mathcal{P} = \{ p_{ij} = p(S_{t + 1} = s_{i} \mid S_{t} = s_{j}) \}_{n \times n}$
  3. Reward function $\mathcal{R} : \mathcal{S} \mapsto \mathbb{R}$, with $\mathcal{R}(s) = \mathcal{E}(R_{t + 1} \mid S_{t} = s)$
  4. Discount factor $\gamma \in [0,\ 1)$

Here the state set $\mathcal{S}$ and the state transition probability matrix $\mathcal{P}$ are defined exactly as in an ordinary Markov process. The reward function $\mathcal{R}$ gives the expected reward for being in state $s$ at time $t$ (received at the next time step). Consider the return accumulated over a future horizon $\mathrm{T}$ when in state $s$ at time $t$:

$$G_{t}(\mathrm{T},\ 1) = R_{t + 1} + R_{t + 2} + \cdots + R_{t + \mathrm{T}}$$

Because near-term and long-term rewards are not equally valuable in practice, the discount factor $\gamma$ is used to weight rewards received at different future times:

$$G_{t}(\mathrm{T},\ \gamma) = R_{t + 1} + \gamma R_{t + 2} + \cdots + \gamma^{\mathrm{T} - 1} R_{t + \mathrm{T}}$$
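As a concrete illustration, here is a minimal Python sketch of such an MRP; the 3-state chain, transition matrix `P`, reward vector `R`, and `gamma` below are made-up example values, not taken from the text. It follows the column convention above, where `P[:, j]` stores $p(\cdot \mid s_{j})$, and computes the finite-horizon discounted return $G_{t}(\mathrm{T},\ \gamma)$ along one sampled trajectory.

```python
import numpy as np

# Hypothetical 3-state MRP used only for illustration.
# Column j of P is the distribution p(. | s_j), matching p_ij = p(S_{t+1}=s_i | S_t=s_j).
P = np.array([[0.7, 0.2, 0.0],
              [0.2, 0.5, 0.1],
              [0.1, 0.3, 0.9]])
R = np.array([1.0, 0.0, -0.5])   # R(s_j) = E[R_{t+1} | S_t = s_j]
gamma = 0.9

def sample_trajectory(s0, T, rng):
    """Roll the chain forward T steps from state s0.
    For simplicity the realized reward is taken to be its expectation R(s)."""
    states, rewards = [s0], []
    s = s0
    for _ in range(T):
        rewards.append(R[s])                   # reward received on leaving s
        s = rng.choice(len(R), p=P[:, s])      # next state ~ p(. | s)
        states.append(s)
    return states, rewards

def discounted_return(rewards, gamma):
    """G_t(T, gamma) = R_{t+1} + gamma*R_{t+2} + ... + gamma^(T-1)*R_{t+T}."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rng = np.random.default_rng(0)
_, rewards = sample_trajectory(s0=0, T=20, rng=rng)
print(discounted_return(rewards, gamma))
```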

The value function $v : \mathcal{S} \mapsto \mathbb{R}$ of a state $s$ is the mathematical expectation of the return obtained when in state $s$ at any time $t$; it can be defined in several forms:

| Return form | Mathematical definition |
| --- | --- |
| Finite-horizon sum | $G_{t}(\mathrm{T},\ 1) = R_{t + 1} + R_{t + 2} + \cdots + R_{t + \mathrm{T}}$ |
| Finite-horizon discounted sum | $G_{t}(\mathrm{T},\ \gamma) = R_{t + 1} + \gamma R_{t + 2} + \cdots + \gamma^{\mathrm{T} - 1} R_{t + \mathrm{T}}$ |
| Infinite-horizon discounted sum | $G_{t}(\infty,\ \gamma) = R_{t + 1} + \gamma R_{t + 2} + \gamma^{2} R_{t + 3} + \cdots$ |
| Indefinite-horizon sum (pseudo infinite-horizon) | $G_{t}(\mathrm{T},\ \gamma) = R_{t + 1} + \gamma R_{t + 2} + \cdots + \gamma^{\mathrm{T} - 1} R_{t + \mathrm{T}}$ |
| Infinite-horizon average | $\bar{G}_{t}(\infty,\ 1) = \lim_{\mathrm{T} \to \infty} \dfrac{1}{\mathrm{T}} G_{t}(\mathrm{T},\ 1)$ |

Unless stated otherwise, the return is taken to be $G_{t} = G_{t}(\infty,\ \gamma)$. Under this definition, the state value function is:

$$
\begin{aligned}
v(s_{t}) &= \mathcal{E}(G_{t} \mid S_{t} = s_{t}) = \mathcal{E}_{r_{t + 1} \sim r(\cdot \mid s_{t})} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t})} \mathcal{E}_{r_{t + 2} \sim r(\cdot \mid s_{t + 1})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1})} \cdots \left[ \sum_{\tau = 0}^{\infty} \gamma^{\tau} r_{t + \tau + 1} \right] \\[7mm]
&= \mathcal{E}_{r_{t + 1} \sim r(\cdot \mid s_{t})} \Big[ r_{t + 1} \Big] + \gamma \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t})} \mathcal{E}_{r_{t + 2} \sim r(\cdot \mid s_{t + 1})} \Big[ r_{t + 2} \Big] + \gamma^{2} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1})} \mathcal{E}_{r_{t + 3} \sim r(\cdot \mid s_{t + 2})} \Big[ r_{t + 3} \Big] + \cdots \\[7mm]
&= \mathcal{R}(s_{t}) + \gamma \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t})} \Big[ \mathcal{R}(s_{t + 1}) \Big] + \gamma^{2} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1})} \Big[ \mathcal{R}(s_{t + 2}) \Big] + \cdots \\[7mm]
&= \sum_{\tau = 0}^{\infty} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1})} \cdots \Big[ \gamma^{\tau} \mathcal{R}(s_{t + \tau}) \Big] = \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1})} \cdots \left[ \sum_{\tau = 0}^{\infty} \gamma^{\tau} \mathcal{R}(s_{t + \tau}) \right]
\end{aligned}
$$
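Since $v(s) = \mathcal{E}(G_{t} \mid S_{t} = s)$, the definition can be sanity-checked numerically by averaging many sampled discounted returns. The sketch below reuses the hypothetical MRP and the `sample_trajectory` / `discounted_return` helpers from the previous snippet; it truncates the infinite sum at a horizon `T`, which is harmless because the neglected tail is bounded by $\gamma^{\mathrm{T}} \max_{s} |\mathcal{R}(s)| / (1 - \gamma)$.

```python
def mc_value_estimate(s, n_episodes=5000, T=200, seed=0):
    """Estimate v(s) = E[G_t | S_t = s] by averaging sampled, truncated discounted returns."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_episodes):
        _, rewards = sample_trajectory(s, T, rng)
        total += discounted_return(rewards, gamma)
    return total / n_episodes

v_mc = np.array([mc_value_estimate(s) for s in range(len(R))])
print(v_mc)   # Monte Carlo approximation of v(s_1), v(s_2), v(s_3)
```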

Bellman Expectation Equation

From this definition, the state value function can be recursively decomposed into the Bellman expectation equation:

$$
\begin{aligned}
v(s_{t}) &= \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t})} \mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1})} \cdots \Big[ \mathcal{R}(s_{t}) \Big] + \gamma \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t})} \underset{v(s_{t + 1})}{\underbrace{\mathcal{E}_{s_{t + 2} \sim p(\cdot \mid s_{t + 1})} \cdots \left[ \sum_{\tau = 0}^{\infty} \gamma^{\tau} \mathcal{R}(s_{(t + 1) + \tau}) \right]}} \\[10mm]
&= \mathcal{R}(s_{t}) + \gamma \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t})} \Big[ v(s_{t + 1}) \Big] = \mathcal{R}(s_{t}) + \gamma \sum_{s_{t + 1}} v(s_{t + 1})\, p(s_{t + 1} \mid s_{t})
\end{aligned}
$$
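Stacking $v$ and $\mathcal{R}$ into vectors and keeping the column convention $p_{ij} = p(s_{i} \mid s_{j})$ from the definition above, the sum over successor states becomes $\mathcal{P}^{\top} v$, so the Bellman expectation equation reads $v = \mathcal{R} + \gamma \mathcal{P}^{\top} v$. Because it is linear, it can be solved directly as $v = (I - \gamma \mathcal{P}^{\top})^{-1} \mathcal{R}$ or by fixed-point iteration; a sketch on the same hypothetical MRP as the earlier snippets:

```python
n = len(R)
I = np.eye(n)

# Direct solution of v = R + gamma * P^T v  (column convention: P[:, j] = p(. | s_j)).
v_exact = np.linalg.solve(I - gamma * P.T, R)

# Equivalent fixed-point iteration: repeatedly apply the Bellman backup.
v_iter = np.zeros(n)
for _ in range(1000):
    v_iter = R + gamma * P.T @ v_iter

print(v_exact)   # should closely match v_iter
```

Both solutions should agree with each other and, up to sampling error and truncation, with the Monte Carlo estimate `v_mc` computed above.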

