Policy Gradient

Policy Gradients and Baseline Functions

Policy Gradient

Parameterize a stationary Markov stochastic policy as $\pi_{\theta}(a \mid s)$. To maximize the policy's expected return, define the objective function:

$$
\begin{aligned}
J(\theta) &= \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi_{\theta}(\cdot \mid s_{0})} \mathcal{E}_{r_{1} \sim r(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi_{\theta}(\cdot \mid s_{1})} \mathcal{E}_{r_{2} \sim r(\cdot \mid s_{1},\ a_{1})} \cdots \mathcal{E}_{r_{\mathrm{T} + 1} \sim r(\cdot \mid s_{\mathrm{T}},\ a_{\mathrm{T}})} \left[ \sum_{t = 0}^{\mathrm{T}} \gamma^{t} r_{t + 1} \right] \\[7mm]
&= \mathcal{E}_{s_{0} \sim b_{0}(\cdot)} \mathcal{E}_{a_{0} \sim \pi_{\theta}(\cdot \mid s_{0})} \mathcal{E}_{s_{1} \sim p(\cdot \mid s_{0},\ a_{0})} \mathcal{E}_{a_{1} \sim \pi_{\theta}(\cdot \mid s_{1})} \cdots \mathcal{E}_{s_{\mathrm{T}} \sim p(\cdot \mid s_{\mathrm{T} - 1},\ a_{\mathrm{T} - 1})} \mathcal{E}_{a_{\mathrm{T}} \sim \pi_{\theta}(\cdot \mid s_{\mathrm{T}})} \left[ \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{R}(s_{t},\ a_{t}) \right] \\[7mm]
&= \sum_{s_{0}} \sum_{a_{0}} \sum_{s_{1}} \sum_{a_{1}} \cdots \sum_{s_{\mathrm{T}}} \sum_{a_{\mathrm{T}}} \left[ b_{0}(s_{0}) \prod_{t = 0}^{\mathrm{T}} \pi_{\theta}(a_{t} \mid s_{t}) \prod_{t = 1}^{\mathrm{T}} p(s_{t} \mid s_{t - 1},\ a_{t - 1}) \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{R}(s_{t},\ a_{t}) \right]
\end{aligned}
$$

When differentiating the policy's conditional probability $\pi_{\theta}(a_{t} \mid s_{t})$ with respect to the policy parameters $\theta$, it is convenient to rewrite the gradient as:

$$
\nabla_{\theta} \pi_{\theta}(a_{t} \mid s_{t}) = \pi_{\theta}(a_{t} \mid s_{t}) \frac{\nabla_{\theta} \pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta}(a_{t} \mid s_{t})} = \pi_{\theta}(a_{t} \mid s_{t}) \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t})
$$

Therefore the gradient of the product of policy probabilities in the objective is:

$$
\nabla_{\theta} \prod_{t = 0}^{\mathrm{T}} \pi_{\theta}(a_{t} \mid s_{t}) = \prod_{t = 0}^{\mathrm{T}} \pi_{\theta}(a_{t} \mid s_{t}) \sum_{t = 0}^{\mathrm{T}} \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t})
$$
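This identity is easy to sanity-check numerically. Below is a minimal NumPy sketch (the tabular softmax policy and the hand-picked list of state-action pairs are illustrative assumptions, not anything from the derivation) that compares $\prod_{t} \pi_{\theta} \cdot \sum_{t} \nabla_{\theta} \ln \pi_{\theta}$ against a central finite-difference gradient of the product:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def traj_prob(theta, traj):
    """Product of pi_theta(a_t | s_t) along a fixed (s, a) trajectory."""
    pi = softmax(theta)
    return np.prod([pi[s, a] for s, a in traj])

def score(theta, s, a):
    """Analytic grad_theta ln pi_theta(a | s) for a tabular softmax policy."""
    pi = softmax(theta)
    g = np.zeros_like(theta)
    g[s] = -pi[s]
    g[s, a] += 1.0
    return g

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
theta = rng.normal(size=(n_states, n_actions))
traj = [(0, 2), (1, 0), (3, 1), (2, 2)]          # fixed (s_t, a_t) pairs, illustrative

# analytic form: prod_t pi(a_t|s_t) * sum_t grad ln pi(a_t|s_t)
analytic = traj_prob(theta, traj) * sum(score(theta, s, a) for s, a in traj)

# central finite differences on every component of theta
eps, numeric = 1e-6, np.zeros_like(theta)
for i in range(n_states):
    for j in range(n_actions):
        d = np.zeros_like(theta); d[i, j] = eps
        numeric[i, j] = (traj_prob(theta + d, traj) - traj_prob(theta - d, traj)) / (2 * eps)

print(np.abs(analytic - numeric).max())          # max abs error, should be ~1e-9 or smaller
```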

Differentiating the objective with respect to the policy parameters $\theta$ gives (it really will not fit on one line, so the conditioning distributions in the expectation subscripts are omitted for now 😭):

$$
\begin{aligned}
\nabla_{\theta} J(\theta) &= \sum_{s_{0}} \sum_{a_{0}} \sum_{s_{1}} \sum_{a_{1}} \cdots \sum_{s_{\mathrm{T}}} \sum_{a_{\mathrm{T}}} \left[ b_{0}(s_{0}) \left( \nabla_{\theta} \prod_{t = 0}^{\mathrm{T}} \pi_{\theta}(a_{t} \mid s_{t}) \right) \prod_{t = 1}^{\mathrm{T}} p(s_{t} \mid s_{t - 1},\ a_{t - 1}) \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{R}(s_{t},\ a_{t}) \right] \\[7mm]
&= \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \left[ \left( \sum_{t = 0}^{\mathrm{T}} \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \right) \cdot \left( \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{R}(s_{t},\ a_{t}) \right) \right]
\end{aligned}
$$

Consider the expectation of a single product factor $\nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \mathcal{R}(s_{\tau},\ a_{\tau})$ from the gradient above when $t > \tau$:

$$
\mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \mathcal{R}(s_{\tau},\ a_{\tau}) \Big] = \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \mathcal{R}(s_{\tau},\ a_{\tau}) \Big]
$$

where the expectations over the variables after time $t$ integrate out because the term does not depend on them, and:

$$
\begin{aligned}
&\mathcal{E}_{a_{t} \sim \pi_{\theta}(\cdot \mid s_{t})} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \mathcal{R}(s_{\tau},\ a_{\tau}) \Big] = \mathcal{R}(s_{\tau},\ a_{\tau}) \mathcal{E}_{a_{t} \sim \pi_{\theta}(\cdot \mid s_{t})} \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \\[5mm]
= &\mathcal{R}(s_{\tau},\ a_{\tau}) \sum_{a_{t}} \pi_{\theta}(a_{t} \mid s_{t}) \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) = \mathcal{R}(s_{\tau},\ a_{\tau}) \nabla_{\theta} \underset{1}{\underbrace{\sum_{a_{t}} \pi_{\theta}(a_{t} \mid s_{t})}} = 0
\end{aligned}
$$
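The key fact used here, that the expected score $\mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \nabla_{\theta} \ln \pi_{\theta}(a \mid s)$ vanishes, can also be checked concretely. A tiny sketch, assuming a single-state softmax policy with random logits (an illustrative setup only):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(1)
n_actions = 5
theta = rng.normal(size=n_actions)       # logits of pi_theta(. | s) for one fixed state
pi = softmax(theta)

# grad_theta ln pi(a) for a softmax policy: e_a - pi; row a holds grad ln pi(a)
scores = np.eye(n_actions) - pi

# E_{a ~ pi}[ grad ln pi(a) ] = sum_a pi(a) * grad ln pi(a)
expected_score = pi @ scores
print(np.abs(expected_score).max())      # ~1e-16: the expected score is (numerically) zero
```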

Therefore the expectation of every product factor with $t > \tau$ is 0, and the policy gradient simplifies to:

$$
\begin{aligned}
\nabla_{\theta} J(\theta) &= \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \left[ \sum_{t = 0}^{\mathrm{T}} \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \sum_{\tau = t}^{\mathrm{T}} \gamma^{\tau} \mathcal{R}(s_{\tau},\ a_{\tau}) \right] \\[7mm]
&= \sum_{t = 0}^{\mathrm{T}} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \bigg[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \cdot \gamma^{t} \underset{q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t})}{\underbrace{\mathcal{E}_{s_{t + 1}} \mathcal{E}_{a_{t + 1}} \cdots \mathcal{E}_{s_{\mathrm{T}}} \mathcal{E}_{a_{\mathrm{T}}} \sum_{\tau = t}^{\mathrm{T}} \gamma^{\tau - t} \mathcal{R}(s_{\tau},\ a_{\tau})}} \bigg] \\[10mm]
&= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \Big]
\end{aligned}
$$
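A sample-based (REINFORCE-style) estimator of the first line above replaces the expectations with sampled episodes: each episode contributes $\sum_{t} \gamma^{t} \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t})\, G_{t}$ with the observed reward-to-go $G_{t} = \sum_{\tau = t}^{\mathrm{T}} \gamma^{\tau - t} \mathcal{R}(s_{\tau},\ a_{\tau})$. The sketch below assumes a small randomly generated tabular MDP and a softmax policy, both purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, T, gamma = 3, 2, 5, 0.9

# toy MDP (illustrative, randomly generated): p[s, a] is a distribution over next states
p = rng.dirichlet(np.ones(nS), size=(nS, nA))
R = rng.normal(size=(nS, nA))                    # reward R(s, a)
b0 = np.ones(nS) / nS                            # initial state distribution
theta = rng.normal(size=(nS, nA))                # softmax policy logits

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def score(pi, s, a):
    """grad_theta ln pi(a | s) for a tabular softmax policy with probabilities pi."""
    g = np.zeros_like(pi)
    g[s] = -pi[s]
    g[s, a] += 1.0
    return g

def reinforce_gradient(theta, n_episodes=20000):
    """Monte Carlo estimate of grad J: sum_t gamma^t * grad ln pi(a_t|s_t) * G_t."""
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(n_episodes):
        s = rng.choice(nS, p=b0)
        traj = []
        for t in range(T + 1):                   # t = 0, ..., T
            a = rng.choice(nA, p=pi[s])
            traj.append((s, a))
            s = rng.choice(nS, p=p[s, a])
        G, returns = 0.0, [0.0] * len(traj)      # reward-to-go, built backwards
        for t in reversed(range(len(traj))):
            s_t, a_t = traj[t]
            G = R[s_t, a_t] + gamma * G
            returns[t] = G
        for t, (s_t, a_t) in enumerate(traj):
            grad += gamma**t * score(pi, s_t, a_t) * returns[t]
    return grad / n_episodes

print(reinforce_gradient(theta))
```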

Furthermore, since:

$$
\begin{aligned}
\mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \Big[ f(s_{t}) \Big] &= \sum_{s_{0}} \sum_{a_{0}} \sum_{s_{1}} \sum_{a_{1}} \cdots \sum_{s_{t}} b_{0}(s_{0}) \prod_{\tau = 0}^{t - 1} \pi_{\theta}(a_{\tau} \mid s_{\tau}) \prod_{\tau = 0}^{t - 1} p(s_{\tau + 1} \mid s_{\tau},\ a_{\tau}) f(s_{t}) \\[7mm]
&= \sum_{s_{t}} f(s_{t}) \sum_{s_{0}} \sum_{a_{0}} \sum_{s_{1}} \sum_{a_{1}} \cdots \sum_{s_{t - 1}} b_{0}(s_{0}) \prod_{\tau = 0}^{t - 1} \pi_{\theta}(a_{\tau} \mid s_{\tau}) \prod_{\tau = 0}^{t - 1} p(s_{\tau + 1} \mid s_{\tau},\ a_{\tau}) \\[7mm]
&= \sum_{s_{t}} f(s_{t}) b_{t}(s_{t}) = \mathcal{E}_{s_{t} \sim b_{t}(\cdot)} \Big[ f(s_{t}) \Big]
\end{aligned}
$$

where $b_{t}(\cdot)$ is the marginal distribution of the state $s_{t}$ under the initial state distribution $b_{0}(\cdot)$ and the policy $\pi_{\theta}$, the policy gradient can be rewritten as:

$$
\nabla_{\theta} J(\theta) = \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{t} \sim b_{t}(\cdot)} \mathcal{E}_{a_{t} \sim \pi_{\theta}(\cdot \mid s_{t})} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \Big]
$$
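In a tabular MDP this gradient can also be computed exactly: a forward recursion gives the marginals $b_{t}$, a backward recursion gives $q_{\pi_{\theta}}^{(t)}$, and the formula above is then evaluated directly. The sketch below reuses the same illustrative toy MDP and seed as the Monte Carlo sketch earlier, so with enough sampled episodes there the two results should roughly agree:

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, T, gamma = 3, 2, 5, 0.9

p = rng.dirichlet(np.ones(nS), size=(nS, nA))    # p[s, a, s'] (illustrative toy MDP)
R = rng.normal(size=(nS, nA))
b0 = np.ones(nS) / nS
theta = rng.normal(size=(nS, nA))

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

pi = softmax(theta)

# forward recursion: b_{t+1}(s') = sum_{s,a} b_t(s) pi(a|s) p(s'|s,a)
b = [b0]
for t in range(T):
    b.append(np.einsum('s,sa,sax->x', b[t], pi, p))

# backward recursion for the finite-horizon q^{(t)}(s, a)
q = [None] * (T + 1)
q[T] = R.copy()
for t in reversed(range(T)):
    v_next = (pi * q[t + 1]).sum(axis=1)          # v^{(t+1)}(s')
    q[t] = R + gamma * p @ v_next                 # p @ v: sum_{s'} p(s'|s,a) v(s')

# exact gradient: sum_t gamma^t sum_s b_t(s) sum_a pi(a|s) grad ln pi(a|s) q^{(t)}(s,a)
grad = np.zeros_like(theta)
for t in range(T + 1):
    w = b[t][:, None] * pi * q[t]                 # weight of each (s, a) term
    grad += gamma**t * (w - w.sum(axis=1, keepdims=True) * pi)

print(grad)
```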

Extending to the infinite-horizon setting:

$$
\begin{aligned}
\nabla_{\theta} J(\theta) &= \sum_{t = 0}^{\infty} \gamma^{t} \mathcal{E}_{s_{t} \sim b_{t}(\cdot)} \mathcal{E}_{a_{t} \sim \pi_{\theta}(\cdot \mid s_{t})} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) q_{\pi_{\theta}}(s_{t},\ a_{t}) \Big] \\[7mm]
&= \sum_{t = 0}^{\infty} \gamma^{t} \sum_{s} b_{t}(s) \sum_{a} \pi_{\theta}(a \mid s) \Big[ \nabla_{\theta} \ln \pi_{\theta}(a \mid s) q_{\pi_{\theta}}(s,\ a) \Big] \\[7mm]
&= \frac{1}{1 - \gamma} \sum_{s} \underset{\nu_{\pi_{\theta}}(s)}{\underbrace{\left( (1 - \gamma) \sum_{t = 0}^{\infty} \gamma^{t} b_{t}(s) \right)}} \sum_{a} \pi_{\theta}(a \mid s) \Big[ \nabla_{\theta} \ln \pi_{\theta}(a \mid s) q_{\pi_{\theta}}(s,\ a) \Big] \\[10mm]
&= \frac{1}{1 - \gamma} \sum_{s} \sum_{a} \underset{\rho_{\pi_{\theta}}(s,\ a)}{\underbrace{\nu_{\pi_{\theta}}(s) \pi_{\theta}(a \mid s)}} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a \mid s) q_{\pi_{\theta}}(s,\ a) \Big] \\[10mm]
&\propto \mathcal{E}_{s \sim \nu_{\pi_{\theta}}(\cdot)} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a \mid s) q_{\pi_{\theta}}(s,\ a) \Big] = \mathcal{E}_{(s,\ a) \sim \rho_{\pi_{\theta}}(\cdot,\ \cdot)} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a \mid s) q_{\pi_{\theta}}(s,\ a) \Big]
\end{aligned}
$$

where $\nu_{\pi_{\theta}}(\cdot)$ and $\rho_{\pi_{\theta}}(\cdot,\ \cdot)$ are, respectively, the state visitation distribution and the occupancy measure under the initial state distribution $b_{0}(\cdot)$ and the policy $\pi_{\theta}$, and they satisfy:

$$
\begin{aligned}
\nu_{\pi}(s) &= (1 - \gamma) \sum_{t = 0}^{\infty} \gamma^{t} b_{t}(s) = (1 - \gamma) b_{0}(s) + (1 - \gamma) \gamma \sum_{t = 0}^{\infty} \gamma^{t} b_{t + 1}(s) \\[7mm]
&= (1 - \gamma) b_{0}(s) + (1 - \gamma) \gamma \sum_{t = 0}^{\infty} \gamma^{t} \sum_{s'} b_{t}(s') p_{\pi}^{(1)}(s \mid s') \\[7mm]
&= (1 - \gamma) b_{0}(s) + (1 - \gamma) \gamma \sum_{t = 0}^{\infty} \gamma^{t} \sum_{s'} b_{t}(s') \sum_{a'} p(s \mid s',\ a') \pi(a' \mid s') \\[7mm]
&= (1 - \gamma) b_{0}(s) + \gamma \sum_{s'} \underset{\nu_{\pi}(s')}{\underbrace{(1 - \gamma) \sum_{t = 0}^{\infty} \gamma^{t} b_{t}(s')}} \sum_{a'} p(s \mid s',\ a') \pi(a' \mid s') \\[11mm]
&= (1 - \gamma) b_{0}(s) + \gamma \sum_{s'} \nu_{\pi}(s') \sum_{a'} p(s \mid s',\ a') \pi(a' \mid s')
\end{aligned}
$$
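This fixed-point characterization is easy to check numerically on an illustrative random tabular MDP: the sketch below computes $\nu_{\pi}$ once as a (truncated) discounted sum of the marginals $b_{t}$ and once by solving the linear equation above, and compares the two:

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma = 4, 3, 0.9

p = rng.dirichlet(np.ones(nS), size=(nS, nA))      # p[s, a, s'] (illustrative)
pi = rng.dirichlet(np.ones(nA), size=nS)           # a fixed stochastic policy pi[s, a]
b0 = rng.dirichlet(np.ones(nS))

# state-to-state kernel under pi: P[s, s'] = sum_a pi(a|s) p(s'|s, a)
P = np.einsum('sa,sax->sx', pi, p)

# (1) truncated discounted sum of the marginals b_t
nu_sum, b_t = np.zeros(nS), b0.copy()
for t in range(2000):
    nu_sum += (1 - gamma) * gamma**t * b_t
    b_t = P.T @ b_t                                # b_{t+1}(s') = sum_s b_t(s) P[s, s']

# (2) solve the fixed-point equation nu = (1 - gamma) b0 + gamma P^T nu
nu_fix = np.linalg.solve(np.eye(nS) - gamma * P.T, (1 - gamma) * b0)

print(np.abs(nu_sum - nu_fix).max(), nu_fix.sum())  # ~0 and 1.0
```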

By an argument similar to the one for the Bellman expectation equation, the mapping on the right-hand side is a contraction, so a solution of this equation exists and is unique. The normalization factor $1 - \gamma$ ensures that the state visitation distribution is a proper probability distribution:

$$
\sum_{s} \nu_{\pi_{\theta}}(s) = (1 - \gamma) \sum_{t = 0}^{\infty} \gamma^{t} \sum_{s} b_{t}(s) = (1 - \gamma) \sum_{t = 0}^{\infty} \gamma^{t} = (1 - \gamma) \frac{1}{1 - \gamma} = 1
$$

If the initial state distribution $b_{0}(\cdot)$ is the stationary distribution $\iota_{\pi_{\theta}}(\cdot)$ under $\pi_{\theta}$, i.e.:

$$
\iota_{\pi_{\theta}}(s_{t + 1}) = \sum_{s_{t}} \sum_{a_{t}} \iota_{\pi_{\theta}}(s_{t}) \pi_{\theta}(a_{t} \mid s_{t}) p(s_{t + 1} \mid s_{t},\ a_{t}) = \sum_{s_{t}} \iota_{\pi_{\theta}}(s_{t}) \sum_{a_{t}} \pi_{\theta}(a_{t} \mid s_{t}) p(s_{t + 1} \mid s_{t},\ a_{t})
$$
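For a tabular MDP, a stationary distribution of the state kernel $p_{\pi_{\theta}}^{(1)}(s' \mid s) = \sum_{a} \pi_{\theta}(a \mid s)\, p(s' \mid s,\ a)$ can be approximated by simple power iteration, as in the sketch below (random kernel and policy, illustrative only; the chain here has strictly positive transition probabilities, so the iteration converges):

```python
import numpy as np

rng = np.random.default_rng(4)
nS, nA = 4, 3

p = rng.dirichlet(np.ones(nS), size=(nS, nA))      # p[s, a, s'] (illustrative)
pi = rng.dirichlet(np.ones(nA), size=nS)           # fixed policy pi[s, a]

# state kernel under pi, then power iteration for its stationary distribution
P = np.einsum('sa,sax->sx', pi, p)
iota = np.ones(nS) / nS
for _ in range(10000):
    iota = P.T @ iota

print(np.abs(iota - P.T @ iota).max(), iota.sum())  # fixed point of the kernel, sums to 1
```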

then under the stationary distribution:

$$
\begin{aligned}
&\mathcal{E}_{s_{t} \sim \iota_{\pi_{\theta}}(\cdot)} \mathcal{E}_{a_{t} \sim \pi_{\theta}(\cdot \mid s_{t})} \mathcal{E}_{s_{t + 1} \sim p(\cdot \mid s_{t},\ a_{t})} \Big[ f(s_{t + 1}) \Big] = \sum_{s_{t}} \sum_{a_{t}} \sum_{s_{t + 1}} \iota_{\pi_{\theta}}(s_{t}) \pi_{\theta}(a_{t} \mid s_{t}) p(s_{t + 1} \mid s_{t},\ a_{t}) f(s_{t + 1}) \\[7mm]
= &\sum_{s_{t + 1}} f(s_{t + 1}) \sum_{s_{t}} \sum_{a_{t}} \iota_{\pi_{\theta}}(s_{t}) \pi_{\theta}(a_{t} \mid s_{t}) p(s_{t + 1} \mid s_{t},\ a_{t}) = \sum_{s_{t + 1}} \iota_{\pi_{\theta}}(s_{t + 1}) f(s_{t + 1}) = \mathcal{E}_{s_{t + 1} \sim \iota_{\pi_{\theta}}(\cdot)} \Big[ f(s_{t + 1}) \Big]
\end{aligned}
$$

Combining this with the stationary-distribution assumption, the policy gradient can be rewritten as:

$$
\begin{aligned}
\nabla_{\theta} J(\theta) &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{t} \sim b_{t}(\cdot)} \mathcal{E}_{a_{t} \sim \pi_{\theta}(\cdot \mid s_{t})} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \Big] \\[7mm]
&= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{t} \sim \iota_{\pi_{\theta}}(\cdot)} \mathcal{E}_{a_{t} \sim \pi_{\theta}(\cdot \mid s_{t})} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \Big]
\end{aligned}
$$

In the infinite-horizon setting it becomes:

$$
\nabla_{\theta} J(\theta) = \frac{1}{1 - \gamma} \mathcal{E}_{s \sim \iota_{\pi_{\theta}}(\cdot)} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a \mid s) q_{\pi_{\theta}}(s,\ a) \Big] \propto \mathcal{E}_{s \sim \iota_{\pi_{\theta}}(\cdot)} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a \mid s) q_{\pi_{\theta}}(s,\ a) \Big]
$$

Policy Gradient with a Baseline

Using a baseline function $b$, rewrite the policy gradient as:

$$
\nabla_{\theta} J(\theta) \triangleq \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \mathcal{E}_{s_{1}} \mathcal{E}_{a_{1}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \left[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \Big( q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) - b \Big) \right]
$$

Under the stationary-distribution assumption and in the infinite-horizon setting:

$$
\nabla_{\theta} J(\theta) \triangleq \mathcal{E}_{s \sim \iota_{\pi_{\theta}}(\cdot)} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \Big[ \Big( q_{\pi_{\theta}}(s,\ a) - b \Big) \nabla_{\theta} \ln \pi_{\theta}(a \mid s) \Big]
$$

Here the baseline $b$ is not a function of the action $a$ (or $a_{t}$), so:

$$
\mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} b \nabla_{\theta} \ln \pi_{\theta}(a \mid s) = b \sum_{a} \pi_{\theta}(a \mid s) \nabla_{\theta} \ln \pi_{\theta}(a \mid s) = b \nabla_{\theta} \sum_{a} \pi_{\theta}(a \mid s) = 0
$$

Therefore adding the baseline leaves the policy gradient unchanged, and the sample-based estimate remains unbiased; however, the variance of the estimate does depend on the baseline:

$$
\begin{aligned}
\mathrm{Var} &= \sum_{i = 1}^{d} \mathcal{D}_{a \sim \pi_{\theta}(\cdot \mid s)} \left[ \Big( q_{\pi_{\theta}}(s,\ a) - b \Big) \frac{\partial}{\partial \theta_{i}} \ln \pi_{\theta}(a \mid s) \right] \\[5mm]
&= \sum_{i = 1}^{d} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \left[ \Big( q_{\pi_{\theta}}(s,\ a) - b \Big)^{2} \left( \frac{\partial}{\partial \theta_{i}} \ln \pi_{\theta}(a \mid s) \right)^{2} \right] - \sum_{i = 1}^{d} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)}^{2} \left[ \Big( q_{\pi_{\theta}}(s,\ a) - b \Big) \frac{\partial}{\partial \theta_{i}} \ln \pi_{\theta}(a \mid s) \right] \\[5mm]
&= \sum_{i = 1}^{d} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \left[ \Big( q_{\pi_{\theta}}(s,\ a) - b \Big)^{2} \left( \frac{\partial}{\partial \theta_{i}} \ln \pi_{\theta}(a \mid s) \right)^{2} \right] - \sum_{i = 1}^{d} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)}^{2} \left[ q_{\pi_{\theta}}(s,\ a) \frac{\partial}{\partial \theta_{i}} \ln \pi_{\theta}(a \mid s) \right]
\end{aligned}
$$

The first and second derivatives of the variance with respect to the baseline are:

$$
\begin{gathered}
\frac{d}{db} \mathrm{Var} = 2 \sum_{i = 1}^{d} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \left[ \Big( b - q_{\pi_{\theta}}(s,\ a) \Big) \left( \frac{\partial}{\partial \theta_{i}} \ln \pi_{\theta}(a \mid s) \right)^{2} \right] \\[7mm]
\frac{d^{2}}{db^{2}} \mathrm{Var} = 2 \sum_{i = 1}^{d} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \left( \frac{\partial}{\partial \theta_{i}} \ln \pi_{\theta}(a \mid s) \right)^{2} \ge 0
\end{gathered}
$$

Setting the first derivative to zero gives the optimal baseline:

$$
b^{\star} = \sum_{i = 1}^{d} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \left[ q_{\pi_{\theta}}(s,\ a) \left( \frac{\partial}{\partial \theta_{i}} \ln \pi_{\theta}(a \mid s) \right)^{2} \right] \bigg/ \sum_{i = 1}^{d} \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \left( \frac{\partial}{\partial \theta_{i}} \ln \pi_{\theta}(a \mid s) \right)^{2}
$$

In practice one can take the baseline to be $b = v_{\pi_{\theta}}(s)$, which already reduces the variance of the policy gradient estimate.
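The effect of the baseline on the variance can be checked directly for a single state with a softmax policy. The sketch below (random logits and action values, all illustrative) computes the exact per-component variance of the single-sample estimator under $b = 0$, $b = v_{\pi_{\theta}}(s)$, and the optimal $b^{\star}$ from the formula above; by construction the minimum is attained at $b^{\star}$, and $b = v_{\pi_{\theta}}(s)$ is typically close to it:

```python
import numpy as np

rng = np.random.default_rng(5)
nA = 4
theta = rng.normal(size=nA)                 # logits of pi(. | s) for one fixed state
q = rng.normal(size=nA) + 3.0               # action values q(s, a), shifted to be positive

pi = np.exp(theta - theta.max())
pi /= pi.sum()
scores = np.eye(nA) - pi                    # row a: grad_theta ln pi(a | s)

def grad_variance(b):
    """Exact sum over components i of Var_a[(q(s,a) - b) * d/d theta_i ln pi(a|s)]."""
    g = (q - b)[:, None] * scores           # per-action single-sample gradient
    mean = pi @ g                           # expectation over a ~ pi
    second = pi @ (g**2)                    # expectation of the squared components
    return (second - mean**2).sum()

v = pi @ q                                  # v(s) = E_a q(s, a)
weights = (scores**2).sum(axis=1)           # sum_i (d/d theta_i ln pi(a|s))^2 per action
b_star = (pi @ (q * weights)) / (pi @ weights)

# compare the variance of the single-sample estimator under different baselines
for name, b in [("b = 0", 0.0), ("b = v(s)", v), ("b = b*", b_star)]:
    print(f"{name:10s} variance = {grad_variance(b):.4f}")
```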

💡 Another way to understand the role of the baseline: when a single sample is used for a stochastic gradient update under the original policy gradient:

$$
\theta \leftarrow \theta + \alpha \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t})
$$

Suppose, for example, that all action values are positive. Even when the value of $a_{t}$ is below the average value of all actions (e.g. $v_{\pi_{\theta}}^{(t)}(s_{t})$), the update above still tends to increase $\pi_{\theta}(a_{t} \mid s_{t})$ and correspondingly decrease the sampling probabilities of other, better actions. After adding the baseline $v_{\pi_{\theta}}^{(t)}(s_{t})$:

$$
\theta \leftarrow \theta + \alpha \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \Big[ q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) - v_{\pi_{\theta}}^{(t)}(s_{t}) \Big] = \theta + \alpha \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t}) \Big[ q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) - \mathcal{E}_{a_{t} \sim \pi_{\theta}(\cdot \mid s_{t})} q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \Big]
$$

If $q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) < v_{\pi_{\theta}}^{(t)}(s_{t})$, the update tends to decrease $\pi_{\theta}(a_{t} \mid s_{t})$ and increase the sampling probabilities of other, better actions. Compared with the original policy gradient without a baseline, this updates the policy more effectively under stochastic gradients.
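The following toy calculation (one state, three actions with hand-picked positive values, and an assumed learning rate, all illustrative) shows exactly this effect: a single update on the worst action increases its probability without the baseline and decreases it with the baseline:

```python
import numpy as np

nA, alpha = 3, 0.5
theta0 = np.zeros(nA)                          # uniform initial policy
q = np.array([5.0, 3.0, 1.0])                  # all action values positive (illustrative)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

pi0 = softmax(theta0)
v = pi0 @ q                                    # v(s) = 3.0, the average action value
a_t = 2                                        # sampled the worst action (q = 1 < v)
score = -pi0.copy()                            # grad_theta ln pi(a_t | s)
score[a_t] += 1.0

theta_no_baseline = theta0 + alpha * score * q[a_t]
theta_baseline = theta0 + alpha * score * (q[a_t] - v)

print("pi(a_t|s) before:          ", pi0[a_t])
print("pi(a_t|s) without baseline:", softmax(theta_no_baseline)[a_t])  # goes up
print("pi(a_t|s) with baseline:   ", softmax(theta_baseline)[a_t])     # goes down
```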

