Gaussian Policy

Stochastic Gaussian Policy

The stochastic Gaussian policy models each action dimension as an independent normal distribution, producing the mean with a mean network and the log-variance with a log-variance network:

$$\pi_{\theta}(a \mid s) = \prod_{i = 1}^{d} \frac{1}{\sqrt{2 \pi}\, \sigma_{i}(s)} \exp \left( - \frac{[a_{i} - \mu_{i}(s)]^{2}}{2 \sigma_{i}^{2}(s)} \right) = \prod_{i = 1}^{d} \frac{1}{\sqrt{2 \pi \exp [\rho_{\theta;\ i}(s)]}} \exp \left( -\frac{[a_{i} - \mu_{\theta;\ i}(s)]^{2}}{2 \exp [\rho_{\theta;\ i}(s)]} \right)$$
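
As a concrete illustration, here is a minimal PyTorch sketch of this parameterization; the class name `GaussianPolicy`, the single-layer `Tanh` backbone, and the hidden size are illustrative assumptions rather than part of the derivation.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy with a mean head mu_theta(s) and a log-variance head rho_theta(s)."""
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.Tanh())
        self.mu_head = nn.Linear(hidden_dim, action_dim)   # mu_theta(s)
        self.rho_head = nn.Linear(hidden_dim, action_dim)  # rho_theta(s) = ln sigma_theta^2(s)

    def forward(self, state):
        h = self.backbone(state)
        return self.mu_head(h), self.rho_head(h)

    def sample(self, state):
        mu, rho = self(state)
        std = torch.exp(0.5 * rho)                 # sigma = exp(rho / 2) is positive by construction
        dist = torch.distributions.Normal(mu, std)
        action = dist.sample()
        # independent dimensions: the joint log-probability is the sum over dimensions
        return action, dist.log_prob(action).sum(dim=-1)
```

Calling `policy.sample(state)` draws an action and returns its joint log-probability, summing over dimensions because the per-dimension factors are independent.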

Parameterizing the log-variance rather than the standard deviation removes the positivity constraint that would otherwise have to be enforced on the network output. On top of the mean network and the log-variance network, define an auxiliary network:

$$f_{\theta}(s,\ a) = \ln \pi_{\theta}(a \mid s) = -\frac{1}{2} \sum_{i = 1}^{d} \left( \rho_{\theta;\ i}(s) + \frac{[a_{i} - \mu_{\theta;\ i}(s)]^{2}}{\exp [\rho_{\theta;\ i}(s)]} \right) + \mathrm{constant}$$
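
The auxiliary network can be evaluated directly from the two heads; a sketch under the same assumptions, writing out the constant as $-\tfrac{d}{2}\ln(2\pi)$:

```python
import math
import torch

def f_theta(mu, rho, action):
    """f_theta(s, a) = ln pi_theta(a | s) for the diagonal Gaussian policy above."""
    d = action.shape[-1]
    quadratic = (action - mu) ** 2 / torch.exp(rho)   # [a_i - mu_i(s)]^2 / sigma_i^2(s)
    return -0.5 * (rho + quadratic).sum(dim=-1) - 0.5 * d * math.log(2.0 * math.pi)
```

Since `mu` and `rho` are differentiable outputs of the policy network, autograd yields $\nabla_{\theta} f_{\theta}(s,\ a)$ without any manual derivative.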

Under the stochastic Gaussian policy, the policy gradient can then be written as:

$$\begin{aligned} \nabla_{\theta} J(\theta) &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \Big[ \nabla_{\theta} \ln \pi_{\theta}(a_{t} \mid s_{t})\, q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \Big] \\ &= \sum_{t = 0}^{\mathrm{T}} \gamma^{t} \mathcal{E}_{s_{0}} \mathcal{E}_{a_{0}} \cdots \mathcal{E}_{s_{t}} \mathcal{E}_{a_{t}} \Big[ \nabla_{\theta} f_{\theta}(s_{t},\ a_{t})\, q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t}) \Big] \end{aligned}$$

The action-value function in this expression can be estimated with REINFORCE, Actor-Critic, or their variants with baselines.
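
As one concrete choice, the plain REINFORCE estimator replaces $q_{\pi_{\theta}}^{(t)}(s_{t},\ a_{t})$ with the Monte-Carlo return-to-go $G_t$; the sketch below builds the corresponding surrogate loss (the function name and the return-to-go estimate are assumptions for illustration, not the only option discussed above):

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Surrogate loss whose negative gradient is a sample estimate of the policy gradient above,
    with q^(t)(s_t, a_t) estimated by the discounted return-to-go G_t."""
    returns, g = [], 0.0
    for r in reversed(rewards):          # G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    discounts = gamma ** torch.arange(len(rewards), dtype=torch.float32)
    # minimizing this loss performs gradient ascent on J(theta)
    return -(discounts * torch.stack(log_probs) * returns).sum()
```

Here `log_probs` is the list of per-step log-probabilities returned by `policy.sample` (or by `f_theta`) along one trajectory, and `rewards` is the matching list of scalar rewards.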

