EMU

State Embedding

Conventional episodic control methods adopt a Gaussian random projection to embed states for memory retrieval:

$$
x_{t} = W s_{t} \in \mathbb{R}^{k}, \qquad W \in \mathbb{R}^{k \times d}, \quad s_{t} \in \mathbb{R}^{d}, \qquad W_{ij} \sim \mathcal{N}\left( 0,\ \frac{1}{k} \right)
$$

Although it preserves distance relationships in the raw state space, such a random projection carries little semantic meaning.
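For concreteness, here is a minimal NumPy sketch of this baseline embedding; the function names and the dimensions in the usage example are illustrative, not taken from the paper.

```python
import numpy as np

def make_random_projection(d: int, k: int, seed: int = 0) -> np.ndarray:
    """Sample a fixed projection matrix W in R^{k x d} with entries ~ N(0, 1/k)."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=np.sqrt(1.0 / k), size=(k, d))

def embed(W: np.ndarray, s_t: np.ndarray) -> np.ndarray:
    """x_t = W s_t : project a raw state s_t in R^d down to R^k for memory lookup."""
    return W @ s_t

# Usage: embed a 64-dimensional state into a 4-dimensional key.
W = make_random_projection(d=64, k=4)
x_t = embed(W, np.ones(64))
```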

Random Projection AutoEncoder

Hence, EMU constructs a deterministic conditional autoencoder $\langle f_{\phi},\ f_{\psi} \rangle$ to generate a return-regulated state embedding:

$$
\mathcal{L}(\phi,\ \psi) = \Big[ H(s_{t}) - f_{\psi}^{H}(x_{t} \mid t) \Big]^{2} + \lambda_{\text{rcon}} \Big\| s_{t} - f_{\psi}^{s}(x_{t} \mid t) \Big\|_{2}^{2}, \qquad x_{t} = f_{\phi}(s_{t} \mid t)
$$

where $H(\cdot)$ is the highest return maintained in an episodic buffer along with the corresponding state $s_{t}$ and timestep $t$.
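For illustration, the PyTorch sketch below shows one way such a conditional autoencoder and its loss could be implemented; the layer sizes, MLP structure, and tensor shapes are assumptions rather than the paper's exact architecture, and the targets $H(s_t)$ are read from the episodic buffer whose update rule is given right after this sketch.

```python
import torch
import torch.nn as nn

class ConditionalAutoEncoder(nn.Module):
    """Sketch of <f_phi, f_psi>: an encoder conditioned on the timestep t, plus a
    decoder with a return head f_psi^H and a state-reconstruction head f_psi^s."""
    def __init__(self, state_dim: int, emb_dim: int, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim + 1, hidden), nn.ReLU(),
                                     nn.Linear(hidden, emb_dim))
        self.dec_H = nn.Sequential(nn.Linear(emb_dim + 1, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))           # f_psi^H
        self.dec_s = nn.Sequential(nn.Linear(emb_dim + 1, hidden), nn.ReLU(),
                                   nn.Linear(hidden, state_dim))   # f_psi^s

    def forward(self, s_t, t):
        # s_t: [batch, state_dim], t: [batch, 1] (e.g. normalized timestep)
        x_t = self.encoder(torch.cat([s_t, t], dim=-1))            # x_t = f_phi(s_t | t)
        H_hat = self.dec_H(torch.cat([x_t, t], dim=-1)).squeeze(-1)
        s_hat = self.dec_s(torch.cat([x_t, t], dim=-1))
        return x_t, H_hat, s_hat

def embedding_loss(model, s_t, t, H_target, lambda_rcon: float = 1.0):
    """L(phi, psi) = [H(s_t) - f_psi^H(x_t|t)]^2 + lambda_rcon * ||s_t - f_psi^s(x_t|t)||_2^2."""
    _, H_hat, s_hat = model(s_t, t)
    return ((H_target - H_hat) ** 2).mean() + lambda_rcon * ((s_t - s_hat) ** 2).sum(-1).mean()
```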

$$
H(s_{t}) = \begin{cases} \max \{ H(\hat{s}),\ R_{t} \} & \| f_{\phi}(\hat{s}) - f_{\phi}(s_{t}) \|_{2} < \delta \\ R_{t} & \| f_{\phi}(\hat{s}) - f_{\phi}(s_{t}) \|_{2} \ge \delta \end{cases} \qquad \hat{s} = \operatorname*{arg\,min}_{s \in \mathcal{D}_{E}} \| f_{\phi}(s) - f_{\phi}(s_{t}) \|_{2}
$$

The episodic memory can thus be viewed as a simple way to estimate the optimal value function $V^{\star}(s)$ from past experience.
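A minimal sketch of such a buffer is given below, assuming a flat list of embedding keys and a brute-force nearest-neighbour search; the class and method names are illustrative.

```python
import numpy as np

class EpisodicMemory:
    """Keep the highest return H seen near each stored embedding (threshold delta),
    following the update rule above; otherwise add a new entry."""
    def __init__(self, delta: float):
        self.delta = delta
        self.keys = []      # embeddings f_phi(s)
        self.values = []    # highest returns H(s)

    def update(self, x_t: np.ndarray, R_t: float) -> None:
        if self.keys:
            dists = np.linalg.norm(np.stack(self.keys) - x_t, axis=-1)
            j = int(np.argmin(dists))                       # nearest stored state s_hat
            if dists[j] < self.delta:
                self.values[j] = max(self.values[j], R_t)   # H <- max{H(s_hat), R_t}
                return
        # no sufficiently close neighbour: store a new entry with H = R_t
        self.keys.append(x_t.copy())
        self.values.append(R_t)

    def lookup(self, x_t: np.ndarray) -> float:
        """Return H at the nearest neighbour of x_t, used as a crude value estimate."""
        dists = np.linalg.norm(np.stack(self.keys) - x_t, axis=-1)
        return self.values[int(np.argmin(dists))]
```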

Episodic Incentive

Combined with the vanilla TD error, the episodic memory can be utilized to expedite the convergence of the learning process:

$$
\mathcal{L}^{\text{EC}}(\theta) = \Big[ r + \gamma \max_{a'} Q_{\theta^{-}}^{\text{tot}}(s',\ a') - Q_{\theta}^{\text{tot}}(s,\ a) \Big]^{2} + \lambda \Big[ r + \gamma H(s') - Q_{\theta}^{\text{tot}}(s,\ a) \Big]^{2}
$$

From the perspective of parameter gradients, the loss function above is equivalent to adding an extra reward $r^{\text{EC}}$:

$$
\mathcal{L}(\theta) = \Big[ r + \underset{r^{\text{EC}}}{\underbrace{\lambda \operatorname{sg} \Big( r + \gamma H(s') - Q_{\theta}^{\text{tot}}(s,\ a) \Big)}} + \gamma \max_{a'} Q_{\theta^{-}}^{\text{tot}}(s',\ a') - Q_{\theta}^{\text{tot}}(s,\ a) \Big]^{2}
$$
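The sketch below illustrates this gradient equivalence in PyTorch, using `detach` as the stop-gradient $\operatorname{sg}(\cdot)$; the tensor names `q_tot`, `q_tot_next_max_target`, and `H_next` are placeholders for $Q_{\theta}^{\text{tot}}(s,\ a)$, $\max_{a'} Q_{\theta^{-}}^{\text{tot}}(s',\ a')$, and $H(s')$.

```python
import torch

def ec_loss(r, gamma, lam, q_tot, q_tot_next_max_target, H_next):
    """Naive EC loss: TD error plus a lambda-weighted memory-target error."""
    td = r + gamma * q_tot_next_max_target - q_tot
    mem = r + gamma * H_next - q_tot
    return (td ** 2).mean() + lam * (mem ** 2).mean()

def ec_loss_as_extra_reward(r, gamma, lam, q_tot, q_tot_next_max_target, H_next):
    """Gradient-equivalent form: one TD loss with r^EC = lam * sg(r + gamma H(s') - Q_tot)."""
    r_ec = lam * (r + gamma * H_next - q_tot).detach()   # stop-gradient: r^EC is a constant
    td = r + r_ec + gamma * q_tot_next_max_target - q_tot
    return (td ** 2).mean()
```

Differentiating both forms with respect to $Q_{\theta}^{\text{tot}}(s,\ a)$ gives the same gradient, which is exactly the equivalence claimed above.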

However, naive usage of $r^{\text{EC}}$ is prone to converging to local minima, so EMU instead proposes an episodic incentive:

$$
r^{p} = \gamma \xi_{\pi}(s') \Big[ H(s') - \max_{a'} Q_{\theta^{-}}^{\text{tot}}(s',\ a') \Big] \approx \gamma \frac{N_{\xi}(s')}{N_{\text{call}}(s')} \Big[ H(s') - \max_{a'} Q_{\theta^{-}}^{\text{tot}}(s',\ a') \Big]
$$

Here $\xi_{\pi}(s') \in [0,\ 1]$ denotes the probability that $s'$ leads to a desired goal under the current policy $\pi$; it is estimated in a count-based manner from the recorded number of total visits $N_{\text{call}}(s')$ and desired visits $N_{\xi}(s')$ at the nearest neighbour in the episodic memory.
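A minimal sketch of this count-based estimate, assuming the return and visit counts of the nearest neighbour have already been looked up from the memory (all arguments are plain scalars for clarity):

```python
def episodic_incentive(H_next: float, n_xi: int, n_call: int,
                       q_next_max_target: float, gamma: float) -> float:
    """r^p ~= gamma * (N_xi(s') / N_call(s')) * [H(s') - max_a' Q_tot^-(s', a')]."""
    xi_hat = n_xi / max(n_call, 1)          # count-based estimate of xi_pi(s') in [0, 1]
    return gamma * xi_hat * (H_next - q_next_max_target)

# Example: a neighbour visited 20 times, 5 of which eventually reached the desired goal.
r_p = episodic_incentive(H_next=4.0, n_xi=5, n_call=20, q_next_max_target=3.2, gamma=0.99)
```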

As the policy converges to the optimal policy, the loss function with the episodic incentive converges to the optimal TD error:

$$
\begin{aligned} \mathcal{L}^{p}(\theta) &= \Big[ r + \gamma \underset{\to 1}{\underbrace{\xi_{\pi}(s')}} \Big[ \underset{\to V^{\star}(s')}{\underbrace{H(s')}} - \max_{a'} Q_{\theta^{-}}^{\text{tot}}(s',\ a') \Big] + \gamma \max_{a'} Q_{\theta^{-}}^{\text{tot}}(s',\ a') - Q_{\theta}^{\text{tot}}(s,\ a) \Big]^{2} \qquad (\pi \to \pi^{\star}) \\ &\to \Big[ r + \gamma V^{\star}(s') - \cancel{\gamma \max_{a'} Q_{\theta^{-}}^{\text{tot}}(s',\ a')} + \cancel{\gamma \max_{a'} Q_{\theta^{-}}^{\text{tot}}(s',\ a')} - Q_{\theta}^{\text{tot}}(s,\ a) \Big]^{2} \triangleq \mathcal{L}^{\star}(\theta) \end{aligned}
$$
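For completeness, a sketch of the resulting loss $\mathcal{L}^{p}(\theta)$, again with placeholder tensor names for the Q-values and with $r^{p}$ assumed to be precomputed from the target network and the memory:

```python
import torch

def episodic_incentive_loss(r, r_p, gamma, q_tot, q_tot_next_max_target):
    """L^p(theta): the standard TD loss with the episodic incentive r^p added to the reward.
    The whole target is detached so no gradient flows through r^p or the target network."""
    td_target = r + r_p + gamma * q_tot_next_max_target
    return ((td_target.detach() - q_tot) ** 2).mean()
```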

