EMU
State Embedding
Conventional episodic control methods adopt Gaussian random projection to embed states for memory retrieval
$$x_{t} = W s_{t} \;\Rightarrow\; \mathbb{R}^{k} = \mathbb{R}^{k \times d} \cdot \mathbb{R}^{d}, \qquad W_{ij} \sim \mathcal{N}\!\left(0,\ \tfrac{1}{k}\right)$$
Despite preserving distance relationships of the raw state space, such a random projection carries little semantic meaning
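As a minimal sketch of this projection (NumPy, with hypothetical dimensions), the key used for memory retrieval is simply a fixed Gaussian matrix applied to the raw state:

```python
import numpy as np

def make_random_projection(state_dim: int, key_dim: int, seed: int = 0) -> np.ndarray:
    """Fixed Gaussian projection W in R^{k x d} with W_ij ~ N(0, 1/k)."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(1.0 / key_dim), size=(key_dim, state_dim))

# Hypothetical sizes: a 64-dim raw state compressed into a 4-dim memory key
W = make_random_projection(state_dim=64, key_dim=4)
rng = np.random.default_rng(1)
s_a, s_b = rng.normal(size=64), rng.normal(size=64)
x_a, x_b = W @ s_a, W @ s_b   # keys used for nearest-neighbour retrieval
# Distances are preserved in expectation, but the key encodes no return information
print(np.linalg.norm(s_a - s_b), np.linalg.norm(x_a - x_b))
```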
(Figure: comparison of Random Projection and AutoEncoder state embeddings)
Hence, EMU constructs a deterministic conditional autoencoder $\langle f_{\phi},\ f_{\psi} \rangle$ to generate return-regulated state embedding
$$\mathcal{L}(\phi,\ \psi) = \Big[ H(s_{t}) - f_{\psi}^{H}(x_{t} \mid t) \Big]^{2} + \lambda_{\text{rcon}} \Big\| s_{t} - f_{\psi}^{s}(x_{t} \mid t) \Big\|_{2}^{2}, \qquad x_{t} = f_{\phi}(s_{t} \mid t)$$
where $H(\cdot)$ is the highest return maintained in an episodic buffer along with the corresponding state $s_{t}$ and timestep $t$
$$H(s_{t}) = \begin{cases} \max\{ H(\hat{s}),\ R_{t} \} & \| f_{\phi}(\hat{s}) - f_{\phi}(s_{t}) \|_{2} < \delta \\ R_{t} & \| f_{\phi}(\hat{s}) - f_{\phi}(s_{t}) \|_{2} \ge \delta \end{cases} \qquad \hat{s} = \operatorname*{arg\,min}_{s \in \mathcal{D}_{E}} \| f_{\phi}(s) - f_{\phi}(s_{t}) \|_{2}$$
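A sketch of this training objective in PyTorch, assuming the encoder $f_{\phi}$ and the two decoder heads $f_{\psi}^{H}$, $f_{\psi}^{s}$ are small MLPs conditioned on the timestep by concatenation; all layer sizes and $\lambda_{\text{rcon}}$ are placeholders rather than the paper's settings:

```python
import torch
import torch.nn as nn

class DCAE(nn.Module):
    """Deterministic conditional autoencoder <f_phi, f_psi> (illustrative shapes only)."""
    def __init__(self, state_dim: int, emb_dim: int, hidden: int = 128):
        super().__init__()
        # Encoder f_phi(s_t | t): raw state + scalar timestep -> embedding x_t
        self.encoder = nn.Sequential(nn.Linear(state_dim + 1, hidden), nn.ReLU(),
                                     nn.Linear(hidden, emb_dim))
        # Decoder head f_psi^H(x_t | t): predicts the stored highest return H(s_t)
        self.dec_H = nn.Sequential(nn.Linear(emb_dim + 1, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        # Decoder head f_psi^s(x_t | t): reconstructs the raw state s_t
        self.dec_s = nn.Sequential(nn.Linear(emb_dim + 1, hidden), nn.ReLU(),
                                   nn.Linear(hidden, state_dim))

    def forward(self, s, t):
        x = self.encoder(torch.cat([s, t], dim=-1))
        return x, self.dec_H(torch.cat([x, t], dim=-1)), self.dec_s(torch.cat([x, t], dim=-1))

def dcae_loss(model, s, t, H_target, lambda_rcon: float = 1.0):
    """[H(s_t) - f_psi^H(x_t|t)]^2 + lambda_rcon * ||s_t - f_psi^s(x_t|t)||_2^2, batch-averaged."""
    _, H_pred, s_rec = model(s, t)
    return ((H_target - H_pred.squeeze(-1)) ** 2
            + lambda_rcon * ((s - s_rec) ** 2).sum(dim=-1)).mean()
```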
The episodic memory can be viewed as a simple, non-parametric way to estimate the optimal value function $V^{\star}(s)$ from past experience
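A sketch of the buffer update implied by the definition of $H(s_{t})$ above, together with the nearest-neighbour lookup that realises this estimate (brute-force search over a flat buffer; $\delta$ is a placeholder threshold):

```python
import numpy as np

class EpisodicBuffer:
    """Flat memory of embeddings x = f_phi(s) with the highest observed return H(s)."""
    def __init__(self, delta: float):
        self.delta = delta
        self.keys, self.values = [], []      # embeddings and their H values

    def update(self, x_t: np.ndarray, R_t: float) -> None:
        """Apply the H(s_t) rule: merge into the nearest entry if closer than delta."""
        if self.keys:
            dists = np.linalg.norm(np.stack(self.keys) - x_t, axis=-1)
            i = int(np.argmin(dists))        # nearest neighbour s_hat
            if dists[i] < self.delta:
                self.values[i] = max(self.values[i], R_t)
                return
        self.keys.append(x_t)                # otherwise store a new entry with H = R_t
        self.values.append(R_t)

    def lookup(self, x: np.ndarray) -> float:
        """Nearest-neighbour estimate of V*(s) from past experience (buffer assumed non-empty)."""
        dists = np.linalg.norm(np.stack(self.keys) - x, axis=-1)
        return self.values[int(np.argmin(dists))]
```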
Episodic Incentive
Combined with the vanilla TD error, the episodic memory can be utilized to expedite the convergence of the learning process
$$\mathcal{L}^{\text{EC}}(\theta) = \Big[ r + \gamma \max_{a'} Q_{\theta^{-}}^{\text{tot}}(s',\ a') - Q_{\theta}^{\text{tot}}(s,\ a) \Big]^{2} + \lambda \Big[ r + \gamma H(s') - Q_{\theta}^{\text{tot}}(s,\ a) \Big]^{2}$$
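A sketch of this combined loss, assuming $Q_{\theta}^{\text{tot}}(s,\ a)$ and $\max_{a'} Q_{\theta^{-}}^{\text{tot}}(s',\ a')$ are produced per batch element by the underlying value-factorisation learner (e.g. QMIX) and $H(s')$ is retrieved from the episodic buffer; $\lambda$ is a placeholder:

```python
import torch

def ec_loss(q_tot: torch.Tensor, q_tot_target_max: torch.Tensor, r: torch.Tensor,
            H_next: torch.Tensor, gamma: float = 0.99, lam: float = 0.1) -> torch.Tensor:
    """[r + gamma*max_a' Q_tgt(s',a') - Q(s,a)]^2 + lam*[r + gamma*H(s') - Q(s,a)]^2."""
    td_target = (r + gamma * q_tot_target_max).detach()   # bootstrapped one-step target
    ec_target = (r + gamma * H_next).detach()             # episodic-memory target
    return ((td_target - q_tot) ** 2 + lam * (ec_target - q_tot) ** 2).mean()
```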
From the perspective of parameter gradients, the loss function above is equivalent to adding an additional reward $r^{\text{EC}}$ to the vanilla TD error, where $\operatorname{sg}(\cdot)$ denotes the stop-gradient operator
$$\mathcal{L}(\theta) = \Big[ r + \underbrace{\lambda\, \operatorname{sg}\!\Big( r + \gamma H(s') - Q_{\theta}^{\text{tot}}(s,\ a) \Big)}_{r^{\text{EC}}} + \gamma \max_{a'} Q_{\theta^{-}}^{\text{tot}}(s',\ a') - Q_{\theta}^{\text{tot}}(s,\ a) \Big]^{2}$$
However, the naive usage of $r^{\text{EC}}$ is prone to converging to local minima, so EMU proposes an episodic incentive $r^{p}$ instead
$$r^{p} = \gamma\, \xi_{\pi}(s') \Big[ H(s') - \max_{a'} Q_{\theta^{-}}^{\text{tot}}(s',\ a') \Big] \approx \gamma\, \frac{N_{\xi}(s')}{N_{\text{call}}(s')} \Big[ H(s') - \max_{a'} Q_{\theta^{-}}^{\text{tot}}(s',\ a') \Big]$$
Here $\xi_{\pi}(s') \in [0,\ 1]$ denotes the probability that $s'$ can lead to a desired goal under the current policy $\pi$; it is estimated in a count-based manner from the recorded number of total visits $N_{\text{call}}(s')$ and desired visits $N_{\xi}(s')$ on the nearest neighbour in memory
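A sketch of the count-based estimate, assuming each memory entry additionally stores the visit counts $N_{\text{call}}$ and $N_{\xi}$ for its nearest-neighbour cluster:

```python
def episodic_incentive(H_next: float, q_target_max_next: float,
                       n_desired: int, n_call: int, gamma: float = 0.99) -> float:
    """r^p = gamma * xi_pi(s') * [H(s') - max_a' Q_tgt(s',a')], with xi estimated by counts."""
    xi = n_desired / max(n_call, 1)   # N_xi(s') / N_call(s'): fraction of visits that reached the goal
    return gamma * xi * (H_next - q_target_max_next)
```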
As the policy converges to the optimal policy, the loss function with the episodic incentive converges to the optimal TD error
$$\begin{aligned}
\mathcal{L}^{p}(\theta) &= \Big[ r + \gamma \underbrace{\xi_{\pi}(s')}_{\to 1} \Big[ \underbrace{H(s')}_{\to V^{\star}(s')} - \max_{a'} Q_{\theta^{-}}^{\text{tot}}(s',\ a') \Big] + \gamma \max_{a'} Q_{\theta^{-}}^{\text{tot}}(s',\ a') - Q_{\theta}^{\text{tot}}(s,\ a) \Big]^{2} \quad \text{as } \pi \to \pi^{\star} \\
&\to \Big[ r + \gamma V^{\star}(s') - \cancel{\gamma \max_{a'} Q_{\theta^{-}}^{\text{tot}}(s',\ a')} + \cancel{\gamma \max_{a'} Q_{\theta^{-}}^{\text{tot}}(s',\ a')} - Q_{\theta}^{\text{tot}}(s,\ a) \Big]^{2} \triangleq \mathcal{L}^{\star}(\theta)
\end{aligned}$$
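Putting the pieces together, the episodic-incentive objective is an ordinary TD loss on the shaped reward $r + r^{p}$ (a sketch under the same assumptions as above):

```python
import torch

def lp_loss(q_tot: torch.Tensor, q_tot_target_max_next: torch.Tensor,
            r: torch.Tensor, r_p: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    """L^p(theta) = [r + r^p + gamma*max_a' Q_tgt(s',a') - Q(s,a)]^2, averaged over the batch."""
    target = (r + r_p + gamma * q_tot_target_max_next).detach()
    return ((target - q_tot) ** 2).mean()
```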