GENTLE
Task Representation Learning
Given the assumption that $I(z; M; x) \ge 0$ and that the dynamics and reward models of the tasks are both deterministic, where $x = (s, a)$ is a state-action pair and $y = (r, s')$ is the resulting reward and next state, we have
$$
\begin{aligned}
I(z; M) &= I(z; M \mid x) + \underbrace{I(z; M; x)}_{\ge 0} \;\ge\; I(z; M \mid x) \\
&= I(z, y; M \mid x) - I(y; M \mid z, x) \\
&= I(y; M \mid x) + \underbrace{I(z; M \mid y, x)}_{=0 \;\Leftarrow\; z \perp M \mid y, x} - I(y; M \mid z, x) \\
&= I(y; M \mid x) - H(y \mid z, x) + \underbrace{H(y \mid z, x, M)}_{=0 \;\Leftarrow\; p(y \mid x, M) = \delta[y = M(x)]}
\end{aligned}
$$
Construct a variational lower bound of the second term on the RHS, $-H(y \mid z, x)$, by introducing an additional decoder $q_\psi(y \mid z, x)$:
$$
\begin{aligned}
-H(y \mid z, x) &= \mathbb{E}_{x, y, z} \log p(y \mid z, x) \\
&= \mathbb{E}_{x, y, z}\!\left[\log \frac{p(y \mid z, x)}{q_\psi(y \mid z, x)}\, q_\psi(y \mid z, x)\right] \\
&= \mathbb{E}_{x, z} \underbrace{\mathbb{E}_{y \sim p(y \mid z, x)}\!\left[\log \frac{p(y \mid z, x)}{q_\psi(y \mid z, x)}\right]}_{D_{\mathrm{KL}}(p \,\|\, q_\psi)} + \mathbb{E}_{x, y, z} \log q_\psi(y \mid z, x) \\
&\ge \mathbb{E}_{M \sim p(M)} \mathbb{E}_{x \sim \rho(x)} \mathbb{E}_{y \sim p(y \mid x, M)} \mathbb{E}_{z \sim q_\theta(z \mid x, y)} \log q_\psi(y \mid z, x)
\end{aligned}
$$
Hence, the mutual information $I(z; M)$ between the task and the task representation can be lower bounded as
$$
I(z; M) \ge \underbrace{I(y; M \mid x)}_{\text{const}} + \mathbb{E}_{M \sim p(M)} \mathbb{E}_{x \sim \rho(x)} \mathbb{E}_{y \sim p(y \mid x, M)} \mathbb{E}_{z \sim q_\theta(z \mid x, y)} \log q_\psi(y \mid z, x)
$$
Based on this bound, GENTLE designs a deterministic task auto-encoder to learn task representations from offline datasets, where the task encoder $q_\theta(x_{1:n}, y_{1:n})$ averages the intermediate embeddings of the individual transition tuples $(x_i, y_i)$:
$$
z = q_\theta(x_{1:n}, y_{1:n}) = \frac{1}{n} \sum_{i=1}^{n} q_\theta(x_i, y_i) = \frac{1}{n} \sum_{i=1}^{n} q_\theta(s_i, a_i, r_i, s'_i)
$$
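As a concrete illustration, below is a minimal PyTorch sketch of such a mean-pooled deterministic task encoder. The class name `TaskEncoder`, the MLP architecture, and the hidden size are assumptions for illustration, not GENTLE's exact network.

```python
# Illustrative sketch (assumed architecture, not GENTLE's exact network).
import torch
import torch.nn as nn


class TaskEncoder(nn.Module):
    """Embeds each transition (s, a, r, s') and mean-pools the embeddings into z."""

    def __init__(self, state_dim: int, action_dim: int, z_dim: int, hidden: int = 128):
        super().__init__()
        in_dim = 2 * state_dim + action_dim + 1  # concatenated (s, a, r, s')
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, z_dim),
        )

    def embed(self, s, a, r, s_next):
        # Per-transition embeddings q_theta(x_i, y_i); r must have shape (n, 1).
        return self.net(torch.cat([s, a, r, s_next], dim=-1))

    def forward(self, s, a, r, s_next):
        # Task representation z: average of the n per-transition embeddings.
        return self.embed(s, a, r, s_next).mean(dim=0)
```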
The task auto-encoder is trained to maximize $I(z; M)$ by maximizing the aforementioned lower bound:
$$
J(\theta, \psi) = \mathbb{E}_{M \sim p(M)} \mathbb{E}_{x \sim \rho(x)} \mathbb{E}_{y \sim p(y \mid x, M)} \mathbb{E}_{z \sim q_\theta(z \mid x, y)} \log q_\psi(y \mid z, x)
$$
Since both the encoder and the decoder are deterministic, the decoder's log-likelihood reduces to a (negative) squared error, so the original objective can be equivalently written as minimizing the reconstruction loss
$$
J(\theta, \psi) \triangleq \mathbb{E}_{M \sim p(M)} \mathbb{E}_{x_{1:n} \sim \rho(x)} \left[ \sum_{i=1}^{n} \big\| M(x_i) - q_\psi\big(q_\theta(x_i, y_i), x_i\big) \big\|_2^2 \right]
$$
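The sketch below shows how this reconstruction objective could be computed, reusing the `TaskEncoder` sketch above; the `Decoder` class, its architecture, and the `reconstruction_loss` helper are illustrative assumptions (the mean over transitions stands in for the sum up to a constant factor).

```python
# Illustrative sketch of the reconstruction objective (assumed interfaces).
import torch
import torch.nn as nn


class Decoder(nn.Module):
    """q_psi(z, x): predicts y = (r, s') from the task embedding and (s, a)."""

    def __init__(self, state_dim: int, action_dim: int, z_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + state_dim),  # predicts (r, s')
        )

    def forward(self, z, s, a):
        return self.net(torch.cat([z, s, a], dim=-1))


def reconstruction_loss(encoder, decoder, s, a, r, s_next):
    """Squared-error reconstruction of y = (r, s') from (q_theta(x_i, y_i), x_i)."""
    z = encoder.embed(s, a, r, s_next)           # per-transition embeddings
    y_hat = decoder(z, s, a)                     # predicted (r, s')
    y = torch.cat([r, s_next], dim=-1)           # ground-truth (r, s') = M(x_i)
    return ((y - y_hat) ** 2).sum(dim=-1).mean()
```

Minimizing this loss with respect to both $\theta$ and $\psi$ trains the encoder and decoder jointly.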
To ensure independence between the task distribution and the distribution of the probing data $x$, GENTLE augments the dataset via relabeling, as summarized in the table below (a relabeling sketch follows the table).
| Component | Synthetic | Source of Sampling / Relabeling | Description |
| --- | --- | --- | --- |
| $s$ | ✗ | $\mathcal{D}(M_{1:N})$ | Overall dataset |
| $a$ | ✓ | $\pi_i(\cdot \mid s, z_i)$ | Up-to-date meta policy |
| $r$ | ✓ | $M_i(s, a)$ | Pretrained model |
| $s'$ | ✓ | $M_i(s, a)$ | Pretrained model |
GENTLE adopts TD3+BC as the backbone offline RL algorithm to train the meta policy on the multi-task dataset.
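For reference, here is a minimal sketch of the standard TD3+BC actor loss, conditioned on the task representation $z$; the `actor` / `critic` interfaces and the default $\alpha$ follow the original TD3+BC formulation and are not taken from GENTLE's implementation.

```python
# Illustrative sketch of a z-conditioned TD3+BC actor update (assumed interfaces).
import torch
import torch.nn.functional as F


def td3_bc_actor_loss(actor, critic, s, a_data, z, alpha=2.5):
    """Maximize lambda * Q(s, pi(s, z), z) while staying close to dataset actions."""
    a_pi = actor(s, z)
    q = critic(s, a_pi, z)
    lam = alpha / q.abs().mean().detach()   # adaptive scaling term from TD3+BC
    return -(lam * q).mean() + F.mse_loss(a_pi, a_data)
```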