GENTLE


Task Representation Learning

Given the assumption that $I(z;\ M;\ x) \ge 0$ and that both the dynamics and the reward model of each task are deterministic, then

$$
\begin{aligned}
I(z;\ M) &= I(z;\ M \mid x) + \underset{\ge 0}{\underbrace{I(z;\ M;\ x)}} \ge I(z;\ M \mid x) = I(z,\ y;\ M \mid x) - I(y;\ M \mid z,\ x) \\
&= I(y;\ M \mid x) + \underset{= 0}{\underbrace{I(z;\ M \mid y,\ x)}} - I(y;\ M \mid z,\ x) \quad \Leftarrow \quad (z \perp M \mid y,\ x) \\
&= I(y;\ M \mid x) - H(y \mid z,\ x) + \underset{= 0}{\underbrace{H(y \mid z,\ x,\ M)}} \quad \Leftarrow \quad p(y \mid x,\ M) = \delta \Big[ y = M(x) \Big]
\end{aligned}
$$
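
The step from the second to the third line uses the entropy form of conditional mutual information,

$$
I(y;\ M \mid z,\ x) = H(y \mid z,\ x) - H(y \mid z,\ x,\ M)
$$

and the deterministic assumption $p(y \mid x,\ M) = \delta[y = M(x)]$ forces $H(y \mid z,\ x,\ M) = 0$.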

Construct a variational lower bound of the second term on the RHS with an additional decoder $q_{\psi}(y \mid z,\ x)$

$$
\begin{aligned}
-H(y \mid z,\ x) &= \mathcal{E}_{x,\ y,\ z} \log p(y \mid z,\ x) = \mathcal{E}_{x,\ y,\ z} \left[ \log \frac{p(y \mid z,\ x)}{q_{\psi}(y \mid z,\ x)}\, q_{\psi}(y \mid z,\ x) \right] \\
&= \mathcal{E}_{x,\ z} \underset{D_{\text{KL}}(p\ \|\ q_{\psi})\ \ge\ 0}{\underbrace{\mathcal{E}_{y \sim p(y \mid z,\ x)} \left[ \log \frac{p(y \mid z,\ x)}{q_{\psi}(y \mid z,\ x)} \right]}} + \mathcal{E}_{x,\ y,\ z} \log q_{\psi}(y \mid z,\ x) \\
&\ge \mathcal{E}_{M \sim p(M)}\, \mathcal{E}_{x \sim \rho(x)}\, \mathcal{E}_{y \sim p(y \mid x,\ M)}\, \mathcal{E}_{z \sim q_{\theta}(z \mid x,\ y)} \log q_{\psi}(y \mid z,\ x)
\end{aligned}
$$

Hence, the mutual information between the task and the task representation, $I(z;\ M)$, can be lower-bounded as

$$
I(z;\ M) \ge \underset{\text{const}}{\underbrace{I(y;\ M \mid x)}} + \mathcal{E}_{M \sim p(M)}\, \mathcal{E}_{x \sim \rho(x)}\, \mathcal{E}_{y \sim p(y \mid x,\ M)}\, \mathcal{E}_{z \sim q_{\theta}(z \mid x,\ y)} \log q_{\psi}(y \mid z,\ x)
$$

Based on this, GENTLE designs a deterministic task auto-encoder to learn task representations from offline datasets,

where the task encoder $q_{\theta}(x_{1:n},\ y_{1:n})$ takes the average of the intermediate embeddings of each transition tuple $(x_{i},\ y_{i})$

$$
z = q_{\theta}(x_{1:n},\ y_{1:n}) = \frac{1}{n} \sum_{i = 1}^{n} q_{\theta}(x_{i},\ y_{i}) = \frac{1}{n} \sum_{i = 1}^{n} q_{\theta}(s_{i},\ a_{i},\ r_{i},\ s_{i}')
$$
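
A minimal PyTorch-style sketch of this averaging encoder is shown below; the class name `TaskEncoder`, layer sizes, and activations are illustrative assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn

class TaskEncoder(nn.Module):
    """Deterministic task encoder: embeds each transition (s, a, r, s')
    independently, then averages the embeddings into a single task vector z."""

    def __init__(self, state_dim, action_dim, z_dim, hidden_dim=256):
        super().__init__()
        # Per-transition embedding network q_theta(x_i, y_i)
        self.phi = nn.Sequential(
            nn.Linear(2 * state_dim + action_dim + 1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, z_dim),
        )

    def forward(self, s, a, r, s_next):
        # s: (n, state_dim), a: (n, action_dim), r: (n, 1), s_next: (n, state_dim)
        x = torch.cat([s, a, r, s_next], dim=-1)
        emb = self.phi(x)          # (n, z_dim): per-transition embeddings
        return emb.mean(dim=0)     # (z_dim,): average -> task representation z
```

The averaging makes the encoder permutation-invariant over the context transitions, so the batch can be treated as an unordered set.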

The task auto-encoder is trained to maximize $I(z;\ M)$ by maximizing the aforementioned lower bound

$$
\mathcal{J}(\theta,\ \psi) = \mathcal{E}_{M \sim p(M)}\, \mathcal{E}_{x \sim \rho(x)}\, \mathcal{E}_{y \sim p(y \mid x,\ M)}\, \mathcal{E}_{z \sim q_{\theta}(z \mid x,\ y)} \log q_{\psi}(y \mid z,\ x)
$$

Considering that both the encoder and the decoder are deterministic (so that maximizing the decoder log-likelihood reduces to minimizing a squared reconstruction error), the original objective can be equivalently defined as

$$
\mathcal{J}(\theta,\ \psi) \triangleq -\,\mathcal{E}_{M \sim p(M)}\, \mathcal{E}_{x_{1:n} \sim \rho(x)} \left[ \sum_{i = 1}^{n} \Big\| M(x_{i}) - q_{\psi}\big(q_{\theta}(x_{i},\ y_{i}),\ x_{i}\big) \Big\|_{2}^{2} \right]
$$
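
As a rough sketch of this reconstruction objective, one possible implementation is given below: the decoder predicts $y_i = (r_i,\ s'_i)$ from $(z,\ s_i,\ a_i)$, the task vector $z$ is the batch-averaged encoding from the sketch above, and all names and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class TaskDecoder(nn.Module):
    """Deterministic decoder q_psi(y | z, x): predicts [r, s'] from (z, s, a)."""

    def __init__(self, state_dim, action_dim, z_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1 + state_dim),    # outputs [r, s']
        )

    def forward(self, z, s, a):
        z = z.unsqueeze(0).expand(s.shape[0], -1)    # broadcast task vector over the batch
        return self.net(torch.cat([z, s, a], dim=-1))

def reconstruction_loss(encoder, decoder, s, a, r, s_next):
    """Squared reconstruction error || y_i - q_psi(z, x_i) ||^2 for one task's batch;
    minimizing it corresponds to maximizing the (negated) objective above."""
    z = encoder(s, a, r, s_next)            # batch-averaged task representation
    y_hat = decoder(z, s, a)
    y = torch.cat([r, s_next], dim=-1)      # y_i = M(x_i) = (r_i, s'_i)
    return ((y - y_hat) ** 2).sum(dim=-1).mean()
```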

To ensure independence between the task and the distribution of the probing data, GENTLE augments the dataset via relabeling, summarized in the following table and sketched in code below it

| Component | Synthetic | Source of Sampling / Relabeling | Description |
| :---: | :---: | :---: | :---: |
| $s$ | × | $\mathcal{D}(M_{1:N})$ | Overall Dataset |
| $a$ | ✓ | $\pi_{i}(\cdot \mid s,\ z_{i})$ | Up-to-Date Meta Policy |
| $r$ | ✓ | $\widehat{M_{i}}(s,\ a)$ | Pretrained Model |
| $s'$ | ✓ | $\widehat{M_{i}}(s,\ a)$ | Pretrained Model |
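
A rough sketch of how such a relabeled probing batch for task $i$ could be assembled; `sample_states`, `meta_policy`, and `model_i.predict` are hypothetical interfaces used only for illustration.

```python
import torch

@torch.no_grad()
def build_probing_batch(overall_dataset, meta_policy, model_i, z_i, n):
    """Build a context batch whose inputs do not depend on task i's own data:
    states are drawn from the pooled multi-task dataset, actions come from the
    current meta policy, and rewards / next states are relabeled by the
    pretrained model of task i."""
    s = overall_dataset.sample_states(n)   # s  ~ D(M_{1:N})        (overall dataset)
    a = meta_policy(s, z_i)                # a  ~ pi_i(. | s, z_i)  (up-to-date meta policy)
    r, s_next = model_i.predict(s, a)      # (r, s') from the pretrained task model
    return s, a, r, s_next
```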

Meta Behavior Learning

GENTLE adopts TD3+BC as the backbone offline RL algorithm to train the meta policy on the multi-task dataset
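
A condensed sketch of how a TD3+BC actor update could be conditioned on the task representation; the critic update, target networks, and policy smoothing are omitted, and `actor`, `critic`, and `actor_opt` are assumed to be defined elsewhere.

```python
import torch
import torch.nn.functional as F

def td3bc_actor_update(actor, critic, actor_opt, s, a_data, z, alpha=2.5):
    """One TD3+BC actor step on a task-labeled batch: maximize Q while staying
    close to the dataset actions, with the state augmented by the task vector.
    z: per-sample task embeddings of shape (batch, z_dim)."""
    sz = torch.cat([s, z], dim=-1)                   # task-conditioned input
    pi = actor(sz)
    q = critic(sz, pi)
    lam = alpha / q.abs().mean().detach()            # adaptive weight from TD3+BC
    loss = -lam * q.mean() + F.mse_loss(pi, a_data)  # Q maximization + behavior cloning
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
    return loss.item()
```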

