GENTLE
Task Representation Learning
Given the assumption that $I(z; M; x) \ge 0$ and that the dynamics and reward models of the tasks are both deterministic, where $x = (s, a)$ is a state-action pair and $y = (r, s')$ is the resulting reward and next state, we have
$$
\begin{aligned}
I(z; M) &= I(z; M \mid x) + \underbrace{I(z; M; x)}_{\ge 0} \;\ge\; I(z; M \mid x) \\
&= I(z, y; M \mid x) - I(y; M \mid z, x) \\
&= I(y; M \mid x) + \underbrace{I(z; M \mid y, x)}_{=0 \;\Leftarrow\; z \perp M \mid y, x} - I(y; M \mid z, x) \\
&= I(y; M \mid x) - H(y \mid z, x) + \underbrace{H(y \mid z, x, M)}_{=0 \;\Leftarrow\; p(y \mid x, M) = \delta[y = M(x)]}
\end{aligned}
$$
Construct a variational lower bound of the second term on the RHS, $-H(y \mid z, x)$, by introducing an additional decoder $q_\psi(y \mid z, x)$:
$$
\begin{aligned}
-H(y \mid z, x) &= \mathbb{E}_{x, y, z} \log p(y \mid z, x) \\
&= \mathbb{E}_{x, y, z}\!\left[\log \frac{p(y \mid z, x)}{q_\psi(y \mid z, x)}\, q_\psi(y \mid z, x)\right] \\
&= \mathbb{E}_{x, z} \underbrace{\mathbb{E}_{y \sim p(y \mid z, x)}\!\left[\log \frac{p(y \mid z, x)}{q_\psi(y \mid z, x)}\right]}_{D_{\mathrm{KL}}(p \,\|\, q_\psi)} + \mathbb{E}_{x, y, z} \log q_\psi(y \mid z, x) \\
&\ge \mathbb{E}_{M \sim p(M)} \mathbb{E}_{x \sim \rho(x)} \mathbb{E}_{y \sim p(y \mid x, M)} \mathbb{E}_{z \sim q_\theta(z \mid x, y)} \log q_\psi(y \mid z, x)
\end{aligned}
$$
Hence, the mutual information $I(z; M)$ between the task and the task representation can be lower bounded as
$$
I(z; M) \ge \underbrace{I(y; M \mid x)}_{\text{const}} + \mathbb{E}_{M \sim p(M)} \mathbb{E}_{x \sim \rho(x)} \mathbb{E}_{y \sim p(y \mid x, M)} \mathbb{E}_{z \sim q_\theta(z \mid x, y)} \log q_\psi(y \mid z, x)
$$
Based on this bound, GENTLE designs a deterministic task auto-encoder to learn task representations from offline datasets, where the task encoder $q_\theta(x_{1:n}, y_{1:n})$ averages the intermediate embeddings of the individual transition tuples $(x_i, y_i)$:
$$
z = q_\theta(x_{1:n}, y_{1:n}) = \frac{1}{n} \sum_{i=1}^{n} q_\theta(x_i, y_i) = \frac{1}{n} \sum_{i=1}^{n} q_\theta(s_i, a_i, r_i, s'_i)
$$
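As a concrete illustration, below is a minimal PyTorch sketch of such a mean-pooled deterministic task encoder. The class name `TaskEncoder`, the MLP architecture, and the hidden size are assumptions for illustration, not GENTLE's exact network.

```python
# Illustrative sketch (assumed architecture, not GENTLE's exact network).
import torch
import torch.nn as nn


class TaskEncoder(nn.Module):
    """Embeds each transition (s, a, r, s') and mean-pools the embeddings into z."""

    def __init__(self, state_dim: int, action_dim: int, z_dim: int, hidden: int = 128):
        super().__init__()
        in_dim = 2 * state_dim + action_dim + 1  # concatenated (s, a, r, s')
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, z_dim),
        )

    def embed(self, s, a, r, s_next):
        # Per-transition embeddings q_theta(x_i, y_i); r must have shape (n, 1).
        return self.net(torch.cat([s, a, r, s_next], dim=-1))

    def forward(self, s, a, r, s_next):
        # Task representation z: average of the n per-transition embeddings.
        return self.embed(s, a, r, s_next).mean(dim=0)
```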
The task auto-encoder is trained to maximize $I(z; M)$ by maximizing the aforementioned lower bound:
$$
J(\theta, \psi) = \mathbb{E}_{M \sim p(M)} \mathbb{E}_{x \sim \rho(x)} \mathbb{E}_{y \sim p(y \mid x, M)} \mathbb{E}_{z \sim q_\theta(z \mid x, y)} \log q_\psi(y \mid z, x)
$$
Since both the encoder and the decoder are deterministic, the decoder's log-likelihood reduces to a (negative) squared error, so the original objective can be equivalently written as minimizing the reconstruction loss
$$
J(\theta, \psi) \triangleq \mathbb{E}_{M \sim p(M)} \mathbb{E}_{x_{1:n} \sim \rho(x)} \left[ \sum_{i=1}^{n} \big\| M(x_i) - q_\psi\big(q_\theta(x_i, y_i), x_i\big) \big\|_2^2 \right]
$$
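The sketch below shows how this reconstruction objective could be computed, reusing the `TaskEncoder` sketch above; the `Decoder` class, its architecture, and the `reconstruction_loss` helper are illustrative assumptions (the mean over transitions stands in for the sum up to a constant factor).

```python
# Illustrative sketch of the reconstruction objective (assumed interfaces).
import torch
import torch.nn as nn


class Decoder(nn.Module):
    """q_psi(z, x): predicts y = (r, s') from the task embedding and (s, a)."""

    def __init__(self, state_dim: int, action_dim: int, z_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + state_dim),  # predicts (r, s')
        )

    def forward(self, z, s, a):
        return self.net(torch.cat([z, s, a], dim=-1))


def reconstruction_loss(encoder, decoder, s, a, r, s_next):
    """Squared-error reconstruction of y = (r, s') from (q_theta(x_i, y_i), x_i)."""
    z = encoder.embed(s, a, r, s_next)           # per-transition embeddings
    y_hat = decoder(z, s, a)                     # predicted (r, s')
    y = torch.cat([r, s_next], dim=-1)           # ground-truth (r, s') = M(x_i)
    return ((y - y_hat) ** 2).sum(dim=-1).mean()
```

Minimizing this loss with respect to both $\theta$ and $\psi$ trains the encoder and decoder jointly.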
To ensure independence between the task distribution and the distribution of the probing data $x$, GENTLE augments the dataset via relabeling, as summarized in the table below (a relabeling sketch follows the table).
| Component | Synthetic | Source of Sampling / Relabeling | Description |
| --- | --- | --- | --- |
| $s$ | ✗ | $\mathcal{D}(M_{1:N})$ | Overall dataset |
| $a$ | ✓ | $\pi_i(\cdot \mid s, z_i)$ | Up-to-date meta policy |
| $r$ | ✓ | $M_i(s, a)$ | Pretrained model |
| $s'$ | ✓ | $M_i(s, a)$ | Pretrained model |
GENTLE adopts TD3+BC as the backbone offline RL algorithm to train the meta policy on the multi-task dataset.
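For reference, here is a minimal sketch of the standard TD3+BC actor loss, conditioned on the task representation $z$; the `actor` / `critic` interfaces and the default $\alpha$ follow the original TD3+BC formulation and are not taken from GENTLE's implementation.

```python
# Illustrative sketch of a z-conditioned TD3+BC actor update (assumed interfaces).
import torch
import torch.nn.functional as F


def td3_bc_actor_loss(actor, critic, s, a_data, z, alpha=2.5):
    """Maximize lambda * Q(s, pi(s, z), z) while staying close to dataset actions."""
    a_pi = actor(s, z)
    q = critic(s, a_pi, z)
    lam = alpha / q.abs().mean().detach()   # adaptive scaling term from TD3+BC
    return -(lam * q).mean() + F.mse_loss(a_pi, a_data)
```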