PEARL
Probabilistic Task Inference
PEARL infers the task as a latent embedding $z$ through a probabilistic posterior encoder conditioned on past transitions (the context):

$$q_\phi(z \mid c_{1:N}) \propto \prod_{n=1}^{N} \Psi_\phi(z \mid c_n) = \prod_{n=1}^{N} \mathcal{N}\big[f_\phi^{\mu}(c_n),\ f_\phi^{\sigma}(c_n)\big], \qquad c_n = (s, a, r, s')_n$$
where the permutation-invariant encoder $q_\phi(\cdot \mid c_{1:N})$ is modeled as the product of independent Gaussian factors.
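A minimal PyTorch sketch of such a product-of-Gaussians encoder (the MLP architecture, hidden size, and softplus variance parameterization are illustrative assumptions, not the exact PEARL configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProductOfGaussiansEncoder(nn.Module):
    """Permutation-invariant context encoder q_phi(z | c_{1:N}).

    Each transition c_n = (s, a, r, s') is mapped to a Gaussian factor
    Psi_phi(z | c_n) = N(mu_n, sigma_n^2); the posterior is their product.
    """

    def __init__(self, context_dim, latent_dim, hidden_dim=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(context_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2 * latent_dim),   # [mu_n, raw variance]
        )

    def forward(self, context):
        # context: (N, context_dim), one row per transition (s, a, r, s')
        mu, raw_var = self.net(context).chunk(2, dim=-1)
        var = F.softplus(raw_var).clamp(min=1e-7)
        # Product of independent Gaussian factors = precision-weighted fusion:
        # 1/sigma^2 = sum_n 1/sigma_n^2,  mu = sigma^2 * sum_n mu_n / sigma_n^2
        precision = 1.0 / var
        post_var = 1.0 / precision.sum(dim=0)
        post_mu = post_var * (mu * precision).sum(dim=0)
        return torch.distributions.Normal(post_mu, post_var.sqrt())
```

Calling `encoder(context).rsample()` then yields a reparameterized task embedding; the estimate is permutation-invariant in the context and can be refined as more transitions arrive.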
In general, the context encoder $q_\phi(z \mid c)$ can be optimized with a task-relevant objective augmented by an information-bottleneck term:

$$\min_\phi\ \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})}\, \mathbb{E}_{c \sim \mathcal{T}}\Big[\mathbb{E}_{z \sim q_\phi(z \mid c)} \mathcal{R}(\mathcal{T}, z) + \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid c)\ \Vert\ p(z)\big)\Big]$$
where the objective $\mathcal{R}(\mathcal{T}, z)$ can be derived from contextual transition / reward model learning or from behavior learning.
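For instance, with a unit Gaussian prior $p(z) = \mathcal{N}(0, I)$ the bottleneck term is a closed-form Gaussian KL; a short sketch on top of the encoder above (the weight `beta` is illustrative):

```python
import torch

def kl_regularizer(q_z, beta=0.1):
    """Information bottleneck: beta * KL(q_phi(z|c) || N(0, I))."""
    prior = torch.distributions.Normal(torch.zeros_like(q_z.loc),
                                       torch.ones_like(q_z.scale))
    return beta * torch.distributions.kl_divergence(q_z, prior).sum()
```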
Off-Policy Behavior Learning
PEARL leverages off-policy RL (SAC) to learn a contextual value function and policy over a set of similar tasks $\{\mathcal{T}_i\}$.
To alleviate the mismatch between the state-action distribution stored in the replay buffer and the one visited by the evolving policy, the data used for task inference are sampled uniformly from the most recently collected batch of transitions via a separate context sampler $\mathcal{S}_c$.
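One way to realize such a recency-based sampler $\mathcal{S}_c$ is to restrict context sampling to a sliding window over each task's newest transitions; the buffer interface and window size below are assumptions, not PEARL's actual data structures:

```python
import numpy as np

class RecentContextSampler:
    """Samples task-inference contexts uniformly from the most recent data."""

    def __init__(self, recent_size=400):
        self.recent_size = recent_size   # size of the "recently collected" window (assumed)
        self.transitions = []            # flat list of (s, a, r, s') tuples

    def add_batch(self, batch):
        # Append newly collected transitions for this task.
        self.transitions.extend(batch)

    def sample_context(self, n):
        # Uniform sampling restricted to the most recent window.
        recent = self.transitions[-self.recent_size:]
        idx = np.random.randint(0, len(recent), size=n)
        return [recent[i] for i in idx]
```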
| Meta-Training | Meta-Testing |
| --- | --- |
| *(figure)* | *(figure)* |
The context encoder, actor, and critic are optimized jointly over separately sampled (disentangled) context and RL batches via the following objectives (a training-step sketch follows the table):
| Objective | Target |
| --- | --- |
| $\mathcal{L}_{\mathrm{KL}}(c) = \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid c)\ \Vert\ p(z)\big)$ | $\phi$ |
| $\mathcal{L}_{\mathrm{critic}}(b, z) = \mathbb{E}_{(s, a, r, s') \sim b}\Big[r + \gamma\, \mathbb{E}_{a' \sim \pi_\theta(\cdot \mid s', z)}\big(Q_{\theta^-}(s', a', z) - \alpha \log \pi_\theta(a' \mid s', z)\big) - Q_\theta(s, a, z)\Big]^2$ | $\theta,\ \phi$ |
| $\mathcal{L}_{\mathrm{actor}}(b, z) = -\,\mathbb{E}_{s \sim b}\, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s,\, \mathrm{sg}(z))}\big[Q_\theta(s, a, \mathrm{sg}(z)) - \alpha \log \pi_\theta(a \mid s, \mathrm{sg}(z))\big]$ | $\theta$ |
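A compact sketch of one joint update implementing the three objectives above; the module interfaces (`encoder`, `actor.sample`, `critic`, `target_critic`), optimizer wiring, and hyperparameter values are assumptions made for illustration, and a single $\theta$ labels both actor and critic parameters only to mirror the table's notation:

```python
import torch
import torch.nn.functional as F

def pearl_update(encoder, actor, critic, target_critic,
                 enc_opt, actor_opt, critic_opt,
                 context, rl_batch, gamma=0.99, alpha=0.2, beta=0.1):
    """One joint update of the encoder (phi) and the critic / actor (theta).

    `context` and `rl_batch` are sampled independently (disentangled);
    the module and optimizer interfaces are assumed, not PEARL's actual API.
    """
    s, a, r, s_next, done = rl_batch

    # --- task inference: z ~ q_phi(z | c), with KL bottleneck on phi ---
    q_z = encoder(context)                       # Normal posterior over z
    z = q_z.rsample().expand(s.shape[0], -1)     # reparameterized, shared across batch
    prior = torch.distributions.Normal(torch.zeros_like(q_z.loc),
                                       torch.ones_like(q_z.scale))
    kl_loss = beta * torch.distributions.kl_divergence(q_z, prior).sum()

    # --- critic loss: trains theta (critic) and phi (through z) ---
    with torch.no_grad():
        a_next, logp_next = actor.sample(s_next, z)   # assumed: (action, log-prob)
        q_target = r + gamma * (1 - done) * (
            target_critic(s_next, a_next, z) - alpha * logp_next)
    critic_loss = F.mse_loss(critic(s, a, z), q_target)

    enc_opt.zero_grad(); critic_opt.zero_grad()
    (critic_loss + kl_loss).backward()
    enc_opt.step(); critic_opt.step()

    # --- actor loss: trains theta only, z is detached (stop-gradient sg(z)) ---
    z_sg = z.detach()
    a_new, logp_new = actor.sample(s, z_sg)
    actor_loss = (alpha * logp_new - critic(s, a_new, z_sg)).mean()

    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```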
During meta-testing, PEARL explores via posterior sampling: it acts with the contextual policy under diverse task embeddings drawn from a posterior that is continually updated from the growing experience on the new task.
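A schematic meta-testing loop in that spirit; `collect_episode` and `stack_context` are hypothetical helpers for rolling out the policy and stacking transitions into encoder inputs:

```python
import torch

def adapt_to_new_task(env, encoder, actor, num_episodes=3, latent_dim=5):
    """Meta-testing: explore with task embeddings sampled from a growing posterior."""
    context = []                                  # accumulated (s, a, r, s') transitions
    # Before any task data is seen, start from the prior p(z) = N(0, I).
    posterior = torch.distributions.Normal(torch.zeros(latent_dim),
                                           torch.ones(latent_dim))
    for _ in range(num_episodes):
        z = posterior.sample()                    # one embedding per episode
        episode = collect_episode(env, actor, z)  # hypothetical rollout helper
        context.extend(episode)
        # Re-infer the posterior from all experience gathered so far.
        posterior = encoder(stack_context(context))  # hypothetical stacking helper
    return posterior
```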