PEARL

Probabilistic Task Inference

PEARL infers the task as a latent embedding through a probabilistic encoder that approximates the posterior conditioned on past transitions (the context)

$$q_{\phi}(z \mid c_{1:N}) \propto \prod_{n = 1}^{N} \Psi_{\phi}(z \mid c_{n}) = \prod_{n = 1}^{N} \mathcal{N} \Big[ f_{\phi}^{\mu}(c_{n}),\ f_{\phi}^{\sigma}(c_{n}) \Big] \qquad c_{n} = (s,\ a,\ r,\ s')_{n}$$

where the permutation-invariant encoder $q_{\phi}(\cdot \mid c_{1:N})$ is modeled as the product of independent Gaussian factors
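
A minimal sketch of this product of Gaussian factors (PyTorch; the per-transition encoder `f_phi` in the usage comment is a hypothetical module that outputs a mean and variance for each $c_n$):

```python
import torch

def product_of_gaussians(mu, sigma_sq):
    """Combine N independent Gaussian factors N(mu_n, sigma_n^2) into one Gaussian:
    precisions add, and the mean is the precision-weighted average.

    mu, sigma_sq: tensors of shape (N, latent_dim).
    """
    sigma_sq = torch.clamp(sigma_sq, min=1e-7)        # numerical safety
    precision = 1.0 / sigma_sq                        # per-factor precision
    post_var = 1.0 / precision.sum(dim=0)             # combined variance
    post_mu = post_var * (precision * mu).sum(dim=0)  # precision-weighted mean
    return post_mu, post_var

# Usage sketch: f_phi (assumed) maps every context tuple c_n = (s, a, r, s') to a
# per-transition mean and variance; the product is permutation-invariant because
# the sum over n ignores the ordering of the context.
# mu_n, sigma_sq_n = f_phi(context)                   # each of shape (N, latent_dim)
# z_mu, z_var = product_of_gaussians(mu_n, sigma_sq_n)
# z = torch.distributions.Normal(z_mu, z_var.sqrt()).rsample()
```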

In general, the context encoder $q_{\phi}(z \mid c)$ can be optimized with a task-dependent objective plus an additional information-bottleneck term

$$\min_{\phi} \mathcal{E}_{\mathcal{T} \sim p(\mathcal{T})}\, \mathcal{E}_{c \sim \mathcal{T}} \Big[ \mathcal{E}_{z \sim q_{\phi}(z \mid c)} R(\mathcal{T},\ z) + \beta D_{\text{KL}} \Big( q_{\phi}(z \mid c)\ \|\ p(z) \Big) \Big]$$

where the objective $R(\mathcal{T},\ z)$ can be derived from contextual transition / reward model learning or behavior learning
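
As a small illustration, the bottleneck term with a standard normal prior $p(z) = \mathcal{N}(0, I)$ might be computed as below (PyTorch sketch; `z_mu`, `z_var` are the posterior parameters from the product above). The same term reappears as $\mathcal{L}_{\text{KL}}$ in the table further down:

```python
import torch
from torch.distributions import Normal, kl_divergence

def kl_to_prior(z_mu, z_var):
    """Information-bottleneck term: KL( q_phi(z | c) || p(z) ) against a standard
    normal prior p(z) = N(0, I), summed over the latent dimensions."""
    posterior = Normal(z_mu, z_var.sqrt())
    prior = Normal(torch.zeros_like(z_mu), torch.ones_like(z_var))
    return kl_divergence(posterior, prior).sum()

# The full encoder objective adds beta * kl_to_prior(z_mu, z_var) to a task-specific
# term R(T, z), e.g. a model-prediction loss or the critic loss used below.
```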

Off-Policy Behavior Learning

PEARL leverages off-policy RL (SAC) to learn a contextual value function and policy over a distribution of similar tasks $\{ \mathcal{T}_{i} \}$

To alleviate the mismatch between the state-action distribution stored in the replay buffer and the one visited by the evolving policy, the data used for task inference are sampled uniformly from the most recently collected batch of data via the context sampler $\mathcal{S}_{c}$
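
One possible form of such a recency-based sampler (a sketch; the class name, capacity, and interface are illustrative assumptions, not PEARL's actual implementation):

```python
import random
from collections import deque

class RecentContextSampler:
    """Hypothetical context sampler S_c: it retains only the most recently collected
    transitions and samples context batches uniformly from them, so the encoder sees
    data whose state-action distribution stays close to the current policy's."""

    def __init__(self, capacity=10_000):
        self.recent = deque(maxlen=capacity)   # older transitions are discarded

    def add(self, transition):                 # transition = (s, a, r, s')
        self.recent.append(transition)

    def sample(self, num_context):
        return random.sample(list(self.recent), min(num_context, len(self.recent)))

# The RL batch b, in contrast, is drawn from the full off-policy replay buffer,
# which is what lets SAC reuse data collected far in the past.
```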

(Figure: PEARL's meta-training and meta-testing procedures)

The context encoder, actor, and critic are optimized jointly, with context samples drawn separately from the RL batch, via the following objectives (a loss-computation sketch follows the table)

| Objective | Target |
| --- | --- |
| $\mathcal{L}_{\text{KL}}(c) = \beta D_{\text{KL}} \big( q_{\phi}(z \mid c) \,\Vert\, p(z) \big)$ | $\phi$ |
| $\mathcal{L}_{\text{critic}}(b,\ z) = \mathcal{E}_{(s,\ a,\ r,\ s') \sim b} \Big[ \Big( r + \gamma\, \mathcal{E}_{a' \sim \pi_{\theta}(\cdot \mid s',\ z)} \big( Q_{\theta^{-}}(s',\ a',\ z) - \alpha \log \pi_{\theta}(a' \mid s',\ z) \big) - Q_{\theta}(s,\ a,\ z) \Big)^{2} \Big]$ | $\theta,\ \phi$ |
| $\mathcal{L}_{\text{actor}}(b,\ z) = -\mathcal{E}_{s \sim b}\, \mathcal{E}_{a \sim \pi_{\theta}(\cdot \mid s,\ \operatorname{sg}(z))} \Big[ Q_{\theta}(s,\ a,\ \operatorname{sg}(z)) - \alpha \log \pi_{\theta}(a \mid s,\ \operatorname{sg}(z)) \Big]$ | $\theta$ |
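
A minimal sketch of how these three losses might be computed in a single joint update (PyTorch; `encoder`, `actor`, `critic`, and `critic_target` are hypothetical modules, and batching of the embedding is glossed over):

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

def pearl_losses(encoder, actor, critic, critic_target, context, batch,
                 beta=0.1, alpha=0.2, gamma=0.99):
    """One joint update, assuming hypothetical encoder/actor/critic modules.
    The encoder (phi) receives gradients only through the KL term and the critic
    loss; the actor is conditioned on a detached (stop-gradient) embedding."""
    s, a, r, s2, done = batch                     # RL batch b from the replay buffer

    # Task inference: posterior from the context, reparameterized sample of z.
    z_mu, z_var = encoder(context)
    q_z = Normal(z_mu, z_var.sqrt())
    z = q_z.rsample()

    # KL bottleneck against the standard normal prior (updates phi).
    p_z = Normal(torch.zeros_like(z_mu), torch.ones_like(z_var))
    loss_kl = beta * kl_divergence(q_z, p_z).sum()

    # Critic loss (updates theta and phi; gradients flow into the encoder via z).
    with torch.no_grad():
        a2, logp2 = actor(s2, z)
        soft_value = critic_target(s2, a2, z) - alpha * logp2
        target_q = r + gamma * (1.0 - done) * soft_value
    loss_critic = F.mse_loss(critic(s, a, z), target_q)

    # Actor loss (updates theta only; the embedding is detached).
    z_sg = z.detach()
    a_new, logp = actor(s, z_sg)
    loss_actor = (alpha * logp - critic(s, a_new, z_sg)).mean()

    return loss_kl, loss_critic, loss_actor
```

Stepping each loss with the optimizer of its own target parameters realizes the Target column of the table.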

During testing, PEARL explores by conditioning its policy on diverse task embeddings sampled from the posterior inferred from the growing experience (context)
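
A sketch of this meta-test loop (hypothetical `env`, `actor`, and `encoder` interfaces; the old Gym-style reset/step signature is assumed):

```python
import torch

def meta_test(env, actor, encoder, latent_dim, num_episodes=3):
    """Sketch of meta-test adaptation: sample z from the prior at first, then from
    the posterior as context accumulates, so exploration amounts to posterior
    sampling over task hypotheses."""
    context = []                                    # grows across episodes
    for _ in range(num_episodes):
        if context:
            z_mu, z_var = encoder(context)          # posterior from collected context
            z = torch.distributions.Normal(z_mu, z_var.sqrt()).sample()
        else:
            z = torch.randn(latent_dim)             # prior p(z) = N(0, I)
        s, done = env.reset(), False
        while not done:
            a, _ = actor(s, z)                      # policy conditioned on the sampled embedding
            s2, r, done, _ = env.step(a)
            context.append((s, a, r, s2))           # new experience sharpens the posterior
            s = s2
    return context
```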

