PEARL
Probabilistic Task Inference
PEARL infers the task as a latent embedding $z$ through a probabilistic posterior encoder conditioned on past transitions (the context):

$$q_\phi(z \mid c_{1:N}) \propto \prod_{n=1}^{N} \Psi_\phi(z \mid c_n) = \prod_{n=1}^{N} \mathcal{N}\big[f_\phi^{\mu}(c_n),\ f_\phi^{\sigma}(c_n)\big], \qquad c_n = (s, a, r, s')_n$$
where the permutation-invariant encoder $q_\phi(\cdot \mid c_{1:N})$ is modeled as the product of independent Gaussian factors.
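A minimal PyTorch sketch of such a product-of-Gaussians encoder (the MLP architecture, hidden size, and softplus variance parameterization are illustrative assumptions, not the exact PEARL configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProductOfGaussiansEncoder(nn.Module):
    """Permutation-invariant context encoder q_phi(z | c_{1:N}).

    Each transition c_n = (s, a, r, s') is mapped to a Gaussian factor
    Psi_phi(z | c_n) = N(mu_n, sigma_n^2); the posterior is their product.
    """

    def __init__(self, context_dim, latent_dim, hidden_dim=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(context_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2 * latent_dim),   # [mu_n, raw variance]
        )

    def forward(self, context):
        # context: (N, context_dim), one row per transition (s, a, r, s')
        mu, raw_var = self.net(context).chunk(2, dim=-1)
        var = F.softplus(raw_var).clamp(min=1e-7)
        # Product of independent Gaussian factors = precision-weighted fusion:
        # 1/sigma^2 = sum_n 1/sigma_n^2,  mu = sigma^2 * sum_n mu_n / sigma_n^2
        precision = 1.0 / var
        post_var = 1.0 / precision.sum(dim=0)
        post_mu = post_var * (mu * precision).sum(dim=0)
        return torch.distributions.Normal(post_mu, post_var.sqrt())
```

Calling `encoder(context).rsample()` then yields a reparameterized task embedding; the estimate is permutation-invariant in the context and can be refined as more transitions arrive.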
In general, the context encoder $q_\phi(z \mid c)$ can be optimized with a task-relevant objective augmented by an information-bottleneck term:

$$\min_\phi\ \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})}\, \mathbb{E}_{c \sim \mathcal{T}}\Big[\mathbb{E}_{z \sim q_\phi(z \mid c)} \mathcal{R}(\mathcal{T}, z) + \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid c)\ \Vert\ p(z)\big)\Big]$$
where the objective $\mathcal{R}(\mathcal{T}, z)$ can be derived from contextual transition / reward model learning or from behavior learning.
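For instance, with a unit Gaussian prior $p(z) = \mathcal{N}(0, I)$ the bottleneck term is a closed-form Gaussian KL; a short sketch on top of the encoder above (the weight `beta` is illustrative):

```python
import torch

def kl_regularizer(q_z, beta=0.1):
    """Information bottleneck: beta * KL(q_phi(z|c) || N(0, I))."""
    prior = torch.distributions.Normal(torch.zeros_like(q_z.loc),
                                       torch.ones_like(q_z.scale))
    return beta * torch.distributions.kl_divergence(q_z, prior).sum()
```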
Off-Policy Behavior Learning
PEARL leverages off-policy RL (SAC) to learn a contextual value function and policy over a set of similar tasks $\{\mathcal{T}_i\}$.
To alleviate the mismatch between the state-action distribution stored in the replay buffer and the one visited by the evolving policy, the data used for task inference are sampled uniformly from the most recently collected batch of transitions via a separate context sampler $\mathcal{S}_c$.
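One way to realize such a recency-based sampler $\mathcal{S}_c$ is to restrict context sampling to a sliding window over each task's newest transitions; the buffer interface and window size below are assumptions, not PEARL's actual data structures:

```python
import numpy as np

class RecentContextSampler:
    """Samples task-inference contexts uniformly from the most recent data."""

    def __init__(self, recent_size=400):
        self.recent_size = recent_size   # size of the "recently collected" window (assumed)
        self.transitions = []            # flat list of (s, a, r, s') tuples

    def add_batch(self, batch):
        # Append newly collected transitions for this task.
        self.transitions.extend(batch)

    def sample_context(self, n):
        # Uniform sampling restricted to the most recent window.
        recent = self.transitions[-self.recent_size:]
        idx = np.random.randint(0, len(recent), size=n)
        return [recent[i] for i in idx]
```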
| Meta-Training | Meta-Testing |
| --- | --- |
| *(figure)* | *(figure)* |
The context encoder, actor, and critic are optimized jointly over separately sampled (disentangled) context and RL batches via the following objectives (a training-step sketch follows the table):
| Objective | Target |
| --- | --- |
| $\mathcal{L}_{\mathrm{KL}}(c) = \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid c)\ \Vert\ p(z)\big)$ | $\phi$ |
| $\mathcal{L}_{\mathrm{critic}}(b, z) = \mathbb{E}_{(s, a, r, s') \sim b}\Big[r + \gamma\, \mathbb{E}_{a' \sim \pi_\theta(\cdot \mid s', z)}\big(Q_{\theta^-}(s', a', z) - \alpha \log \pi_\theta(a' \mid s', z)\big) - Q_\theta(s, a, z)\Big]^2$ | $\theta,\ \phi$ |
| $\mathcal{L}_{\mathrm{actor}}(b, z) = -\,\mathbb{E}_{s \sim b}\, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s,\, \mathrm{sg}(z))}\big[Q_\theta(s, a, \mathrm{sg}(z)) - \alpha \log \pi_\theta(a \mid s, \mathrm{sg}(z))\big]$ | $\theta$ |
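A compact sketch of one joint update implementing the three objectives above; the module interfaces (`encoder`, `actor.sample`, `critic`, `target_critic`), optimizer wiring, and hyperparameter values are assumptions made for illustration, and a single $\theta$ labels both actor and critic parameters only to mirror the table's notation:

```python
import torch
import torch.nn.functional as F

def pearl_update(encoder, actor, critic, target_critic,
                 enc_opt, actor_opt, critic_opt,
                 context, rl_batch, gamma=0.99, alpha=0.2, beta=0.1):
    """One joint update of the encoder (phi) and the critic / actor (theta).

    `context` and `rl_batch` are sampled independently (disentangled);
    the module and optimizer interfaces are assumed, not PEARL's actual API.
    """
    s, a, r, s_next, done = rl_batch

    # --- task inference: z ~ q_phi(z | c), with KL bottleneck on phi ---
    q_z = encoder(context)                       # Normal posterior over z
    z = q_z.rsample().expand(s.shape[0], -1)     # reparameterized, shared across batch
    prior = torch.distributions.Normal(torch.zeros_like(q_z.loc),
                                       torch.ones_like(q_z.scale))
    kl_loss = beta * torch.distributions.kl_divergence(q_z, prior).sum()

    # --- critic loss: trains theta (critic) and phi (through z) ---
    with torch.no_grad():
        a_next, logp_next = actor.sample(s_next, z)   # assumed: (action, log-prob)
        q_target = r + gamma * (1 - done) * (
            target_critic(s_next, a_next, z) - alpha * logp_next)
    critic_loss = F.mse_loss(critic(s, a, z), q_target)

    enc_opt.zero_grad(); critic_opt.zero_grad()
    (critic_loss + kl_loss).backward()
    enc_opt.step(); critic_opt.step()

    # --- actor loss: trains theta only, z is detached (stop-gradient sg(z)) ---
    z_sg = z.detach()
    a_new, logp_new = actor.sample(s, z_sg)
    actor_loss = (alpha * logp_new - critic(s, a_new, z_sg)).mean()

    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```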
During meta-testing, PEARL explores via posterior sampling: it acts with the contextual policy under diverse task embeddings drawn from a posterior that is continually updated from the growing experience on the new task.
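A schematic meta-testing loop in that spirit; `collect_episode` and `stack_context` are hypothetical helpers for rolling out the policy and stacking transitions into encoder inputs:

```python
import torch

def adapt_to_new_task(env, encoder, actor, num_episodes=3, latent_dim=5):
    """Meta-testing: explore with task embeddings sampled from a growing posterior."""
    context = []                                  # accumulated (s, a, r, s') transitions
    # Before any task data is seen, start from the prior p(z) = N(0, I).
    posterior = torch.distributions.Normal(torch.zeros(latent_dim),
                                           torch.ones(latent_dim))
    for _ in range(num_episodes):
        z = posterior.sample()                    # one embedding per episode
        episode = collect_episode(env, actor, z)  # hypothetical rollout helper
        context.extend(episode)
        # Re-infer the posterior from all experience gathered so far.
        posterior = encoder(stack_context(context))  # hypothetical stacking helper
    return posterior
```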