PALO
Instruction Decomposition
In a contextual MDP, the policy $\pi(a_t \mid s_t, \ell)$ is conditioned on a free-form language instruction $\ell \in \mathcal{L}$, where $\ell$ can be hierarchically decomposed into ordered subtasks $c_{1:K} \in \mathcal{L}^K \sim p_{\mathcal{M}}(c_{1:K} \mid \ell, s_0)$ through a VLM $\mathcal{M}$.
The full task can then be accomplished by solving these subtasks step by step, i.e., conditioning the policy on each subtask $\pi(a_t \mid s_t, c_k)$ within the corresponding segment $u_k$, where $u_{1:K}$ is an ordered partition of $\{0, 1, \cdots, H\}$. A minimal execution sketch follows below.
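A minimal sketch of this execution scheme, assuming hypothetical interfaces: `env`, `policy`, and `vlm_propose_subtasks` stand in for the environment, $\pi(a_t \mid s_t, c_k)$, and the VLM decomposer $p_{\mathcal{M}}(c_{1:K} \mid \ell, s_0)$; none of these names come from the original method.

```python
# Minimal sketch (illustrative names): executing a long-horizon task by
# conditioning the policy on the active subtask c_k inside its segment u_k.
from typing import Callable, List, Sequence, Tuple


def subtask_index(t: int, partition: Sequence[Tuple[int, int]]) -> int:
    """Return k such that the segment u_k = [start, end) contains timestep t."""
    for k, (start, end) in enumerate(partition):
        if start <= t < end:
            return k
    raise ValueError(f"timestep {t} is not covered by the partition")


def rollout_with_decomposition(
    env,
    policy: Callable,                      # pi(a_t | s_t, c_k)
    vlm_propose_subtasks: Callable,        # p_M(c_{1:K} | l, s_0), assumed interface
    instruction: str,                      # free-form instruction l
    partition: Sequence[Tuple[int, int]],  # ordered partition u_{1:K} of {0, ..., H}
):
    """Solve the full task by following each subtask within its own segment."""
    state = env.reset()
    subtasks: List[str] = vlm_propose_subtasks(instruction, state)  # c_{1:K}
    horizon = partition[-1][1]             # H
    for t in range(horizon):
        k = subtask_index(t, partition)
        action = policy(state, subtasks[k])  # condition only on the active subtask
        state = env.step(action)
    return state
```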
BC Pretraining
An instruction-conditioned policy is pretrained via BC on expert trajectories generated by an expert policy $\pi_\beta$, which are further augmented with subtask labels $c_{1:K}$ and segment labels $u_{1:K}$ to form a dataset $\mathcal{D}_{\mathrm{prior}}$.
$$
\begin{aligned}
\mathcal{L}_{\mathrm{BC}}(\theta) &= -\mathbb{E}_{\ell \sim \rho_{\mathrm{prior}}}\,\mathbb{E}_{(s_t, a_t) \sim \pi_\beta}\left[\sum_{t=1}^{H} \log \pi_\theta(a_t \mid s_t, \ell)\right] \\
&\approx -\mathbb{E}_{(s_t, a_t, c, u, \ell) \sim \mathcal{D}_{\mathrm{prior}}}\left[\sum_{k=1}^{K} \sum_{t \in u_k} \log \pi_\theta(a_t \mid s_t, c_k)\right] \\
&\Rightarrow -\mathbb{E}_{(s_t, a_t, c, u, \ell) \sim \mathcal{D}_{\mathrm{prior}}}\left[\sum_{k=1}^{K} \sum_{t \in u_k} \Big( \log \pi_\theta(a_t \mid s_t, c_k^H, c_k^L) + \log \pi_\theta(a_t \mid s_t, c_k^H, 0) + \log \pi_\theta(a_t \mid s_t, 0, c_k^L) \Big)\right]
\end{aligned}
$$
where each subtask instruction $c_k$ is decomposed into a high-level part $c_k^H$ and a low-level part $c_k^L$, and $0$ denotes a dropped (null) instruction component. The final form of the objective encourages the policy to learn to follow instructions at both abstraction levels.
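A minimal sketch of this augmented objective, assuming a hypothetical policy object exposing `log_prob(action, state, high_instr, low_instr)` that accepts an empty string for a dropped component (the names and batch layout are illustrative, not from the original implementation):

```python
# Minimal sketch of the augmented BC loss with high-/low-level instruction dropout.
NULL = ""  # stands in for the dropped (null) instruction component in the objective


def bc_loss(policy, batch):
    """Average negative log-likelihood summed over the three conditioning variants.

    `batch` is assumed to hold aligned lists: states, actions, and the
    high-/low-level parts (c_k^H, c_k^L) of the subtask active at each step.
    """
    total = 0.0
    for s, a, c_high, c_low in zip(
        batch["states"], batch["actions"], batch["c_high"], batch["c_low"]
    ):
        total += (
            policy.log_prob(a, s, c_high, c_low)   # both abstraction levels
            + policy.log_prob(a, s, c_high, NULL)  # high-level instruction only
            + policy.log_prob(a, s, NULL, c_low)   # low-level instruction only
        )
    return -total / len(batch["states"])
```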
Few-Shot Adaptation
To solve out-of-distribution tasks, PALO decomposes them into in-distribution subtasks and searches for the optimal subtask decomposition $c^\star_{1:K}$ and the optimal subtask partition $u^\star_{1:K}$ by random sampling.
$c$ and $u$ are optimized jointly to minimize the cost function over a handful of expert demonstrations in $\mathcal{D}_{\mathrm{target}}$:
$$
\min_{c \in \mathcal{M}(s_0, \ell)} \mathbb{E}_{\tau \sim \pi_\beta}\left[\min_{u \in \mathcal{U}} J(c, u, \tau)\right] \approx \min_{c \in c^{(1:M)}} \sum_{\tau \in \mathcal{D}_{\mathrm{target}}} \min_{u \in u^{(1:N)}} \left[-\sum_{k=1}^{K} \sum_{t \in u_k} \log \pi_\theta(a_t \mid s_t, c_k)\right]
$$
The optimal subtask decomposition $c^\star$ and partition $u^\star$ are then deployed for execution.
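A minimal sketch of this few-shot search, assuming hypothetical interfaces `vlm_propose_subtasks` (used to sample the candidate decompositions $c^{(1:M)}$ from the initial state) and `policy.log_prob`; the cost $J(c, u, \tau)$ is taken to be the negative log-likelihood of the demonstration actions, and partitions are sampled uniformly at random:

```python
# Minimal sketch of the few-shot search over decompositions c^(1:M) and partitions u^(1:N).
import math
import random


def sample_partition(horizon: int, num_segments: int):
    """Sample an ordered partition u_{1:K} of {0, ..., horizon} into K segments."""
    cuts = sorted(random.sample(range(1, horizon), num_segments - 1))
    bounds = [0] + cuts + [horizon]
    return list(zip(bounds[:-1], bounds[1:]))


def cost(policy, demo, subtasks, partition):
    """J(c, u, tau): negative log-likelihood of the demonstration actions."""
    nll = 0.0
    for (start, end), c_k in zip(partition, subtasks):
        for t in range(start, end):
            nll -= policy.log_prob(demo["actions"][t], demo["states"][t], c_k)
    return nll


def palo_search(policy, vlm_propose_subtasks, instruction, demos, M=16, N=32):
    """Pick the decomposition c* (with per-demo partitions u*) minimizing the summed cost."""
    s0 = demos[0]["states"][0]
    candidates = [vlm_propose_subtasks(instruction, s0) for _ in range(M)]  # c^(1:M)
    best_c, best_u, best_cost = None, None, math.inf
    for c in candidates:                           # outer min over c
        total, partitions = 0.0, []
        for demo in demos:                         # sum over tau in D_target
            horizon = len(demo["actions"])
            samples = [sample_partition(horizon, len(c)) for _ in range(N)]  # u^(1:N)
            costs = [cost(policy, demo, c, u) for u in samples]
            i = min(range(N), key=costs.__getitem__)  # inner min over u
            total += costs[i]
            partitions.append(samples[i])
        if total < best_cost:
            best_c, best_u, best_cost = c, partitions, total
    return best_c, best_u
```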