PALO

Instruction Decomposition

In a contextual MDP, the policy $\pi(a_t \mid s_t,\ \ell)$ is conditioned on a free-form language instruction $\ell \in \mathcal{L}$, where $\ell$ can be hierarchically decomposed into ordered subtasks $c_{1:K} \in \mathcal{L}^{K} \sim p_{\mathcal{M}}(c_{1:K} \mid \ell,\ s_0)$ through a VLM $\mathcal{M}$.

The full task can then be accomplished by solving these subtasks step by step, i.e. conditioning the policy on each subtask $\pi(a_t \mid s_t,\ c_k)$ within the corresponding segment $u_k$, where $u_{1:K}$ is an ordered partition of $\{0,\ 1,\ \cdots,\ H\}$.
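
As an illustration, here is a minimal Python sketch of the decomposition step, assuming a hypothetical `query_vlm` helper that sends a prompt together with the initial observation to the VLM; the prompt wording is invented and differs from the one PALO actually uses:

```python
def decompose_instruction(instruction, s0_image, query_vlm):
    """Sample ordered subtasks c_{1:K} ~ p_M(c_{1:K} | ell, s_0)."""
    prompt = (
        "You control a robot arm. Given the attached image of the initial "
        f"scene, break the task '{instruction}' into a short ordered list "
        "of subtasks, one per line."
    )
    reply = query_vlm(prompt=prompt, image=s0_image)
    # each non-empty line of the reply becomes one subtask c_k
    return [line.strip() for line in reply.splitlines() if line.strip()]
```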

BC Pretraining

An instruction-conditioned policy is pretrained via BC on expert trajectories generated by an expert policy $\pi_{\beta}$, which are further augmented with additional subtask labels $c_{1:K}$ and $u_{1:K}$ to form a dataset $\mathcal{D}_{\mathrm{prior}}$.

$$
\begin{aligned}
\mathcal{L}_{\mathrm{BC}}(\theta) &= -\mathbb{E}_{\ell \sim \rho_{\mathrm{prior}}}\, \mathbb{E}_{(s_t,\ a_t) \sim \pi_{\beta}} \left[ \sum_{t=1}^{H} \log \pi_{\theta}(a_t \mid s_t,\ \ell) \right] \approx -\mathbb{E}_{(s_t,\ a_t,\ c,\ u,\ \ell) \sim \mathcal{D}_{\mathrm{prior}}} \left[ \sum_{k=1}^{K} \sum_{t \in u_k} \log \pi_{\theta}(a_t \mid s_t,\ c_k) \right] \\
&\Rightarrow -\mathbb{E}_{(s_t,\ a_t,\ c,\ u,\ \ell) \sim \mathcal{D}_{\mathrm{prior}}} \left[ \sum_{k=1}^{K} \sum_{t \in u_k} \log \pi_{\theta}(a_t \mid s_t,\ c_k^{H},\ c_k^{L}) + \log \pi_{\theta}(a_t \mid s_t,\ c_k^{H},\ \boldsymbol{0}) + \log \pi_{\theta}(a_t \mid s_t,\ \boldsymbol{0},\ c_k^{L}) \right]
\end{aligned}
$$

where each subtask instruction $c_k$ is decomposed into a high-level part $c_k^{H}$ and a low-level part $c_k^{L}$. The final form of the objective encourages the policy to learn to follow instructions at both abstraction levels.
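
In code, the three likelihood terms amount to evaluating the policy with one of the two instruction parts masked to $\boldsymbol{0}$. Below is a minimal PyTorch sketch, assuming a hypothetical `policy(s, c_high, c_low)` interface that returns an action distribution; the batch field names are illustrative:

```python
import torch

def bc_loss(policy, batch):
    """Sketch of the masked BC objective. Assumes `policy(s, c_high, c_low)`
    returns a torch distribution over actions, and that the batch carries
    states, actions, and the high-/low-level parts of each step's subtask."""
    s, a = batch["states"], batch["actions"]
    c_high, c_low = batch["c_high"], batch["c_low"]
    zero_h = torch.zeros_like(c_high)  # stands in for the 0 conditioning
    zero_l = torch.zeros_like(c_low)

    # Sum the three log-likelihood terms: both parts, high-level only,
    # low-level only, so either abstraction level alone suffices at test time.
    logp_full = policy(s, c_high, c_low).log_prob(a)
    logp_high = policy(s, c_high, zero_l).log_prob(a)
    logp_low = policy(s, zero_h, c_low).log_prob(a)
    return -(logp_full + logp_high + logp_low).mean()
```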

Few-Shot Adaptation

To solve out-of-distribution tasks, PALO decomposes them into in-distribution subtasks and searches for an optimal subtask decomposition $c_{1:K}^{\star}$ and an optimal subtask partition $u_{1:K}^{\star}$ via random sampling.

Here $c$ and $u$ are optimized jointly to minimize the cost function over a handful of expert demonstrations in $\mathcal{D}_{\mathrm{target}}$:

$$
\min_{c \in \mathcal{M}(s_0,\ \ell)} \mathbb{E}_{\tau \sim \pi_{\beta}} \min_{u \in \mathcal{U}} J(c,\ u,\ \tau) \approx \min_{c \in c^{(1:M)}} \sum_{\tau \in \mathcal{D}_{\mathrm{target}}} \min_{u \in u^{(1:N)}} \left[ -\sum_{k=1}^{K} \sum_{t \in u_k} \log \pi_{\theta}(a_t \mid s_t,\ c_k) \right]
$$

The optimal subtask decomposition $c^{\star}$ and partition $u^{\star}$ can then be deployed for execution.
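
The search itself is nested random sampling over $M$ candidate decompositions and $N$ candidate partitions. A minimal sketch, assuming hypothetical `decompose` (the VLM sampler) and `nll` (the policy's negative log-likelihood of a demo under a given decomposition and partition) interfaces:

```python
import random

def sample_partition(H, K):
    """One random ordered partition u_{1:K} of {0, ..., H-1} into K
    contiguous segments, via K-1 uniformly sampled cut points."""
    cuts = sorted(random.sample(range(1, H), K - 1))
    bounds = [0, *cuts, H]
    return [range(bounds[k], bounds[k + 1]) for k in range(K)]

def palo_adapt(nll, decompose, demos, ell, M=32, N=128):
    """Few-shot search sketch: keep the (c, u) pair minimizing total NLL.
    `decompose(ell, s0)` and `nll(tau, c, u)` are assumed interfaces."""
    best_cost, best_c, best_u = float("inf"), None, None
    for _ in range(M):                      # candidate decompositions c^(1:M)
        c = decompose(ell, demos[0]["s0"])  # sample c ~ p_M(. | ell, s_0)
        total, parts = 0.0, []
        for tau in demos:                   # a handful of target demos
            # inner search over N random partitions u^(1:N) per demo
            u_star, cost = min(
                ((u, nll(tau, c, u))
                 for u in (sample_partition(tau["H"], len(c))
                           for _ in range(N))),
                key=lambda pair: pair[1],
            )
            total += cost
            parts.append(u_star)
        if total < best_cost:
            best_cost, best_c, best_u = total, c, parts
    return best_c, best_u
```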

