PALO
Instruction Decomposition
In a contextual MDP, the policy $\pi(a_t \mid s_t, \ell)$ is conditioned on a free-form language instruction $\ell \in \mathcal{L}$, where $\ell$ can be hierarchically decomposed into ordered subtasks $c_{1:K} \in \mathcal{L}^K \sim p_{\mathcal{M}}(c_{1:K} \mid \ell, s_0)$ through a VLM $\mathcal{M}$.
The full task can then be accomplished by solving these subtasks step by step, i.e., conditioning the policy on each subtask $\pi(a_t \mid s_t, c_k)$ within the corresponding segment $u_k$, where $u_{1:K}$ is an ordered partition of $\{0, 1, \cdots, H\}$. A minimal execution sketch follows below.
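A minimal sketch of this execution scheme, assuming hypothetical interfaces: `env`, `policy`, and `vlm_propose_subtasks` stand in for the environment, $\pi(a_t \mid s_t, c_k)$, and the VLM decomposer $p_{\mathcal{M}}(c_{1:K} \mid \ell, s_0)$; none of these names come from the original method.

```python
# Minimal sketch (illustrative names): executing a long-horizon task by
# conditioning the policy on the active subtask c_k inside its segment u_k.
from typing import Callable, List, Sequence, Tuple


def subtask_index(t: int, partition: Sequence[Tuple[int, int]]) -> int:
    """Return k such that the segment u_k = [start, end) contains timestep t."""
    for k, (start, end) in enumerate(partition):
        if start <= t < end:
            return k
    raise ValueError(f"timestep {t} is not covered by the partition")


def rollout_with_decomposition(
    env,
    policy: Callable,                      # pi(a_t | s_t, c_k)
    vlm_propose_subtasks: Callable,        # p_M(c_{1:K} | l, s_0), assumed interface
    instruction: str,                      # free-form instruction l
    partition: Sequence[Tuple[int, int]],  # ordered partition u_{1:K} of {0, ..., H}
):
    """Solve the full task by following each subtask within its own segment."""
    state = env.reset()
    subtasks: List[str] = vlm_propose_subtasks(instruction, state)  # c_{1:K}
    horizon = partition[-1][1]             # H
    for t in range(horizon):
        k = subtask_index(t, partition)
        action = policy(state, subtasks[k])  # condition only on the active subtask
        state = env.step(action)
    return state
```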
BC Pretraining
An instruction-conditioned policy is pretrained via BC on expert trajectories generated by an expert policy $\pi_\beta$, which are further augmented with subtask labels $c_{1:K}$ and segment labels $u_{1:K}$ to form a dataset $\mathcal{D}_{\mathrm{prior}}$.
$$
\begin{aligned}
\mathcal{L}_{\mathrm{BC}}(\theta) &= -\mathbb{E}_{\ell \sim \rho_{\mathrm{prior}}}\,\mathbb{E}_{(s_t, a_t) \sim \pi_\beta}\left[\sum_{t=1}^{H} \log \pi_\theta(a_t \mid s_t, \ell)\right] \\
&\approx -\mathbb{E}_{(s_t, a_t, c, u, \ell) \sim \mathcal{D}_{\mathrm{prior}}}\left[\sum_{k=1}^{K} \sum_{t \in u_k} \log \pi_\theta(a_t \mid s_t, c_k)\right] \\
&\Rightarrow -\mathbb{E}_{(s_t, a_t, c, u, \ell) \sim \mathcal{D}_{\mathrm{prior}}}\left[\sum_{k=1}^{K} \sum_{t \in u_k} \Big( \log \pi_\theta(a_t \mid s_t, c_k^H, c_k^L) + \log \pi_\theta(a_t \mid s_t, c_k^H, 0) + \log \pi_\theta(a_t \mid s_t, 0, c_k^L) \Big)\right]
\end{aligned}
$$
where each subtask instruction $c_k$ is decomposed into a high-level part $c_k^H$ and a low-level part $c_k^L$, and $0$ denotes a dropped (null) instruction component. The final form of the objective encourages the policy to learn to follow instructions at both abstraction levels.
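A minimal sketch of this augmented objective, assuming a hypothetical policy object exposing `log_prob(action, state, high_instr, low_instr)` that accepts an empty string for a dropped component (the names and batch layout are illustrative, not from the original implementation):

```python
# Minimal sketch of the augmented BC loss with high-/low-level instruction dropout.
NULL = ""  # stands in for the dropped (null) instruction component in the objective


def bc_loss(policy, batch):
    """Average negative log-likelihood summed over the three conditioning variants.

    `batch` is assumed to hold aligned lists: states, actions, and the
    high-/low-level parts (c_k^H, c_k^L) of the subtask active at each step.
    """
    total = 0.0
    for s, a, c_high, c_low in zip(
        batch["states"], batch["actions"], batch["c_high"], batch["c_low"]
    ):
        total += (
            policy.log_prob(a, s, c_high, c_low)   # both abstraction levels
            + policy.log_prob(a, s, c_high, NULL)  # high-level instruction only
            + policy.log_prob(a, s, NULL, c_low)   # low-level instruction only
        )
    return -total / len(batch["states"])
```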
Few-Shot Adaptation
To solve out-of-distribution tasks, PALO decomposes them into in-distribution subtasks and searches for the optimal subtask decomposition $c^\star_{1:K}$ and the optimal subtask partition $u^\star_{1:K}$ by random sampling.
$c$ and $u$ are optimized jointly to minimize the cost function over a handful of expert demonstrations in $\mathcal{D}_{\mathrm{target}}$:
$$
\min_{c \in \mathcal{M}(s_0, \ell)} \mathbb{E}_{\tau \sim \pi_\beta}\left[\min_{u \in \mathcal{U}} J(c, u, \tau)\right] \approx \min_{c \in c^{(1:M)}} \sum_{\tau \in \mathcal{D}_{\mathrm{target}}} \min_{u \in u^{(1:N)}} \left[-\sum_{k=1}^{K} \sum_{t \in u_k} \log \pi_\theta(a_t \mid s_t, c_k)\right]
$$
The optimal subtask decomposition $c^\star$ and partition $u^\star$ are then deployed for execution.
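A minimal sketch of this few-shot search, assuming hypothetical interfaces `vlm_propose_subtasks` (used to sample the candidate decompositions $c^{(1:M)}$ from the initial state) and `policy.log_prob`; the cost $J(c, u, \tau)$ is taken to be the negative log-likelihood of the demonstration actions, and partitions are sampled uniformly at random:

```python
# Minimal sketch of the few-shot search over decompositions c^(1:M) and partitions u^(1:N).
import math
import random


def sample_partition(horizon: int, num_segments: int):
    """Sample an ordered partition u_{1:K} of {0, ..., horizon} into K segments."""
    cuts = sorted(random.sample(range(1, horizon), num_segments - 1))
    bounds = [0] + cuts + [horizon]
    return list(zip(bounds[:-1], bounds[1:]))


def cost(policy, demo, subtasks, partition):
    """J(c, u, tau): negative log-likelihood of the demonstration actions."""
    nll = 0.0
    for (start, end), c_k in zip(partition, subtasks):
        for t in range(start, end):
            nll -= policy.log_prob(demo["actions"][t], demo["states"][t], c_k)
    return nll


def palo_search(policy, vlm_propose_subtasks, instruction, demos, M=16, N=32):
    """Pick the decomposition c* (with per-demo partitions u*) minimizing the summed cost."""
    s0 = demos[0]["states"][0]
    candidates = [vlm_propose_subtasks(instruction, s0) for _ in range(M)]  # c^(1:M)
    best_c, best_u, best_cost = None, None, math.inf
    for c in candidates:                           # outer min over c
        total, partitions = 0.0, []
        for demo in demos:                         # sum over tau in D_target
            horizon = len(demo["actions"])
            samples = [sample_partition(horizon, len(c)) for _ in range(N)]  # u^(1:N)
            costs = [cost(policy, demo, c, u) for u in samples]
            i = min(range(N), key=costs.__getitem__)  # inner min over u
            total += costs[i]
            partitions.append(samples[i])
        if total < best_cost:
            best_c, best_u, best_cost = c, partitions, total
    return best_c, best_u
```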