COMBO
World Model Decomposition
COMBO formulates the multi-agent world model as a compositional video diffusion model conditioned on the joint action $a = (a_1, a_2, \cdots, a_n)$:
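$$
p(x \mid a) = \frac{1}{p(a)}\, p(x)\, p(a_1, a_2, \cdots, a_n \mid x) \propto p(x) \prod_{i=1}^{n} p(a_i \mid x) \propto p(x) \prod_{i=1}^{n} \frac{p(a_i \mid x)}{p(a_i)} = p(x) \prod_{i=1}^{n} \frac{p(x \mid a_i)}{p(x)}
$$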
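The factorization can be checked numerically on a small discrete example. The sketch below assumes a hypothetical toy model (three world states, two agents with binary actions that are conditionally independent given the state); none of the numbers come from COMBO. It compares the posterior computed directly from the joint with the posterior composed from the single-agent posteriors.

```python
from itertools import product

# Hypothetical toy model: 3 world states x, two agents with binary actions.
p_x = [0.5, 0.3, 0.2]
# p_a_given_x[i][x][a]: probability that agent i takes action a in state x.
p_a_given_x = [
    [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]],
    [[0.7, 0.3], [0.5, 0.5], [0.1, 0.9]],
]

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def posterior_direct(actions):
    """p(x | a), proportional to p(x) * prod_i p(a_i | x)."""
    weights = []
    for x, px in enumerate(p_x):
        likelihood = 1.0
        for i, a in enumerate(actions):
            likelihood *= p_a_given_x[i][x][a]
        weights.append(px * likelihood)
    return normalize(weights)

def posterior_single(i, a):
    """p(x | a_i) for agent i alone."""
    return normalize([px * p_a_given_x[i][x][a] for x, px in enumerate(p_x)])

def posterior_composed(actions):
    """p(x) * prod_i [p(x | a_i) / p(x)], renormalized over x."""
    singles = [posterior_single(i, a) for i, a in enumerate(actions)]
    weights = []
    for x, px in enumerate(p_x):
        ratio = 1.0
        for single in singles:
            ratio *= single[x] / px
        weights.append(px * ratio)
    return normalize(weights)

# Largest disagreement between the two posteriors over all joint actions.
max_gap = max(
    abs(d - c)
    for actions in product([0, 1], repeat=2)
    for d, c in zip(posterior_direct(actions), posterior_composed(actions))
)
```

In this toy setting `max_gap` is zero up to floating-point error, which reflects that the ratio $p(x \mid a_i)/p(x)$ differs from $p(a_i \mid x)$ only by a factor that does not depend on $x$.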
Each factor is modeled as a diffusion model, whose sampling process is equivalent to that of an energy-based model:
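| Model | Definition | Sampling | Sampling Formulation |
| --- | --- | --- | --- |
| Diffusion | $p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$ | DDPM | $x_{t-1} \leftarrow x_t - \epsilon_\theta(x_t, t) + \mathcal{N}(0, \sigma_t^2 I)$ |
| Energy-Based | $p_\theta(x) \propto \exp(-\varepsilon_\theta(x_t))$ | Langevin | $x_{t-1} \leftarrow x_t - \lambda \nabla_x \varepsilon_\theta(x_t) + \mathcal{N}(0, \sigma_t^2 I)$ |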
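As a minimal sketch of the Langevin update, the code below assumes a hypothetical one-dimensional quadratic energy, so the target distribution is a Gaussian with known mean. The step size and the noise scale $\sigma = \sqrt{2\lambda}$ (the usual choice for unadjusted Langevin dynamics) are assumptions, not values from COMBO.

```python
import math
import random

MU, S = 2.0, 1.0             # hypothetical target: Gaussian with mean MU and std S
STEP = 0.1                   # step size lambda (assumed value)
SIGMA = math.sqrt(2 * STEP)  # assumed noise scale for unadjusted Langevin dynamics

def energy_grad(x):
    """Gradient of the toy energy E(x) = (x - MU)^2 / (2 * S^2)."""
    return (x - MU) / S**2

def langevin_sample(rng, n_steps=200, x_init=0.0):
    """Iterate x <- x - lambda * grad E(x) + N(0, sigma^2)."""
    x = x_init
    for _ in range(n_steps):
        x = x - STEP * energy_grad(x) + rng.gauss(0.0, SIGMA)
    return x

rng = random.Random(0)
samples = [langevin_sample(rng) for _ in range(2000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
```

The chains start far from the mode at `x_init = 0.0` and drift toward low-energy regions; `sample_mean` should land close to `MU`, and `sample_var` close to (slightly above) `S**2` because of the finite step size.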
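The composed energy-based model is obtained by summing the energy functions of its factors (here $\varepsilon_\theta$ denotes an energy function, as distinct from the noise prediction $\epsilon_\theta$):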
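$$
p_\theta(x_1, x_2, \cdots, x_n) = \prod_{i=1}^{n} p_\theta(x_i) \propto \exp\left[-\sum_{i=1}^{n} \varepsilon_\theta^{i}(x_i)\right] = \exp\left[-\varepsilon_\theta(x_1, x_2, \cdots, x_n)\right]
$$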
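The identity between a product of factors and a sum of energies can be verified on a small discrete case. The sketch below uses hypothetical energy values over two discrete variables and checks that normalizing $\exp$ of the negated summed energy reproduces the product of the per-factor distributions.

```python
import math
from itertools import product

# Hypothetical energies over two small discrete variables x1 and x2.
energies = [
    [0.0, 1.5, 0.3],  # E^1(x1) for x1 in {0, 1, 2}
    [2.0, 0.1],       # E^2(x2) for x2 in {0, 1}
]

def boltzmann(energy_values):
    """Normalized distribution with p(x) proportional to exp(-E(x))."""
    weights = [math.exp(-e) for e in energy_values]
    total = sum(weights)
    return [w / total for w in weights]

# Per-factor distributions p(x_i).
factors = [boltzmann(e) for e in energies]

# Joint distribution built from the summed energy E(x1, x2) = E^1(x1) + E^2(x2).
states = list(product(range(3), range(2)))
summed_energy = [energies[0][a] + energies[1][b] for a, b in states]
joint = dict(zip(states, boltzmann(summed_energy)))

# Largest difference between the joint and the product of factors.
factorization_gap = max(
    abs(joint[(a, b)] - factors[0][a] * factors[1][b]) for a, b in states
)
```

`factorization_gap` is zero up to rounding, since the normalizer of the joint is the product of the per-factor normalizers.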
Similarly, the joint score function of the world model can be decomposed by accumulating the score functions of the individual factors:
$$
\hat{\epsilon}(x_t, t \mid a) = \epsilon_\theta(x_t, t) + \sum_{i=1}^{n} \left[\epsilon_\theta(x_t, t \mid a_i) - \epsilon_\theta(x_t, t)\right]
$$
At inference time, the conditional terms of the score function are scaled by a temperature coefficient $\omega$:
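$$
\hat{\epsilon}(x_t, t \mid a) = \epsilon_\theta(x_t, t) + \omega \sum_{i=1}^{n} \left[\epsilon_\theta(x_t, t \mid a_i) - \epsilon_\theta(x_t, t)\right]
$$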
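The composition rule is a simple elementwise operation on the noise predictions. The sketch below is illustrative only: the function name `compose_noise` and the three-pixel "frame" are hypothetical, and plain lists stand in for the tensors a real diffusion model would output.

```python
def compose_noise(eps_uncond, eps_cond_list, omega=1.0):
    """Return eps_uncond + omega * sum_i (eps_cond_i - eps_uncond), elementwise."""
    composed = list(eps_uncond)
    for eps_cond in eps_cond_list:
        for k, (cond, uncond) in enumerate(zip(eps_cond, eps_uncond)):
            composed[k] += omega * (cond - uncond)
    return composed

# Hypothetical noise predictions for a 3-pixel "frame" and two agents.
eps_uncond = [0.0, 0.25, -0.5]
eps_agent1 = [0.5, 0.25, -0.5]  # agent 1's action only affects pixel 0
eps_agent2 = [0.0, 0.25, 0.5]   # agent 2's action only affects pixel 2

unscaled = compose_noise(eps_uncond, [eps_agent1, eps_agent2])           # omega = 1
scaled = compose_noise(eps_uncond, [eps_agent1, eps_agent2], omega=2.0)  # omega = 2
```

With $\omega = 1$ the result is `[0.5, 0.25, 0.5]`, which combines each agent's deviation from the unconditional prediction; with $\omega = 2$ those deviations are doubled, giving `[1.0, 0.25, 1.5]`, while the pixel that no agent affects is unchanged.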
COMBO trains this diffusion model in two stages, one for the individual components and one for the composed model:
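$$
\mathcal{L}_{\text{Individual}} = \sum_{i=1}^{n} C_i \left\| \epsilon_\theta(x_t, t \mid a_i) - \epsilon \right\|^2, \qquad
\mathcal{L}_{\text{Composed}} = \left\| \frac{1}{n} \sum_{i=1}^{n} \epsilon_\theta(x_t, t \mid a_i) - \epsilon \right\|^2
$$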
where $C_i$ is an agent-dependent loss scaling term, such as a mask over agent $i$'s reachable region in the image.
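The two objectives can be sketched on toy data. The code below makes several assumptions that are not stated in the source: $C_i$ is treated as a binary per-pixel mask applied to the squared error, frames are flattened to four-element lists, and the noise predictions are fixed numbers rather than network outputs.

```python
def individual_loss(eps_cond_list, noise, masks):
    """Sum over agents of the masked squared error between eps_i and the true noise."""
    total = 0.0
    for eps_cond, mask in zip(eps_cond_list, masks):
        total += sum(m * (e - z) ** 2 for m, e, z in zip(mask, eps_cond, noise))
    return total

def composed_loss(eps_cond_list, noise):
    """Squared error between the agent-averaged prediction and the true noise."""
    n = len(eps_cond_list)
    mean_pred = [sum(column) / n for column in zip(*eps_cond_list)]
    return sum((p - z) ** 2 for p, z in zip(mean_pred, noise))

# Hypothetical 4-pixel example with two agents.
noise = [1.0, -1.0, 1.0, 0.0]
eps_preds = [
    [1.0, -1.0, 0.0, 0.0],  # agent 1's conditional prediction
    [0.0, -1.0, 1.0, 0.0],  # agent 2's conditional prediction
]
masks = [
    [1, 1, 0, 0],  # assumed reachable region of agent 1
    [0, 0, 1, 1],  # assumed reachable region of agent 2
]

loss_individual = individual_loss(eps_preds, noise, masks)
loss_composed = composed_loss(eps_preds, noise)
```

In this example each prediction is exact inside its own mask, so `loss_individual` is `0.0`, while the averaged prediction still misses the true noise on two pixels and `loss_composed` is `0.5`. The two objectives therefore measure different things even on identical predictions.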
Planning with World Model
When planning with the world model, COMBO first estimates the global state from partial egocentric observations.
The estimated global state serves as the initial frame. COMBO then uses a vision-language model (VLM) to propose possible actions for an agent and to track the other agents' intents. The simulated futures are evaluated by the VLM, and the outcomes are used for tree-search planning.
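The planning loop can be sketched as a depth-limited tree search. Everything below is a hypothetical stand-in: `propose`, `infer_others`, `simulate`, and `evaluate` are placeholder callables for the VLM action proposer, the VLM intent tracker, the world model, and the VLM evaluator, and the exhaustive search is an assumed strategy rather than COMBO's exact procedure.

```python
def plan(state, propose, infer_others, simulate, evaluate, depth):
    """Return (best_score, best_actions) from an exhaustive depth-limited search."""
    if depth == 0:
        return evaluate(state), []
    best_score, best_actions = float("-inf"), []
    for action in propose(state):
        # Combine the ego agent's candidate action with the inferred actions of others.
        joint_action = (action, *infer_others(state))
        next_state = simulate(state, joint_action)
        score, tail = plan(next_state, propose, infer_others, simulate, evaluate, depth - 1)
        if score > best_score:
            best_score, best_actions = score, [action] + tail
    return best_score, best_actions

# Toy stand-ins: the state is an integer position and the goal is position 5.
GOAL = 5

def propose(state):
    return [-1, 0, 1]  # candidate ego actions (placeholder for the VLM proposer)

def infer_others(state):
    return (1,)  # the other agent is assumed to always move +1

def simulate(state, joint_action):
    return state + sum(joint_action)  # placeholder for the world model rollout

def evaluate(state):
    return -abs(GOAL - state)  # placeholder for the VLM-based outcome score

best_score, best_actions = plan(0, propose, infer_others, simulate, evaluate, depth=3)

# Replay the chosen plan to confirm where it ends up.
final_state = 0
for action in best_actions:
    final_state = simulate(final_state, (action, *infer_others(final_state)))
```

In the toy problem the search finds a three-step plan that reaches the goal, so `best_score` is `0` and `final_state` equals `GOAL`. In a real system each `simulate` call would be a diffusion rollout, so the branching factor and depth would need to be kept small.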