COMBO

World Model Decomposition

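COMBO formulates the multi-agent world model as a compositional video diffusion model conditioned on the joint action $a = (a_{1},\ a_{2},\ \cdots,\ a_{n})$ of the $n$ agents. Assuming the agents' actions are conditionally independent given the generated video $x$, Bayes' rule factorizes the conditional distribution as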

$$
p(x \mid a) = \frac{1}{p(a)} p(x) p(a_{1},\ a_{2},\ \cdots,\ a_{n} \mid x) \propto p(x) \prod_{i = 1}^{n} p(a_{i} \mid x) \propto p(x) \prod_{i = 1}^{n} \frac{p(a_{i} \mid x)}{p(a_{i})} = p(x) \prod_{i = 1}^{n} \frac{p(x \mid a_{i})}{p(x)}
$$

Each factor is modeled by a diffusion model, and the diffusion sampling process is equivalent to sampling from an energy-based model.
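The factorization above can be sanity-checked on a tiny discrete example. The sketch below is only an illustration: the three-outcome world, the two binary-action agents, and every probability value are made-up assumptions, and actions are taken to be conditionally independent given $x$. It compares the exact posterior $p(x \mid a_{1}, a_{2})$ with the renormalized composition $p(x) \prod_{i} p(x \mid a_{i}) / p(x)$.

```python
from itertools import product

# Toy discrete world (all numbers are illustrative assumptions).
p_x = [0.5, 0.3, 0.2]                  # prior p(x) over 3 outcomes
p_a1_given_x = [[0.9, 0.2, 0.5],       # p(a_1 = 1 | x) for agent 1
                [0.1, 0.7, 0.6]]       # p(a_2 = 1 | x) for agent 2


def lik(i, a_i, x):
    """p(a_i | x) for agent i with a binary action a_i."""
    p1 = p_a1_given_x[i][x]
    return p1 if a_i == 1 else 1.0 - p1


def normalize(weights):
    z = sum(weights)
    return [w / z for w in weights]


def posterior_joint(a):
    """Exact p(x | a_1, a_2), using conditional independence of actions given x."""
    return normalize([p_x[x] * lik(0, a[0], x) * lik(1, a[1], x) for x in range(3)])


def posterior_single(i, a_i):
    """p(x | a_i) for a single agent."""
    return normalize([p_x[x] * lik(i, a_i, x) for x in range(3)])


def posterior_composed(a):
    """p(x) * prod_i p(x | a_i) / p(x), renormalized over x."""
    singles = [posterior_single(i, a[i]) for i in range(2)]
    return normalize([p_x[x] * (singles[0][x] / p_x[x]) * (singles[1][x] / p_x[x])
                      for x in range(3)])


# Largest absolute gap between the exact and composed posteriors over all joint actions.
max_gap = max(
    abs(e - c)
    for a in product([0, 1], repeat=2)
    for e, c in zip(posterior_joint(a), posterior_composed(a))
)
print("max gap:", max_gap)
```

The two posteriors agree up to floating-point error only because the independence assumption holds exactly in this toy; with correlated actions the composition becomes an approximation.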

| Model | Definition | Sampling | Sampling Formulation |
| --- | --- | --- | --- |
| Diffusion | $p_{\theta}(x_{0:\mathrm{T}}) = p(x_{\mathrm{T}}) \prod_{t = \mathrm{T}}^{1} p_{\theta}(x_{t - 1} \mid x_{t})$ | DDPM | $x_{t - 1} \leftarrow x_{t} - \epsilon_{\theta}(x_{t},\ t) + \mathcal{N}(0,\ \sigma_{t}^{2} \boldsymbol{I})$ |
| Energy-Based | $p_{\theta}(x) \propto \exp(-\varepsilon_{\theta}(x_{t}))$ | Langevin | $x_{t - 1} \leftarrow x_{t} - \lambda \nabla_{x} \varepsilon_{\theta}(x_{t}) + \mathcal{N}(0,\ \sigma_{t}^{2} \boldsymbol{I})$ |

A product of energy-based models is itself an energy-based model whose energy is the sum of the factors' energies

$$
p_{\theta}(x_{1},\ x_{2},\ \cdots,\ x_{n}) = \prod_{i = 1}^{n} p_{\theta}(x_{i}) \propto \exp \left[ -\sum_{i = 1}^{n} \varepsilon_{\theta}^{i}(x_{i}) \right] = \exp \Big[ -\varepsilon_{\theta}(x_{1},\ x_{2},\ \cdots,\ x_{n}) \Big]
$$
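As a rough illustration of composing energies by summation, the sketch below runs Langevin sampling on the sum of two one-dimensional quadratic energies, where both factors act on the same variable (as in the world-model case). The Gaussian factors, the step size `lam`, and the noise scale $\sigma^{2} = 2\lambda$ are assumptions chosen for the toy and do not come from COMBO.

```python
import math
import random


def grad_gaussian_energy(x, mu, sigma):
    """Gradient of the energy E(x) = (x - mu)^2 / (2 * sigma^2)."""
    return (x - mu) / sigma ** 2


def grad_composed(x, factors):
    """Gradient of the summed energy; `factors` is a list of (mu, sigma) pairs."""
    return sum(grad_gaussian_energy(x, mu, sigma) for mu, sigma in factors)


def langevin_sample(factors, steps=20000, burn_in=2000, lam=0.01, seed=0):
    """Unadjusted Langevin dynamics: x <- x - lam * grad E(x) + N(0, 2 * lam)."""
    rng = random.Random(seed)
    noise_std = math.sqrt(2.0 * lam)
    x, samples = 0.0, []
    for step in range(steps):
        x = x - lam * grad_composed(x, factors) + noise_std * rng.gauss(0.0, 1.0)
        if step >= burn_in:
            samples.append(x)
    return samples


# Two illustrative Gaussian factors over the same variable x.
factors = [(-1.0, 1.0), (2.0, 0.5)]
samples = langevin_sample(factors)
sample_mean = sum(samples) / len(samples)

# Closed-form mean of the product of the two Gaussians, for comparison.
precisions = [1.0 / sigma ** 2 for _, sigma in factors]
product_mean = sum(mu * p for (mu, _), p in zip(factors, precisions)) / sum(precisions)
print(f"Langevin mean {sample_mean:.2f}, closed-form mean {product_mean:.2f}")
```

The sampler never sees the product distribution directly; it only follows the gradient of the summed energy, which is the property the composed diffusion model relies on.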

Similarly, the joint score function of the world model can be decomposed by accumulating the score functions of the individual factors

$$
\hat{\epsilon}(x_{t},\ t \mid a) = \epsilon_{\theta}(x_{t},\ t) + \sum_{i = 1}^{n} \Big[ \epsilon_{\theta}(x_{t},\ t \mid a_{i}) - \epsilon_{\theta}(x_{t},\ t) \Big]
$$

At inference time, the conditional terms of the score function are scaled by a temperature coefficient $\omega$

$$
\hat{\epsilon}(x_{t},\ t \mid a) = \epsilon_{\theta}(x_{t},\ t) + \omega \sum_{i = 1}^{n} \Big[ \epsilon_{\theta}(x_{t},\ t \mid a_{i}) - \epsilon_{\theta}(x_{t},\ t) \Big]
$$
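The composed prediction above amounts to one unconditional call plus one conditional call per agent. A minimal sketch follows, assuming a hypothetical `eps_model(x_t, t, a_i)` callable in which `a_i=None` requests the unconditional prediction. This signature and the toy model are illustrative only and are not COMBO's actual interface.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
# Hypothetical signature: eps_model(x_t, t, a_i) -> predicted noise.
EpsModel = Callable[[Vector, int, Optional[int]], Vector]


def composed_eps(eps_model: EpsModel, x_t: Vector, t: int,
                 actions: Sequence[int], omega: float = 1.0) -> Vector:
    """eps_hat = eps(x_t, t) + omega * sum_i [eps(x_t, t | a_i) - eps(x_t, t)]."""
    eps_uncond = eps_model(x_t, t, None)
    eps_hat = list(eps_uncond)
    for a_i in actions:
        eps_cond = eps_model(x_t, t, a_i)
        for k in range(len(eps_hat)):
            eps_hat[k] += omega * (eps_cond[k] - eps_uncond[k])
    return eps_hat


def toy_eps_model(x_t: Vector, t: int, a_i: Optional[int]) -> Vector:
    """Stand-in for a trained network: each action shifts the prediction by a constant."""
    shift = 0.0 if a_i is None else 0.1 * a_i
    return [0.5 * v + shift for v in x_t]


eps_hat = composed_eps(toy_eps_model, [1.0, -2.0], t=10, actions=[1, 3], omega=2.0)
print(eps_hat)
```

Setting `omega=0.0` recovers the unconditional prediction, and `omega=1.0` recovers the unscaled decomposition.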

COMBO trains this diffusion model in two stages, first for the individual components and then for the composed model

$$
\mathcal{L}_{\text{Individual}} = \sum_{i = 1}^{n} C_{i} \| \epsilon_{\theta}(x_{t},\ t \mid a_{i}) - \epsilon \|^{2} \qquad \mathcal{L}_{\text{Composed}} = \left\| \frac{1}{n} \sum_{i = 1}^{n} \epsilon_{\theta}(x_{t},\ t \mid a_{i}) - \epsilon \right\|^{2}
$$

where $C_{i}$ is an agent-dependent loss scaling term, such as a mask over the agent's reachable region in the image
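The two losses can be written out on plain lists to make the role of $C_{i}$ concrete. This is a sketch under one assumption: $C_{i}$ is treated as a per-element weight (a 0/1 mask in the example), so that $C_{i} \| \cdot \|^{2}$ is read as a weighted sum of squared errors. A scalar $C_{i}$ is the special case where all weights are equal. The function names and numbers are illustrative.

```python
from typing import List, Sequence

Vector = Sequence[float]


def individual_loss(eps_preds: Sequence[Vector], eps: Vector,
                    weights: Sequence[Vector]) -> float:
    """sum_i sum_k C_i[k] * (eps_theta(x_t, t | a_i)[k] - eps[k])^2."""
    total = 0.0
    for pred, c_i in zip(eps_preds, weights):
        total += sum(c * (p - e) ** 2 for c, p, e in zip(c_i, pred, eps))
    return total


def composed_loss(eps_preds: Sequence[Vector], eps: Vector) -> float:
    """|| (1/n) * sum_i eps_theta(x_t, t | a_i) - eps ||^2."""
    n = len(eps_preds)
    mean_pred: List[float] = [sum(col) / n for col in zip(*eps_preds)]
    return sum((m - e) ** 2 for m, e in zip(mean_pred, eps))


# Two agents, three "pixels"; each mask marks the region its agent can reach.
eps_true = [0.0, 0.0, 2.0]
eps_preds = [[1.0, 0.0, 2.0],    # prediction conditioned on a_1
             [0.0, 1.0, 2.0]]    # prediction conditioned on a_2
masks = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 1.0]]

loss_individual = individual_loss(eps_preds, eps_true, masks)
loss_composed = composed_loss(eps_preds, eps_true)
print(loss_individual, loss_composed)
```

In the first stage each agent's error is only counted where its mask is nonzero, while the second stage penalizes the averaged prediction over the whole frame.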

Planning with World Model

When planning with the world model, COMBO first estimates the global state from partial egocentric observations

The estimated global state serves as the initial frame. COMBO then uses a vision-language model (VLM) to propose candidate actions for the agent and to infer the other agents' intents. The world model simulates the resulting futures, the VLM evaluates them, and the outcomes guide tree-search planning
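The planning loop described above can be sketched as a depth-limited beam search. Everything in this block is a hypothetical stand-in: `propose`, `predict_others`, `simulate`, and `evaluate` replace the VLM and world-model calls with toy integer arithmetic, and beam search is just one simple form of tree search, not necessarily the procedure COMBO uses.

```python
from typing import Any, Callable, List, Optional, Tuple


def plan(state: Any,
         propose: Callable[[Any], List[Any]],
         predict_others: Callable[[Any], Any],
         simulate: Callable[[Any, Any, Any], Any],
         evaluate: Callable[[Any], float],
         depth: int = 3,
         beam: int = 2) -> Tuple[float, Optional[Any]]:
    """Return (best score found, first action of the branch that reached it)."""
    frontier = [(state, None)]            # (state, first action on this branch)
    best: Tuple[float, Optional[Any]] = (float("-inf"), None)
    for _ in range(depth):
        children = []
        for s, first in frontier:
            others = predict_others(s)             # inferred actions of other agents
            for a in propose(s):                   # candidate actions for this agent
                nxt = simulate(s, a, others)       # world-model rollout of one step
                root_action = a if first is None else first
                children.append((evaluate(nxt), nxt, root_action))
        # Keep only the `beam` highest-scoring futures for further expansion.
        children.sort(key=lambda c: c[0], reverse=True)
        children = children[:beam]
        if children and children[0][0] > best[0]:
            best = (children[0][0], children[0][2])
        frontier = [(s, first) for _, s, first in children]
    return best


# Toy stand-ins: the state is an integer position and the goal is position 5.
best_score, first_action = plan(
    state=0,
    propose=lambda s: [1, 2],                  # "VLM" proposes two step sizes
    predict_others=lambda s: -1,               # another agent pushes back by 1
    simulate=lambda s, a, others: s + a + others,
    evaluate=lambda s: -abs(s - 5),            # closer to the goal scores higher
)
print(best_score, first_action)
```

Only the first action of the best branch would be executed before re-planning from a fresh observation, which is a common design choice for this kind of search rather than something stated in the text above.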


Author: 木辛
Posted on: September 25, 2024
Link: http://example.com/2024/09/25/COMBO/