COMBO

World Model Decomposition

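COMBO formulates the multi-agent world model as a compositional video diffusion model conditioned on the joint action $a = (a_{1},\ a_{2},\ \cdots,\ a_{n})$. Assuming the agents’ actions are conditionally independent given the video $x$, the conditional distribution factorizes as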

p(x \mid a) = \frac{1}{p(a)} p(x) p(a_{1},\ a_{2},\ \cdots,\ a_{n} \mid x) \propto p(x) \prod_{i = 1}^{n} p(a_{i} \mid x) \propto p(x) \prod_{i = 1}^{n} \frac{p(a_{i} \mid x)}{p(a_{i})} = p(x) \prod_{i = 1}^{n} \frac{p(x \mid a_{i})}{p(x)}

where each factor is modeled as a diffusion model, whose sampling process is equivalent to sampling from an energy-based model

Model	Definition	Sampling	Sampling Formulation
Diffusion	$p_{\theta}(x_{0:\mathrm{T}}) = p(x_{\mathrm{T}}) \prod_{t = \mathrm{T}}^{1} p_{\theta}(x_{t - 1} \mid x_{t})$	DDPM	$x_{t - 1} \leftarrow x_{t} - \epsilon_{\theta}(x_{t},\ t) + \mathcal{N}(0,\ \sigma_{t}^{2} \boldsymbol{I})$
Energy-Based	$p_{\theta}(x) \propto \exp(-\varepsilon_{\theta}(x))$	Langevin	$x_{t - 1} \leftarrow x_{t} - \lambda \nabla_{x} \varepsilon_{\theta}(x_{t}) + \mathcal{N}(0,\ \sigma_{t}^{2} \boldsymbol{I})$

The composed energy-based model can be represented by summing the energy factors

p_{\theta}(x_{1},\ x_{2},\ \cdots,\ x_{n}) = \prod_{i = 1}^{n} p_{\theta}(x_{i}) \propto \exp \left[ -\sum_{i = 1}^{n} \varepsilon_{\theta}^{i}(x_{i}) \right] = \exp \Big[ -\varepsilon_{\theta}(x_{1},\ x_{2},\ \cdots,\ x_{n}) \Big]
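As a rough illustration of the Langevin update and the sum-of-energies composition, the sketch below samples from two toy quadratic energies that share a single scalar variable. The names `langevin_step`, `compose_gradients`, `quadratic_grad`, and `sample` are hypothetical, and the 1D setting is an assumption made only to keep the example small. It is not COMBO's implementation.

```python
import random


def langevin_step(x, grad_energy, step_size, noise_std, rng):
    """One update: x <- x - lambda * grad E(x) + N(0, sigma^2)."""
    return x - step_size * grad_energy(x) + rng.gauss(0.0, noise_std)


def compose_gradients(grad_energies):
    """Summing energies means summing their gradients."""
    def total_grad(x):
        return sum(g(x) for g in grad_energies)
    return total_grad


def quadratic_grad(mu):
    """Gradient of the toy energy E(x) = (x - mu)^2 / 2 (an assumed example)."""
    return lambda x: x - mu


def sample(grad_energy, x0, steps, step_size, noise_std, seed=0):
    """Run a fixed number of Langevin updates from x0."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        x = langevin_step(x, grad_energy, step_size, noise_std, rng)
    return x


# Composing energies centred at -1 and 3: with the noise switched off,
# the iterate settles at the minimum of the summed energy, x = 1.
composed = compose_gradients([quadratic_grad(-1.0), quadratic_grad(3.0)])
x_final = sample(composed, x0=0.0, steps=200, step_size=0.1, noise_std=0.0)
```

With `noise_std` greater than zero, the same loop draws noisy samples around that minimum instead of converging to it.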

Similarly, the joint score function of the world model can be decomposed by accumulating the score functions of the individual factors

\hat{\epsilon}(x_{t},\ t \mid a) = \epsilon_{\theta}(x_{t},\ t) + \sum_{i = 1}^{n} \Big[ \epsilon_{\theta}(x_{t},\ t \mid a_{i}) - \epsilon_{\theta}(x_{t},\ t) \Big]

At inference time, the conditional terms of the score function are scaled by a temperature coefficient $\omega$

\hat{\epsilon}(x_{t},\ t \mid a) = \epsilon_{\theta}(x_{t},\ t) + \omega \sum_{i = 1}^{n} \Big[ \epsilon_{\theta}(x_{t},\ t \mid a_{i}) - \epsilon_{\theta}(x_{t},\ t) \Big]
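The scaled composition above can be sketched in a few lines. Here the noise predictions are plain Python lists standing in for video tensors, and `compose_scores` is a hypothetical helper name. The per-agent predictions would come from the trained model, which is not shown.

```python
def compose_scores(eps_uncond, eps_conds, omega=1.0):
    """Return eps_uncond + omega * sum_i (eps_cond_i - eps_uncond), elementwise.

    eps_uncond: unconditional noise prediction eps(x_t, t), as a flat list.
    eps_conds:  list of per-agent predictions eps(x_t, t | a_i), same length each.
    omega:      scaling coefficient applied to the conditional terms.
    """
    composed = list(eps_uncond)
    for eps_cond in eps_conds:
        for k, (cond, uncond) in enumerate(zip(eps_cond, eps_uncond)):
            composed[k] += omega * (cond - uncond)
    return composed


# Two agents, two "pixels" (toy values chosen for illustration only).
eps_hat = compose_scores([0.0, 1.0], [[1.0, 1.0], [0.0, 3.0]], omega=2.0)
```

Setting `omega=1.0` recovers the unscaled decomposition, and an empty `eps_conds` list returns the unconditional prediction unchanged.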

COMBO trains this diffusion model in two stages, one for the individual components and one for the composed model

\mathcal{L}_{\text{Individual}} = \sum_{i = 1}^{n} C_{i} \| \epsilon_{\theta}(x_{t},\ t \mid a_{i}) - \epsilon \|^{2} \qquad \mathcal{L}_{\text{Composed}} = \left\| \frac{1}{n} \sum_{i = 1}^{n} \epsilon_{\theta}(x_{t},\ t \mid a_{i}) - \epsilon \right\|^{2}

where $C_{i}$ is an agent-dependent loss scaling, such as a mask over the agent’s reachable region in the image
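A minimal sketch of the two losses, assuming $C_{i}$ is applied as a per-element mask inside the squared error (one possible reading of the mask example) and that predictions are flat lists. The function names are hypothetical.

```python
def individual_loss(eps_preds, masks, eps_true):
    """Sum over agents of the masked squared error against the true noise.

    eps_preds: per-agent predictions eps(x_t, t | a_i), each a flat list.
    masks:     per-agent weights C_i, here assumed to be elementwise masks.
    eps_true:  the noise that was actually added, as a flat list.
    """
    total = 0.0
    for pred, mask in zip(eps_preds, masks):
        total += sum(m * (p - e) ** 2 for p, m, e in zip(pred, mask, eps_true))
    return total


def composed_loss(eps_preds, eps_true):
    """Squared error between the agent-averaged prediction and the true noise."""
    n = len(eps_preds)
    mean_pred = [sum(values) / n for values in zip(*eps_preds)]
    return sum((p - e) ** 2 for p, e in zip(mean_pred, eps_true))


# Toy values: two agents, two "pixels"; agent 0 can only reach pixel 0.
eps_true = [1.0, 0.0]
eps_preds = [[1.0, 1.0], [3.0, 0.0]]
masks = [[1.0, 0.0], [1.0, 1.0]]
loss_individual = individual_loss(eps_preds, masks, eps_true)
loss_composed = composed_loss(eps_preds, eps_true)
```

In this toy case agent 0's error on pixel 1 is masked out, so only agent 1's error on pixel 0 contributes to the individual loss.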

Planning with World Model

When planning with the world model, COMBO first estimates the global state from partial egocentric observations

The estimated global state serves as the initial frame. COMBO then uses a VLM to propose candidate actions for the agent and to track the other agents’ intents. The simulated futures are evaluated by the VLM, and the outcomes are used for tree-search planning
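The planning loop can be pictured roughly as follows. This is a plain depth-limited search that scores only leaf states, which is an assumption. The text above does not specify the search procedure. The four callables (`propose_actions`, `infer_others`, `simulate`, `evaluate`) are hypothetical stand-ins for the VLM proposals, intent tracking, world-model rollout, and VLM evaluation.

```python
def plan(state, propose_actions, infer_others, simulate, evaluate, depth=2):
    """Return (best_score, best_first_action) from a depth-limited tree search."""
    if depth == 0:
        return evaluate(state), None
    best_score, best_action = float("-inf"), None
    for action in propose_actions(state):
        # Combine the ego action with the inferred actions of the other agents.
        joint_action = (action, *infer_others(state))
        next_state = simulate(state, joint_action)
        score, _ = plan(next_state, propose_actions, infer_others,
                        simulate, evaluate, depth - 1)
        if score > best_score:
            best_score, best_action = score, action
    return best_score, best_action


# Toy stand-ins: the state is an integer and the goal is to reach 3.
best = plan(
    0,
    propose_actions=lambda s: [-1, 1],   # candidate ego actions
    infer_others=lambda s: (0,),         # one other agent that stays put
    simulate=lambda s, ja: s + sum(ja),  # trivial "world model"
    evaluate=lambda s: -abs(3 - s),      # closer to 3 is better
    depth=2,
)
```

Here `best` holds the score of the best reachable leaf and the first ego action leading to it. In a real setup each callable would wrap a model call rather than a lambda.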


Author

木辛

Posted on

September 25, 2024
