Tesseract
Model-Based + DP
Due to the joint action space growing exponentially with the number of agents, common representations of the joint action-value function suffer from a corresponding blowup of the hypothesis space, resulting in huge sample and computational complexity:
$$Q(s, \mathbf{u}) = Q(s, u_{1:n}) \in \mathcal{Q}(S, U, P, R, \gamma, n) \subseteq \mathbb{R}^{|S|\,|U|^n}$$
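For intuition on the size of this blowup, a quick back-of-the-envelope count; the sizes below are illustrative, not from the paper:

```python
# Illustrative sizes (not from the paper): |S| states, |U| actions per agent, n agents.
S, U, n, k = 100, 10, 5, 8

full_table = S * U ** n           # joint Q-table: |S| * |U|^n = 10,000,000 entries
cp_params = S * k * (1 + n * U)   # rank-k CP per state: k weights + k*n*|U| factor entries = 40,800

print(f"full joint Q-table: {full_table:,}")
print(f"rank-{k} CP factors: {cp_params:,}")
```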
Q(s) can be viewed as an order-n tensor and approximated by a rank-k CP tensor decomposition:
$$Q(s) \approx \sum_{r=1}^{k} w_r(s) \bigotimes_{i=1}^{n} u_r^i(s) = \sum_{r=1}^{k} w_r(s) \left( u_r^1(s) \otimes u_r^2(s) \otimes \cdots \otimes u_r^n(s) \right), \qquad u_r^i(s) \in \mathbb{R}^{|U|},\ \left\lVert u_r^i(s) \right\rVert_2 = 1$$
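A minimal NumPy sketch of assembling this rank-k CP form from its factors (random factors and hypothetical sizes, purely for illustration):

```python
import numpy as np

def cp_tensor(w, factors):
    """Assemble Q(s) = sum_r w_r(s) * u_r^1(s) ⊗ ... ⊗ u_r^n(s) from CP factors.

    w       : (k,)         per-rank weights w_r(s)
    factors : (k, n, |U|)  unit-norm factor vectors u_r^i(s)
    returns : order-n tensor of shape (|U|, ..., |U|)
    """
    k, n, U = factors.shape
    Q = np.zeros((U,) * n)
    for r in range(k):
        outer = factors[r, 0]
        for i in range(1, n):
            outer = np.multiply.outer(outer, factors[r, i])  # build u_r^1 ⊗ ... ⊗ u_r^n
        Q += w[r] * outer
    return Q

k, n, U = 4, 3, 5                                            # hypothetical sizes
w = np.random.randn(k)
factors = np.random.randn(k, n, U)
factors /= np.linalg.norm(factors, axis=-1, keepdims=True)   # enforce ||u_r^i(s)||_2 = 1
Q_s = cp_tensor(w, factors)
print(Q_s.shape)                                             # (5, 5, 5)
```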
Such a low-rank representation reduces the hypothesis space and trades off expressibility against learnability through the decomposition rank k, which is more flexible than previous value-based methods such as VDN and QMIX.
The Bellman expectation operator for n agents can be viewed as sum and product manipulations on tensors:
$$\mathcal{T}^{\pi} Q = R(s, \mathbf{u}) + \gamma \sum_{s'} P(s' \mid s, \mathbf{u}) \sum_{\mathbf{u}'} \pi(\mathbf{u}' \mid s')\, Q(s', \mathbf{u}')$$
where the reward and dynamics functions are also represented by CP decompositions and estimated from historical episodes.
The model-based Tesseract algorithm uses DP with these estimated tensors to perform policy evaluation and subsequent policy improvement, as sketched below.
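A minimal tabular sketch of this procedure, assuming the estimated reward, dynamics, and policy are materialised as dense arrays over flattened joint actions (in Tesseract proper these would be CP-factored; the flattening here is only for illustration):

```python
import numpy as np

def bellman_expectation(Q, R, P, pi, gamma=0.99):
    """One application of T^pi Q as sum/product tensor manipulations.

    Q, R, pi : (S, A) with A = |U|^n flattened joint actions; P : (S, A, S).
    """
    V = np.einsum('sa,sa->s', pi, Q)                 # E_{u' ~ pi(.|s')}[Q(s', u')]
    return R + gamma * np.einsum('sat,t->sa', P, V)  # R(s,u) + gamma * sum_{s'} P(s'|s,u) V(s')

def policy_iteration(R, P, gamma=0.99, eval_sweeps=50, improve_steps=20):
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)                    # start from the uniform joint policy
    for _ in range(improve_steps):
        Q = np.zeros((S, A))
        for _ in range(eval_sweeps):                 # policy evaluation by repeated backups
            Q = bellman_expectation(Q, R, P, pi, gamma)
        pi = np.eye(A)[Q.argmax(axis=1)]             # greedy policy improvement
    return Q, pi
```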
Model-Free + TD
With centralised training, the CP-decomposed joint action-value function can be parameterised as
$$Q_\phi(s, \mathbf{u}) = \left\langle Q_\phi(s),\ \bigotimes_{i=1}^{n} \text{one-hot}(u_i) \right\rangle = \left\langle \sum_{r=1}^{k} w_\phi^r(s) \bigotimes_{i=1}^{n} g_\phi^r(s),\ \bigotimes_{i=1}^{n} \text{one-hot}(u_i) \right\rangle$$
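Since the tensor product of one-hot vectors is itself a one-hot tensor, this inner product simply reads out the entry Q_phi(s)[u_1, ..., u_n]; a small check with an arbitrary order-3 tensor (hypothetical sizes):

```python
import numpy as np

U, n = 5, 3                                 # hypothetical sizes, n = 3 agents
Q_s = np.random.randn(U, U, U)              # stands in for the assembled tensor Q_phi(s)

u = (2, 0, 4)                               # a joint action u = (u_1, u_2, u_3)
onehots = [np.eye(U)[ui] for ui in u]
inner = np.einsum('abc,a,b,c->', Q_s, *onehots)   # <Q_phi(s), ⊗_i one-hot(u_i)>

assert np.isclose(inner, Q_s[u])            # identical to indexing Q_phi(s)[u_1, u_2, u_3]
```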
More expressibility can be added by using abstract representations of actions in a continuous space $U \subseteq \mathbb{R}^d$:
$$Q_{\phi, \eta}(s, \mathbf{u}) = \left\langle \sum_{r=1}^{k} w_\phi^r(s) \bigotimes_{i=1}^{n} g_\phi^r(s),\ \bigotimes_{i=1}^{n} f_\eta(u_i) \right\rangle, \qquad g_\phi^r : S \mapsto \mathbb{R}^m, \quad f_\eta : U \mapsto \mathbb{R}^m$$
which can be further simplified as
$$Q_{\phi, \eta}(s, \mathbf{u}) = \sum_{r=1}^{k} w_\phi^r(s) \prod_{i=1}^{n} \left\langle g_\phi^r(s),\ f_\eta(u_i) \right\rangle$$
Such a representation can serve as the critic in any actor-critic or value-based method under the CTDE setting, for example via a critic like the sketch below.
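A minimal PyTorch sketch of such a critic, implementing the simplified form above; the layer sizes and the MLP action encoder f_eta are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class FactoredCritic(nn.Module):
    """Q_{phi,eta}(s, u) = sum_r w_phi^r(s) * prod_i <g_phi^r(s), f_eta(u_i)>  (sketch)."""

    def __init__(self, state_dim, action_dim, rank=4, embed_dim=32):
        super().__init__()
        self.rank, self.embed_dim = rank, embed_dim
        self.w = nn.Linear(state_dim, rank)                       # w_phi^r(s), r = 1..k
        self.g = nn.Linear(state_dim, rank * embed_dim)           # g_phi^r(s) in R^m
        self.f = nn.Sequential(nn.Linear(action_dim, embed_dim),  # f_eta : U -> R^m
                               nn.ReLU(),
                               nn.Linear(embed_dim, embed_dim))

    def forward(self, s, u):
        # s: (batch, state_dim), u: (batch, n_agents, action_dim)
        batch = s.shape[0]
        w = self.w(s)                                             # (batch, k)
        g = self.g(s).view(batch, self.rank, self.embed_dim)      # (batch, k, m)
        f = self.f(u)                                             # (batch, n, m)
        dots = torch.einsum('bkm,bnm->bkn', g, f)                 # <g_phi^r(s), f_eta(u_i)>
        return (w * dots.prod(dim=-1)).sum(dim=-1)                # sum_r w_r(s) prod_i <.,.>

# e.g. q = FactoredCritic(state_dim=16, action_dim=6)(s, u), for any number of agents n
```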
PAC Analysis