Tesseract

Model-Based + DP

Because the joint action space grows exponentially with the number of agents, common representations of the joint action-value function also suffer from a blow-up of the hypothesis space, resulting in large sample and computational complexity:

$$Q(s,\ u) = Q(s,\ u^{1:n}) \in \mathcal{Q}(\mathcal{S},\ \mathcal{U},\ \mathcal{P},\ \mathcal{R},\ \gamma,\ n) \subseteq \mathbb{R}^{|\mathcal{S}| \cdot |\mathcal{U}|^{n}}$$

Meanwhile, $Q(s)$ can be viewed as an order-$n$ tensor, which can be approximated by a rank-$k$ CP tensor decomposition:

$$Q(s) \approx \sum_{r = 1}^{k} w_{r}(s) \otimes_{i = 1}^{n} u_{r}^{i}(s) = \sum_{r = 1}^{k} w_{r}(s) \Big( u_{r}^{1}(s) \otimes u_{r}^{2}(s) \otimes \cdots \otimes u_{r}^{n}(s) \Big) \qquad u_{r}^{i}(s) \in \mathbb{R}^{|\mathcal{U}|},\ \left\| u_{r}^{i}(s) \right\|_{2} = 1$$

Such a low-rank representation reduces the hypothesis space and strikes a balance between expressiveness and learnability by adjusting the decomposition rank $k$, which is more flexible than previous value-based methods such as VDN and QMIX.
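
As a minimal sketch (not the paper's implementation), the NumPy snippet below reconstructs the order-$n$ tensor $Q(s)$ for a fixed state from rank-$k$ CP factors and compares parameter counts; the array names and toy sizes are assumptions made for illustration.

```python
import numpy as np

n_agents, n_actions, rank = 3, 5, 4   # n, |U|, k (toy sizes)

# CP factors for one fixed state s: k weights and k unit vectors per agent.
w = np.random.randn(rank)                                    # w_r(s)
factors = np.random.randn(rank, n_agents, n_actions)         # u_r^i(s)
factors /= np.linalg.norm(factors, axis=-1, keepdims=True)   # ||u_r^i(s)||_2 = 1

# Reconstruct Q(s) = sum_r w_r(s) * u_r^1(s) x u_r^2(s) x ... x u_r^n(s).
Q_s = np.zeros((n_actions,) * n_agents)
for r in range(rank):
    outer = factors[r, 0]
    for i in range(1, n_agents):
        outer = np.multiply.outer(outer, factors[r, i])
    Q_s += w[r] * outer

# Low-rank storage grows as k(n|U| + 1) per state instead of |U|^n.
print(Q_s.shape)                                                 # (5, 5, 5)
print(rank * (n_agents * n_actions + 1), n_actions ** n_agents)  # 64 vs 125
```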

The Bellman expectation operator for $n$ agents can be viewed as sum and product manipulations on tensors:

$$\mathcal{T}^{\pi} Q = \mathcal{R}(s,\ u) + \gamma \sum_{s'} \mathcal{P}(s' \mid s,\ u) \sum_{u'} \pi(u' \mid s') Q(s',\ u')$$

where the reward and dynamics functions are also represented by CP decompositions and estimated from historical episodes.

The model-based Tesseract algorithm uses dynamic programming to perform policy evaluation and subsequently improve the policy.
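
For intuition, here is a tabular NumPy sketch of the Bellman expectation operator above, written as tensor sums and products over the joint action space. The paper's model-based evaluation operates on CP factors of $\mathcal{R}$ and $\mathcal{P}$ rather than the dense tensors used here; all names and sizes below are illustrative assumptions.

```python
import numpy as np

n_agents, n_actions, n_states, gamma = 2, 3, 4, 0.9
joint = (n_actions,) * n_agents

# Toy tabular model and a factorized policy pi(u|s) = prod_i pi_i(u^i|s).
R = np.random.randn(n_states, *joint)                      # R(s, u)
P = np.random.rand(n_states, *joint, n_states)
P /= P.sum(axis=-1, keepdims=True)                         # P(s'|s, u)
pi_i = np.random.rand(n_agents, n_states, n_actions)
pi_i /= pi_i.sum(axis=-1, keepdims=True)                   # pi_i(u^i|s)

def bellman_expectation(Q):
    """T^pi Q = R(s,u) + gamma * sum_{s'} P(s'|s,u) sum_{u'} pi(u'|s') Q(s',u')."""
    pi_joint = np.einsum('sa,sb->sab', pi_i[0], pi_i[1])   # joint policy, n = 2
    V = np.einsum('sab,sab->s', pi_joint, Q)               # V(s') under pi
    return R + gamma * np.einsum('...t,t->...', P, V)

# Policy evaluation: iterate the operator towards its fixed point.
Q = np.zeros((n_states, *joint))
for _ in range(200):
    Q = bellman_expectation(Q)
```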

Model-Free + TD

Under the centralized training paradigm, the CP-decomposed joint action-value function can be parameterized as

$$Q_{\phi}(s,\ u) = \left\langle Q_{\phi}(s),\ \otimes_{i = 1}^{n} \operatorname{one-hot}(u^{i}) \right\rangle = \left\langle \sum_{r = 1}^{k} w_{\phi}^{r}(s) \otimes_{i = 1}^{n} g_{\phi}^{r}(s),\ \otimes_{i = 1}^{n} \operatorname{one-hot}(u^{i}) \right\rangle$$
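
To see what the inner product with one-hot actions does, here is a short NumPy check (names are illustrative): contracting the CP-form tensor with $\otimes_i \operatorname{one-hot}(u^i)$ simply indexes the factor vectors, i.e. $Q_\phi(s, u) = \sum_r w_\phi^r(s) \prod_i g_\phi^r(s)[u^i]$.

```python
import numpy as np

n_agents, n_actions, rank = 3, 5, 4
w = np.random.randn(rank)                 # w_phi^r(s) for a fixed state s
g = np.random.randn(rank, n_actions)      # g_phi^r(s), shared across agents here
u = np.array([2, 0, 4])                   # joint action (u^1, ..., u^n)

def one_hot(a, dim):
    e = np.zeros(dim)
    e[a] = 1.0
    return e

# <CP tensor, one-hot(u^1) x ... x one-hot(u^n)> computed factor by factor ...
q_inner = sum(w[r] * np.prod([g[r] @ one_hot(a, n_actions) for a in u])
              for r in range(rank))
# ... equals plain indexing of the factor vectors:
q_indexed = sum(w[r] * np.prod(g[r, u]) for r in range(rank))
assert np.isclose(q_inner, q_indexed)
```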

More expressiveness can be obtained by using an abstract representation of actions in a continuous space $\mathcal{U} \subseteq \mathbb{R}^{d}$:

$$Q_{\phi,\ \eta}(s,\ u) = \left\langle \sum_{r = 1}^{k} w_{\phi}^{r}(s) \otimes_{i = 1}^{n} g_{\phi}^{r}(s),\ \otimes_{i = 1}^{n} f_{\eta}(u^{i}) \right\rangle \qquad g_{\phi}^{r} : \mathcal{S} \mapsto \mathbb{R}^{m}, \quad f_{\eta} : \mathcal{U} \mapsto \mathbb{R}^{m}$$

which can be further simplified as

$$Q_{\phi,\ \eta}(s,\ u) = \sum_{r = 1}^{k} w_{\phi}^{r}(s) \prod_{i = 1}^{n} \left\langle g_{\phi}^{r}(s),\ f_{\eta}(u^{i}) \right\rangle$$

Such a representation can be applied in any actor-critic or value-based method under the CTDE setting.
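
A minimal PyTorch sketch of such a factored centralized critic, assuming small MLPs for $w_\phi$, $g_\phi$ and the action encoder $f_\eta$ (module names, network sizes and dimensions are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class TesseractStyleCritic(nn.Module):
    """Q(s, u) = sum_r w_r(s) * prod_i <g_r(s), f(u^i)>  (rank-k factored critic)."""
    def __init__(self, state_dim, action_dim, n_agents, rank=4, m=16):
        super().__init__()
        self.rank, self.m, self.n_agents = rank, m, n_agents
        self.w = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, rank))
        self.g = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, rank * m))
        self.f = nn.Sequential(nn.Linear(action_dim, 64), nn.ReLU(), nn.Linear(64, m))

    def forward(self, state, actions):
        # state: (B, state_dim), actions: (B, n_agents, action_dim)
        w = self.w(state)                                    # (B, k) state-dependent weights
        g = self.g(state).view(-1, self.rank, self.m)        # (B, k, m) state factors
        f = self.f(actions)                                  # (B, n, m) action embeddings
        scores = torch.einsum('bkm,bnm->bkn', g, f)          # <g_r(s), f(u^i)>
        return (w * scores.prod(dim=-1)).sum(dim=-1)         # (B,)

# Usage with toy shapes (batch of 8, 3 agents, 4-dimensional action embeddings).
critic = TesseractStyleCritic(state_dim=10, action_dim=4, n_agents=3)
q = critic(torch.randn(8, 10), torch.randn(8, 3, 4))
print(q.shape)  # torch.Size([8])
```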

PAC Analysis

