Tesseract
Model-Based + DP
Due to the joint action space growing exponentially with the number of agents, common representations of the joint action-value function suffer from a corresponding blowup of the hypothesis space, resulting in huge sample and computational complexity:
$$Q(s, \mathbf{u}) = Q(s, u_{1:n}) \in \mathcal{Q}(S, U, P, R, \gamma, n) \subseteq \mathbb{R}^{|S|\,|U|^n}$$
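For intuition on the size of this blowup, a quick back-of-the-envelope count; the sizes below are illustrative, not from the paper:

```python
# Illustrative sizes (not from the paper): |S| states, |U| actions per agent, n agents.
S, U, n, k = 100, 10, 5, 8

full_table = S * U ** n           # joint Q-table: |S| * |U|^n = 10,000,000 entries
cp_params = S * k * (1 + n * U)   # rank-k CP per state: k weights + k*n*|U| factor entries = 40,800

print(f"full joint Q-table: {full_table:,}")
print(f"rank-{k} CP factors: {cp_params:,}")
```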
Q(s) can be viewed as an order-n tensor and approximated by a rank-k CP tensor decomposition:
$$Q(s) \approx \sum_{r=1}^{k} w_r(s) \bigotimes_{i=1}^{n} u_r^i(s) = \sum_{r=1}^{k} w_r(s) \left( u_r^1(s) \otimes u_r^2(s) \otimes \cdots \otimes u_r^n(s) \right), \qquad u_r^i(s) \in \mathbb{R}^{|U|},\ \left\lVert u_r^i(s) \right\rVert_2 = 1$$
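A minimal NumPy sketch of assembling this rank-k CP form from its factors (random factors and hypothetical sizes, purely for illustration):

```python
import numpy as np

def cp_tensor(w, factors):
    """Assemble Q(s) = sum_r w_r(s) * u_r^1(s) ⊗ ... ⊗ u_r^n(s) from CP factors.

    w       : (k,)         per-rank weights w_r(s)
    factors : (k, n, |U|)  unit-norm factor vectors u_r^i(s)
    returns : order-n tensor of shape (|U|, ..., |U|)
    """
    k, n, U = factors.shape
    Q = np.zeros((U,) * n)
    for r in range(k):
        outer = factors[r, 0]
        for i in range(1, n):
            outer = np.multiply.outer(outer, factors[r, i])  # build u_r^1 ⊗ ... ⊗ u_r^n
        Q += w[r] * outer
    return Q

k, n, U = 4, 3, 5                                            # hypothetical sizes
w = np.random.randn(k)
factors = np.random.randn(k, n, U)
factors /= np.linalg.norm(factors, axis=-1, keepdims=True)   # enforce ||u_r^i(s)||_2 = 1
Q_s = cp_tensor(w, factors)
print(Q_s.shape)                                             # (5, 5, 5)
```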
Such a low-rank representation reduces the hypothesis space and trades off expressibility against learnability through the decomposition rank k, which is more flexible than previous value-based methods such as VDN and QMIX.
The Bellman expectation operator for n agents can be viewed as sum and product manipulations on tensors:
$$\mathcal{T}^{\pi} Q = R(s, \mathbf{u}) + \gamma \sum_{s'} P(s' \mid s, \mathbf{u}) \sum_{\mathbf{u}'} \pi(\mathbf{u}' \mid s')\, Q(s', \mathbf{u}')$$
where the reward and dynamics functions are also represented by CP decompositions and estimated from historical episodes.
The model-based Tesseract algorithm uses DP with these estimated tensors to perform policy evaluation and subsequent policy improvement, as sketched below.
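A minimal tabular sketch of this procedure, assuming the estimated reward, dynamics, and policy are materialised as dense arrays over flattened joint actions (in Tesseract proper these would be CP-factored; the flattening here is only for illustration):

```python
import numpy as np

def bellman_expectation(Q, R, P, pi, gamma=0.99):
    """One application of T^pi Q as sum/product tensor manipulations.

    Q, R, pi : (S, A) with A = |U|^n flattened joint actions; P : (S, A, S).
    """
    V = np.einsum('sa,sa->s', pi, Q)                 # E_{u' ~ pi(.|s')}[Q(s', u')]
    return R + gamma * np.einsum('sat,t->sa', P, V)  # R(s,u) + gamma * sum_{s'} P(s'|s,u) V(s')

def policy_iteration(R, P, gamma=0.99, eval_sweeps=50, improve_steps=20):
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)                    # start from the uniform joint policy
    for _ in range(improve_steps):
        Q = np.zeros((S, A))
        for _ in range(eval_sweeps):                 # policy evaluation by repeated backups
            Q = bellman_expectation(Q, R, P, pi, gamma)
        pi = np.eye(A)[Q.argmax(axis=1)]             # greedy policy improvement
    return Q, pi
```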
Model-Free + TD
With centralised training, the CP-decomposed joint action-value function can be parameterised as
$$Q_\phi(s, \mathbf{u}) = \left\langle Q_\phi(s),\ \bigotimes_{i=1}^{n} \text{one-hot}(u_i) \right\rangle = \left\langle \sum_{r=1}^{k} w_\phi^r(s) \bigotimes_{i=1}^{n} g_\phi^r(s),\ \bigotimes_{i=1}^{n} \text{one-hot}(u_i) \right\rangle$$
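Since the tensor product of one-hot vectors is itself a one-hot tensor, this inner product simply reads out the entry Q_phi(s)[u_1, ..., u_n]; a small check with an arbitrary order-3 tensor (hypothetical sizes):

```python
import numpy as np

U, n = 5, 3                                 # hypothetical sizes, n = 3 agents
Q_s = np.random.randn(U, U, U)              # stands in for the assembled tensor Q_phi(s)

u = (2, 0, 4)                               # a joint action u = (u_1, u_2, u_3)
onehots = [np.eye(U)[ui] for ui in u]
inner = np.einsum('abc,a,b,c->', Q_s, *onehots)   # <Q_phi(s), ⊗_i one-hot(u_i)>

assert np.isclose(inner, Q_s[u])            # identical to indexing Q_phi(s)[u_1, u_2, u_3]
```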
More expressibility can be added by using abstract representations of actions in a continuous space $U \subseteq \mathbb{R}^d$:
$$Q_{\phi, \eta}(s, \mathbf{u}) = \left\langle \sum_{r=1}^{k} w_\phi^r(s) \bigotimes_{i=1}^{n} g_\phi^r(s),\ \bigotimes_{i=1}^{n} f_\eta(u_i) \right\rangle, \qquad g_\phi^r : S \mapsto \mathbb{R}^m, \quad f_\eta : U \mapsto \mathbb{R}^m$$
which can be further simplified as
$$Q_{\phi, \eta}(s, \mathbf{u}) = \sum_{r=1}^{k} w_\phi^r(s) \prod_{i=1}^{n} \left\langle g_\phi^r(s),\ f_\eta(u_i) \right\rangle$$
Such a representation can serve as the critic in any actor-critic or value-based method under the CTDE setting, for example via a critic like the sketch below.
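A minimal PyTorch sketch of such a critic, implementing the simplified form above; the layer sizes and the MLP action encoder f_eta are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class FactoredCritic(nn.Module):
    """Q_{phi,eta}(s, u) = sum_r w_phi^r(s) * prod_i <g_phi^r(s), f_eta(u_i)>  (sketch)."""

    def __init__(self, state_dim, action_dim, rank=4, embed_dim=32):
        super().__init__()
        self.rank, self.embed_dim = rank, embed_dim
        self.w = nn.Linear(state_dim, rank)                       # w_phi^r(s), r = 1..k
        self.g = nn.Linear(state_dim, rank * embed_dim)           # g_phi^r(s) in R^m
        self.f = nn.Sequential(nn.Linear(action_dim, embed_dim),  # f_eta : U -> R^m
                               nn.ReLU(),
                               nn.Linear(embed_dim, embed_dim))

    def forward(self, s, u):
        # s: (batch, state_dim), u: (batch, n_agents, action_dim)
        batch = s.shape[0]
        w = self.w(s)                                             # (batch, k)
        g = self.g(s).view(batch, self.rank, self.embed_dim)      # (batch, k, m)
        f = self.f(u)                                             # (batch, n, m)
        dots = torch.einsum('bkm,bnm->bkn', g, f)                 # <g_phi^r(s), f_eta(u_i)>
        return (w * dots.prod(dim=-1)).sum(dim=-1)                # sum_r w_r(s) prod_i <.,.>

# e.g. q = FactoredCritic(state_dim=16, action_dim=6)(s, u), for any number of agents n
```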
PAC Analysis