ODIS
Latent Skill Discovery
To achieve effective coordination on unseen tasks, ODIS discovers latent skills shared within a task domain for high-level behavior learning.
The latent skills are modeled as latent variables underlying actions and can be learned with a β-VAE:
$$\max_{\phi_s, \theta_a} \; \mathbb{E}_{(s_t, \tau_t, a_t) \sim \mathcal{D}} \left[ \sum_{i=1}^{n} \mathbb{E}_{z_t^i \sim q(z_t^i \mid s_t, a_t; \phi_s)} \log p(a_t^i \mid \tau_t^i, z_t^i; \theta_a) - \beta D_{\mathrm{KL}}\!\left( q(z_t^i \mid s_t, a_t; \phi_s) \,\big\|\, \tilde{p}(z_t^i) \right) \right]$$
where $\mathcal{D}$ is a multi-task offline dataset, $z_t^i \in \mathcal{Z}$ is a discrete latent skill, $q(z_t^i \mid s_t, a_t; \phi_s)$ and $p(a_t^i \mid \tau_t^i, z_t^i; \theta_a)$ are the state encoder and action decoder respectively, and the prior distribution of latent skills $\tilde{p}(z_t^i)$ is assumed to be a uniform categorical distribution.
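A minimal PyTorch sketch of this skill-discovery objective is given below, assuming a discrete skill space with straight-through Gumbel-softmax sampling, MLP encoder/decoder, and per-agent conditioning on the global state and that agent's own action; all module names and dimensions are illustrative, not the authors' implementation.

```python
# Sketch of ODIS-style latent skill discovery (illustrative names/shapes only).
# The state encoder q(z|s, a; phi_s) outputs a categorical distribution over
# |Z| discrete skills per agent; the action decoder p(a|tau, z; theta_a)
# reconstructs each agent's action from its local trajectory embedding and skill.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

N_SKILLS = 4      # |Z|, number of discrete latent skills (assumption)
N_ACTIONS = 10    # size of the discrete action space (assumption)

class StateEncoder(nn.Module):
    """q(z_t^i | s_t, a_t; phi_s): per-agent skill logits from the global state."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + N_ACTIONS, hidden), nn.ReLU(),
            nn.Linear(hidden, N_SKILLS),
        )

    def forward(self, state, action_onehot):
        # state: [B, n_agents, state_dim] (global state repeated per agent)
        return self.net(torch.cat([state, action_onehot], dim=-1))

class ActionDecoder(nn.Module):
    """p(a_t^i | tau_t^i, z_t^i; theta_a): action logits from local history + skill."""
    def __init__(self, traj_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim + N_SKILLS, hidden), nn.ReLU(),
            nn.Linear(hidden, N_ACTIONS),
        )

    def forward(self, traj_emb, skill_onehot):
        return self.net(torch.cat([traj_emb, skill_onehot], dim=-1))

def skill_discovery_loss(encoder, decoder, state, traj_emb, action, beta=0.1):
    """Negative ELBO: action reconstruction + beta * KL(q(z|s,a) || uniform prior)."""
    action_onehot = F.one_hot(action, N_ACTIONS).float()     # [B, n_agents, |A|]
    logits = encoder(state, action_onehot)                   # [B, n_agents, |Z|]
    q_z = F.softmax(logits, dim=-1)
    # Differentiable discrete sample via straight-through Gumbel-softmax.
    z = F.gumbel_softmax(logits, tau=1.0, hard=True)
    action_logits = decoder(traj_emb, z)
    recon = F.cross_entropy(action_logits.flatten(0, 1), action.flatten())
    # KL to a uniform categorical prior reduces to log|Z| - H(q).
    kl = (q_z * (q_z.clamp_min(1e-8).log() + math.log(N_SKILLS))).sum(-1).mean()
    return recon + beta * kl
```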
Offline Policy Learning
Based on the discovered latent skills, ODIS performs multi-task offline Q-learning with a QMIX-style value network:
$$\mathcal{L}_{\mathrm{TD}}(\theta_v) = \mathbb{E}_{(s_t, \tau_t, a_t, r_t, s_{t+1}, \tau_{t+1}) \sim \mathcal{D}} \, \mathbb{E}_{z_t^i \sim q(z_t^i \mid s_t, a_t; \phi_s)} \left[ r_t + \gamma Q_{\mathrm{tot}}\!\left(\tau_{t+1}, z_{t+1}^{\star}; \theta_v^{-}, s_{t+1}\right) - Q_{\mathrm{tot}}\!\left(\tau_t, z_t; \theta_v, s_t\right) \right]^2$$

$$\text{s.t.} \quad Q_{\mathrm{tot}}(\tau_t, z_t; \theta_v, s_t) = f_{\mathrm{mix}}\!\left[ Q_i(\tau_t^i, z_t^i) \mid s_t, \theta_v \right], \qquad z_t^{\star} = \left\{ \operatorname*{arg\,max}_{z_t^i} Q_i(\tau_t^i, z_t^i; \theta_v) \right\}_{i=1}^{n}$$
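A sketch of this TD loss follows, assuming per-agent skill-conditioned Q-networks and a QMIX-style monotonic mixer; batch keys are illustrative and episode-termination masking is omitted for brevity.

```python
# Skill-based TD loss with a QMIX-style mixer (illustrative module/batch names).
import torch

def td_loss(q_net, target_q_net, mixer, target_mixer, batch, gamma=0.99):
    # Individual Q_i(tau_t^i, z) for every candidate skill z: [B, n_agents, |Z|]
    q_all = q_net(batch["traj_emb"])
    # Skills chosen during training come from the state encoder (see above).
    q_taken = q_all.gather(-1, batch["skill"].unsqueeze(-1)).squeeze(-1)  # [B, n_agents]
    q_tot = mixer(q_taken, batch["state"])                                # [B]

    with torch.no_grad():
        # z_{t+1}^* = argmax_z Q_i(tau_{t+1}^i, z): greedy skill selection.
        q_next = target_q_net(batch["next_traj_emb"]).max(dim=-1).values  # [B, n_agents]
        q_tot_next = target_mixer(q_next, batch["next_state"])            # [B]
        target = batch["reward"] + gamma * q_tot_next

    return ((target - q_tot) ** 2).mean()
```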
As the latent skills are discovered from the global state, it remains difficult to select appropriate skills from local observations alone. ODIS therefore introduces a consistency loss term to guide observation representation learning towards coordination:
$$\mathcal{L}_c(\phi_o) = \mathbb{E}_{(s_t, \tau_t, a_t) \sim \mathcal{D}} \left[ \sum_{i=1}^{n} D_{\mathrm{KL}}\!\left( \hat{q}(z_t^i \mid \tau_t^i) \,\big\|\, q(z_t^i \mid s_t, a_t; \phi_s) \right) \right]$$
where $\hat{q}(z_t^i \mid \tau_t^i)$ predicts skills from the observation representation through the last layer of the state encoder.
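A sketch of this consistency term, assuming both predictors output categorical logits and that the state-encoder side is treated as a fixed target (detached) so that only the observation encoder is updated:

```python
# Consistency loss: push the local predictor q_hat(z|tau) towards the
# state-based encoder q(z|s, a) via a KL divergence (illustrative names).
import torch.nn.functional as F

def consistency_loss(local_skill_logits, state_skill_logits):
    """KL( q_hat(z|tau) || q(z|s,a) ), averaged over the batch."""
    log_q_hat = F.log_softmax(local_skill_logits, dim=-1)
    log_q = F.log_softmax(state_skill_logits.detach(), dim=-1)
    # F.kl_div(input, target, log_target=True) computes KL(target || input)
    # when both arguments are given as log-probabilities.
    return F.kl_div(log_q, log_q_hat, log_target=True, reduction="batchmean")
```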
To tackle the out-of-distribution (OOD) issue in offline RL, a CQL (conservative Q-learning) regularizer is also adopted, and the overall loss function is
$$\mathcal{L}(\theta_v, \phi_o) = \mathcal{L}_{\mathrm{TD}}(\theta_v) + \alpha \mathcal{L}_{\mathrm{CQL}}(\theta_v) + \lambda \mathcal{L}_c(\phi_o)$$
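One way the combined objective could be assembled is sketched below; the conservative penalty is written in the standard logsumexp-minus-data form of CQL over the skill-conditioned Q-values, which may differ from the exact variant used by ODIS.

```python
# Overall objective L = L_TD + alpha * L_CQL + lambda * L_c (illustrative).
import torch

def cql_penalty(q_all, q_data):
    """Conservative penalty: logsumexp over all skills minus the Q-value of the
    skill inferred from the dataset (whether this is applied per-agent or on
    Q_tot is an assumption of this sketch)."""
    return (torch.logsumexp(q_all, dim=-1) - q_data).mean()

def overall_loss(td, cql, consistency, alpha=1.0, lam=1.0):
    return td + alpha * cql + lam * consistency
```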
The learned observation encoder and individual value networks are directly deployed on new tasks in a decentralized manner. The chosen high-level coordination skills are then decoded into low-level actions through the action decoder.
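A sketch of decentralized execution on an unseen task, assuming a frozen observation encoder, individual Q-networks, and action decoder; greedy skill selection and greedy action decoding are illustrative choices.

```python
# Decentralized execution: each agent selects a skill with its own Q_i from
# local information only, then decodes it into a low-level action.
import torch
import torch.nn.functional as F

@torch.no_grad()
def act(obs_encoder, q_net, decoder, obs_history, n_skills=4):
    traj_emb = obs_encoder(obs_history)          # local trajectory embedding
    q_values = q_net(traj_emb)                   # [n_agents, |Z|]
    skill = q_values.argmax(dim=-1)              # greedy high-level skill
    skill_onehot = F.one_hot(skill, n_skills).float()
    action_logits = decoder(traj_emb, skill_onehot)
    return action_logits.argmax(dim=-1)          # low-level action per agent
```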