CSRO

Task Representation Learning

As a context-based OMRL algorithm, CSRO extracts the task information from the offline context through a context encoder

$$
q_{\phi}(z \mid c = \{ (s_{i},\ a_{i},\ r_{i},\ s_{i}') \}_{i = 1}^{n_{c}}) = \frac{1}{n_{c}} \sum_{i = 1}^{n_{c}} q_{\phi}(z \mid s_{i},\ a_{i},\ r_{i},\ s_{i}')
$$
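
A minimal sketch (not the authors' code) of such a permutation-invariant context encoder: each transition $(s, a, r, s')$ is embedded by a shared MLP and the per-transition outputs are averaged over the context, mirroring the mixture in the equation above. A deterministic embedding and the hidden sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    def __init__(self, obs_dim, act_dim, latent_dim, hidden_dim=128):
        super().__init__()
        in_dim = 2 * obs_dim + act_dim + 1  # (s, a, r, s')
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, context):
        """context: (batch, n_c, 2*obs_dim + act_dim + 1) -> z: (batch, latent_dim)."""
        per_transition = self.net(context)   # embed each transition independently
        return per_transition.mean(dim=1)    # average over the context -> order-invariant task embedding
```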

The context encoder can be trained to maximize $I(z;\ M)$ through the contrastive learning objective in FOCAL

$$
\mathcal{L}_{\text{maxMI}}(\phi) = \mathcal{E}_{M_{i},\ M_{j} \sim p(M)}\ \mathcal{E}_{c_{i} \sim \mathcal{D}(M_{i}),\ c_{j} \sim \mathcal{D}(M_{j})}\ \mathcal{E}_{z_{i} \sim q_{\phi}(\cdot \mid c_{i}),\ z_{j} \sim q_{\phi}(\cdot \mid c_{j})} \left[ \boldsymbol{1}(M_{i} = M_{j})\, \| z_{i} - z_{j} \|_{2}^{2} + \boldsymbol{1}(M_{i} \ne M_{j})\, \frac{\beta}{\| z_{i} - z_{j} \|_{2}^{n} + \epsilon} \right]
$$
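
A minimal sketch (assumed, not the official implementation) of this distance-metric objective: embeddings from the same task are pulled together by the squared distance term, embeddings from different tasks are pushed apart by the inverse-power term. It assumes each batch mixes several tasks; `beta`, `n` and `eps` correspond to the symbols in the formula.

```python
import torch

def focal_max_mi_loss(z, task_ids, beta=1.0, n=2, eps=1e-3):
    """z: (B, latent_dim) task embeddings; task_ids: (B,) task index per embedding."""
    dist_sq = torch.cdist(z, z, p=2).pow(2)                  # pairwise squared L2 distances
    same = task_ids.unsqueeze(0) == task_ids.unsqueeze(1)    # (B, B) same-task mask
    off_diag = ~torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = dist_sq[same & off_diag]                           # attract same-task pairs
    neg = beta / (dist_sq[~same].pow(n / 2) + eps)           # repel different-task pairs
    return pos.mean() + neg.mean()
```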

To alleviate the test-time performance decline caused by context distribution shift, CSRO adopts CLUB to reduce $I(z;\ s,\ a)$

$$
\begin{aligned}
\mathcal{L}_{\text{minMI}}(\phi) = I_{\text{CLUB}}(z;\ s,\ a) &= \mathcal{E}_{z,\ s,\ a \sim p(z,\ s,\ a)} \log p(z \mid s,\ a) - \mathcal{E}_{z \sim p(z)}\ \mathcal{E}_{s,\ a \sim p(s,\ a)} \log p(z \mid s,\ a) \ \ge\ I(z;\ s,\ a) \\
&\approx \mathcal{E}_{M \sim p(M)}\ \mathcal{E}_{(s,\ a,\ r,\ s') \sim \mathcal{D}(M)}\ \mathcal{E}_{z \sim q_{\phi}(\cdot \mid s,\ a,\ r,\ s')} \Big[ \log q_{\psi}(z \mid s,\ a) - \mathcal{E}_{\tilde{M} \sim p(M)}\ \mathcal{E}_{(\tilde{s},\ \tilde{a}) \sim \mathcal{D}(\tilde{M})} \log q_{\psi}(z \mid \tilde{s},\ \tilde{a}) \Big]
\end{aligned}
$$
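
A minimal sketch (assumed interface) of the sampled CLUB upper bound: positive pairs score $z$ against the $(s, a)$ that produced it, negative pairs re-score the same $z$ against $(\tilde{s}, \tilde{a})$ from other tasks, approximated here by shuffling the batch. `q_psi(z, sa)` is assumed to return the log-density $\log q_{\psi}(z \mid s, a)$ per sample.

```python
import torch

def club_min_mi_loss(q_psi, z, sa, neg_sa=None):
    """z: (B, latent_dim); sa: (B, obs_dim + act_dim)."""
    if neg_sa is None:
        neg_sa = sa[torch.randperm(sa.size(0))]   # shuffled (s, a) as negative pairs
    log_pos = q_psi(z, sa)                        # log q_psi(z | s, a), positive pairs
    log_neg = q_psi(z, neg_sa)                    # log q_psi(z | s~, a~), negative pairs
    return (log_pos - log_neg).mean()             # sample-based CLUB upper bound
```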

where $p(z \mid s,\ a)$ is approximated by a variational distribution $q_{\psi}(z \mid s,\ a)$ trained in parallel with the context encoder

$$
\mathcal{L}_{\text{VD}}(\psi) = -\mathcal{E}_{M \sim p(M)}\ \mathcal{E}_{(s,\ a,\ r,\ s') \sim \mathcal{D}(M)}\ \mathcal{E}_{z \sim q_{\phi}(\cdot \mid s,\ a,\ r,\ s')} \log q_{\psi}(z \mid s,\ a)
$$
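
A minimal sketch (assumed Gaussian architecture) of the variational network $q_{\psi}(z \mid s, a)$ and its maximum-likelihood objective $\mathcal{L}_{\text{VD}}$: an MLP predicts the mean and log-std of a diagonal Gaussian over $z$ from $(s, a)$, and the loss is the negative log-likelihood of the encoder's $z$ under it.

```python
import torch
import torch.nn as nn

class VariationalDecoder(nn.Module):
    def __init__(self, sa_dim, latent_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sa_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2 * latent_dim),
        )
        self.latent_dim = latent_dim

    def forward(self, z, sa):
        """Return log q_psi(z | s, a) under a diagonal Gaussian."""
        mu, log_std = self.net(sa).split(self.latent_dim, dim=-1)
        dist = torch.distributions.Normal(mu, log_std.exp())
        return dist.log_prob(z).sum(dim=-1)

def vd_loss(q_psi, z, sa):
    # z is a fixed target here: only psi is updated by this objective
    return -q_psi(z.detach(), sa).mean()
```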

CSRO combines the above two objectives to obtain the total loss of the context encoder $\mathcal{L}_{\text{encoder}} = \mathcal{L}_{\text{maxMI}} + \lambda \mathcal{L}_{\text{minMI}}$
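
A tiny sketch of one encoder update combining the two terms, reusing the helper sketches above; `lam` plays the role of $\lambda$ and its value here is an arbitrary assumption.

```python
def encoder_loss(z, task_ids, q_psi, sa, lam=0.1):
    # L_encoder = L_maxMI + lambda * L_minMI
    return focal_max_mi_loss(z, task_ids) + lam * club_min_mi_loss(q_psi, z, sa)
```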

Meta Behavior Learning

Based on the context encoder, CSRO performs offline meta behavior learning with a behavior-regularized actor and critic
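
A heavily hedged sketch of what a $z$-conditioned, behavior-regularized actor update could look like (BRAC-style); the specific regularizer, the estimated behavior-policy log-density `behavior_log_prob`, and the weight `alpha` are assumptions for illustration, not details taken from the text above.

```python
def actor_loss(policy, critic, behavior_log_prob, s, z, alpha=1.0):
    """policy(s, z) returns a torch.distributions.Distribution with per-sample log_prob."""
    dist = policy(s, z)
    a = dist.rsample()                                  # reparameterized action sample
    q = critic(s, a, z)                                 # task-conditioned critic value
    # one-sample estimate of KL(pi(.|s,z) || behavior policy), penalizing
    # actions the offline dataset is unlikely to contain
    reg = dist.log_prob(a) - behavior_log_prob(s, a)
    return (-q + alpha * reg).mean()
```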

At the meta-test phase, CSRO performs random exploration to collect experience in the early stage, instead of acting with $\pi_{\theta}(a \mid s,\ z_{0})$

Such a non-prior context collection strategy eliminates the influence of the initially sampled task representation $z_{0}$
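
A minimal sketch (assumed gym-style environment and encoder/policy interfaces) of this non-prior collection: the exploration phase uses uniformly random actions, so the collected context does not depend on $z_{0}$; only afterwards is the inferred $z$ used to act.

```python
def evaluate_with_nonprior_context(env, encoder, policy, explore_steps=200, eval_steps=1000):
    # 1) random exploration: no task embedding is involved here
    context, s = [], env.reset()
    for _ in range(explore_steps):
        a = env.action_space.sample()
        s_next, r, done, _ = env.step(a)
        context.append((s, a, r, s_next))
        s = env.reset() if done else s_next
    # 2) infer the task representation from the unbiased context only
    z = encoder(context)
    # 3) evaluate the z-conditioned policy
    ret, s = 0.0, env.reset()
    for _ in range(eval_steps):
        s, r, done, _ = env.step(policy(s, z))
        ret += r
        if done:
            break
    return z, ret
```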

