CSRO
Task Representation Learning
As a context-based OMRL algorithm, CSRO extracts task information from the offline context through a context encoder
$$q_\phi\!\left(z \,\middle|\, c = \{(s_i, a_i, r_i, s_i')\}_{i=1}^{n_c}\right) = \frac{1}{n_c} \sum_{i=1}^{n_c} q_\phi(z \mid s_i, a_i, r_i, s_i')$$
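A minimal PyTorch sketch of this per-transition encoding followed by mean aggregation over the context set; the network sizes and the deterministic (non-probabilistic) embedding are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Encodes a context set of transitions into a single task embedding z."""

    def __init__(self, obs_dim, act_dim, z_dim, hidden=128):
        super().__init__()
        # per-transition encoder q_phi(z | s, a, r, s')
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim + act_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, z_dim),
        )

    def forward(self, s, a, r, s_next):
        # s, a, s_next: (n_c, dim) tensors; r: (n_c, 1) tensor holding the context set
        x = torch.cat([s, a, r, s_next], dim=-1)
        z_per_transition = self.net(x)          # (n_c, z_dim)
        return z_per_transition.mean(dim=0)     # average over the context set
```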
The context encoder can be trained to maximize $I(z; M)$ through the contrastive learning objective from FOCAL
$$\mathcal{L}_{\mathrm{maxMI}}(\phi) = \mathbb{E}_{M_i, M_j \sim p(M)}\, \mathbb{E}_{c_i \sim \mathcal{D}(M_i),\, c_j \sim \mathcal{D}(M_j)}\, \mathbb{E}_{z_i \sim q_\phi(\cdot \mid c_i),\, z_j \sim q_\phi(\cdot \mid c_j)} \left[ \mathbf{1}(M_i = M_j)\, \lVert z_i - z_j \rVert_2^2 + \mathbf{1}(M_i \neq M_j)\, \frac{\beta}{\lVert z_i - z_j \rVert_2^n + \epsilon} \right]$$
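A hedged sketch of computing this distance metric loss over a batch of per-transition embeddings; the batching by integer task labels and the default constants are assumptions:

```python
import torch

def focal_metric_loss(z, task_ids, beta=1.0, n=2, eps=1e-3):
    """Pull embeddings of the same task together, push different tasks apart.

    z:        (B, z_dim) task embeddings sampled from q_phi
    task_ids: (B,) integer task labels M_i
    """
    diff = z.unsqueeze(0) - z.unsqueeze(1)             # (B, B, z_dim) pairwise z_i - z_j
    dist_sq = (diff ** 2).sum(-1)                      # squared L2 distances
    same = (task_ids.unsqueeze(0) == task_ids.unsqueeze(1)).float()
    pos = same * dist_sq                               # 1(M_i = M_j) ||z_i - z_j||_2^2
    neg = (1.0 - same) * beta / (dist_sq ** (n / 2) + eps)  # 1(M_i != M_j) beta / (||.||_2^n + eps)
    # exclude the trivial i == j pairs from the positive term
    mask = 1.0 - torch.eye(len(z), device=z.device)
    return ((pos + neg) * mask).mean()
```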
To alleviate the test-time performance decline caused by context distribution shift, CSRO adopts CLUB to reduce $I(z; s, a)$
$$\mathcal{L}_{\mathrm{minMI}}(\phi) = I_{\mathrm{CLUB}}(z; s, a) = \mathbb{E}_{z, s, a \sim p(z, s, a)} \log p(z \mid s, a) - \mathbb{E}_{z \sim p(z)}\, \mathbb{E}_{s, a \sim p(s, a)} \log p(z \mid s, a) \;\ge\; I(z; s, a)$$
In practice this upper bound is approximated with samples from the offline datasets:
$$\mathcal{L}_{\mathrm{minMI}}(\phi) \approx \mathbb{E}_{M \sim p(M)}\, \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}(M)}\, \mathbb{E}_{z \sim q_\phi(\cdot \mid s, a, r, s')} \left[ \log q_\psi(z \mid s, a) - \mathbb{E}_{\tilde{M} \sim p(M)}\, \mathbb{E}_{(\tilde{s}, \tilde{a}) \sim \mathcal{D}(\tilde{M})} \log q_\psi(z \mid \tilde{s}, \tilde{a}) \right]$$
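A sketch of a sample-based CLUB estimate, assuming $q_\psi(z \mid s, a)$ is modeled as a diagonal Gaussian by a small predictor network; the class name `VariationalPredictor` and the shuffled-batch negative sampling are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VariationalPredictor(nn.Module):
    """Variational approximation q_psi(z | s, a) used by CLUB, modeled as a diagonal Gaussian."""

    def __init__(self, obs_dim, act_dim, z_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * z_dim),
        )

    def log_prob(self, s, a, z):
        # Gaussian log-density of z under q_psi(. | s, a), up to an additive constant
        mu, log_var = self.net(torch.cat([s, a], dim=-1)).chunk(2, dim=-1)
        return (-0.5 * ((z - mu) ** 2 / log_var.exp() + log_var)).sum(-1)

def club_min_mi_loss(predictor, s, a, z):
    """Sampled CLUB upper bound on I(z; s, a).

    s, a: (B, dim) state-action pairs from the offline data
    z:    (B, z_dim) task representations from q_phi for the matching transitions
    """
    positive = predictor.log_prob(s, a, z)                 # log q_psi(z_i | s_i, a_i)
    # pair each z_i with a shuffled (s_j, a_j), standing in for data from other tasks
    perm = torch.randperm(len(s))
    negative = predictor.log_prob(s[perm], a[perm], z)     # log q_psi(z_i | s~, a~)
    return (positive - negative).mean()
```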
where $p(z \mid s, a)$ is approximated by a variational distribution $q_\psi(z \mid s, a)$ trained in parallel with the context encoder by minimizing
$$\mathcal{L}_{\mathrm{VD}}(\psi) = -\,\mathbb{E}_{M \sim p(M)}\, \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}(M)}\, \mathbb{E}_{z \sim q_\phi(\cdot \mid s, a, r, s')} \log q_\psi(z \mid s, a)$$
CSRO combines the above two objectives to obtain the total loss of the context encoder, $\mathcal{L}_{\mathrm{encoder}} = \mathcal{L}_{\mathrm{maxMI}} + \lambda\, \mathcal{L}_{\mathrm{minMI}}$, as sketched below.
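A sketch of how the two encoder terms and the predictor loss $\mathcal{L}_{\mathrm{VD}}$ could be combined in one training step, reusing the modules sketched above; the detach placement, the per-transition use of the encoder, and the value of $\lambda$ are assumptions:

```python
import torch

# given: encoder (ContextEncoder), predictor (VariationalPredictor), focal_metric_loss,
# club_min_mi_loss, and a mixed multi-task batch of transitions s, a, s_next of shape
# (B, dim), rewards r of shape (B, 1), and integer task_ids of shape (B,)
z = encoder.net(torch.cat([s, a, r, s_next], dim=-1))   # per-transition embeddings from q_phi

# L_VD: fit q_psi(z | s, a) to the encoder's current outputs (its optimizer updates only psi)
l_vd = -predictor.log_prob(s, a, z.detach()).mean()

# encoder loss: FOCAL metric term plus the lambda-weighted CLUB penalty
# (its optimizer updates only phi, so q_psi is treated as fixed here)
lam = 0.1                                                # placeholder value for lambda
l_encoder = focal_metric_loss(z, task_ids) + lam * club_min_mi_loss(predictor, s, a, z)
```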
Based on the learned context encoder, CSRO then conducts offline meta behavior learning with a behavior-regularized actor and critic.
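The regularization scheme is not spelled out in this section, so the following is only a shape-level sketch of a task-conditioned, behavior-regularized actor-critic update; `behavior_penalty` is a placeholder for whatever divergence-to-behavior-policy term the implementation uses, and all signatures are assumptions:

```python
import torch

def behavior_regularized_losses(actor, critic, target_critic, behavior_penalty,
                                s, a, r, s_next, done, z, gamma=0.99, alpha=1.0):
    """Shape-level sketch of task-conditioned, behavior-regularized actor-critic losses.

    z is the task embedding from the context encoder; it is detached so behavior
    learning does not backpropagate into q_phi.
    """
    z = z.detach()
    with torch.no_grad():
        q_target = r + gamma * (1.0 - done) * target_critic(s_next, actor(s_next, z), z)
    critic_loss = ((critic(s, a, z) - q_target) ** 2).mean()

    new_a = actor(s, z)
    actor_loss = (-critic(s, new_a, z) + alpha * behavior_penalty(s, new_a)).mean()
    return critic_loss, actor_loss
```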
At the meta-test phase, CSRO performs random exploration to collect experience in the early stage, instead of rolling out $\pi_\theta(a \mid s, z_0)$.
Such a non-prior context collection strategy eliminates the influence of the initially sampled task representation $z_0$ on the collected context.
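A sketch of this non-prior context collection at meta-test time; the fixed exploration budget, the old-style gym step interface, and the actor/encoder signatures are assumptions:

```python
import numpy as np
import torch

def meta_test_rollout(env, actor, encoder, explore_steps=200, max_steps=1000):
    """Collect early context with random actions, then act with the inferred task embedding."""
    context, z = [], None
    s = env.reset()
    for t in range(max_steps):
        if t < explore_steps:
            a = env.action_space.sample()              # non-prior exploration: no z_0 is used
        else:
            if z is None:
                # infer z once, from the context gathered by random exploration
                s_c, a_c, r_c, sn_c = (np.asarray(x) for x in zip(*context))
                z = encoder(torch.as_tensor(s_c).float(),
                            torch.as_tensor(a_c).float(),
                            torch.as_tensor(r_c).float().unsqueeze(-1),
                            torch.as_tensor(sn_c).float())
            a = actor(torch.as_tensor(s).float(), z).detach().numpy()
        s_next, r, done, _ = env.step(a)
        context.append((s, a, r, s_next))
        s = s_next
        if done:
            break
    return context, z
```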