UNICORN

Theoretic Framework

The probabilistic model of context-based offline meta-RL (COMRL) consists of the following random variables and (conditional) distribution terms:

Random Variable	Description	CPD
$M$	Task (Instance of MDP)	$p(M)$
$X_{b} = (s,\ a)$	Behavior-Related Context	$p(X_{b})$
$X_{t} = (r,\ s')$	Task-Related Context	$p(X_{t} \mid X_{b},\ M)$
$Z$	Task Representation	$p(Z \mid X_{b},\ X_{t})$

Task representation learning in COMRL aims to find a minimal sufficient statistic $Z$ of the task $M$ based on the context $X = (X_{b},\ X_{t})$

\begin{gathered} \max_{p(z \mid x)} I(Z;\ M) \\[5mm] \text{s.t.} \quad I(Z;\ M;\ X_{b}) \ge 0 \end{gathered}

Direct optimization is intractable in practice. Under the assumption $I(Z;\ M;\ X_{b}) \ge 0$ , $I(Z;\ M)$ is lower bounded as

\begin{aligned} I(Z;\ M) &= I(Z;\ M \mid X_{b}) + \underset{\ge 0}{\underbrace{I(Z;\ M;\ X_{b})}} \ge I(Z;\ M \mid X_{b}) = I(Z,\ X_{t};\ M \mid X_{b}) - I(X_{t};\ M \mid Z,\ X_{b}) \\[7mm] &= I(X_{t};\ M \mid X_{b}) + \underset{= 0}{\underbrace{I(Z;\ M \mid X_{t},\ X_{b})}} - I(X_{t};\ M \mid Z,\ X_{b}) \quad \Leftarrow \quad (Z \perp M \mid X_{t},\ X_{b}) \\[7mm] &= I(X_{t};\ M \mid X_{b}) - H(X_{t} \mid Z,\ X_{b}) + \underset{\ge 0}{\underbrace{H(X_{t} \mid Z,\ X_{b},\ M)}} \ge I(X_{t};\ M \mid X_{b}) - H(X_{t} \mid Z,\ X_{b}) \\[7mm] &= I(X_{t};\ M \mid X_{b}) - H(X_{t}) + I(Z,\ X_{b};\ X_{t}) = \underset{\text{const}}{\underbrace{I(X_{t};\ M \mid X_{b})}} - \underset{\text{const}}{\underbrace{H(X_{t})}} + \underset{\text{const}}{\underbrace{I(X_{t};\ X_{b})}} + I(Z;\ X_{t} \mid X_{b}) \end{aligned}

In addition, $I(Z;\ M)$ is upper bounded by $I(Z;\ X)$ because of the Markov chain $M \to X \to Z$ in the dependency graph; the inequality follows from Jensen's inequality

I(Z;\ M) - I(Z;\ X) = \mathcal{E}_{M,\ x,\ z} \left[ \log \frac{p(z \mid M)}{p(z \mid x)} \right] \le \log \sum_{M} \sum_{x} \sum_{z} p(M) p(x \mid M) \cancel{p(z \mid x)} \frac{p(z \mid M)}{\cancel{p(z \mid x)}} = 0
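
The upper bound can be checked numerically on a small discrete model. The sketch below is only an illustration: the task prior, the context distributions, and the encoder table are made-up numbers, not values from any COMRL benchmark. It builds the joint $p(M) p(x \mid M) p(z \mid x)$ implied by the chain $M \to X \to Z$ and compares the two mutual informations.

```python
import math

# Hypothetical toy model: 2 tasks, 3 contexts, 2 representation values.
p_M = [0.5, 0.5]
p_x_given_M = [[0.7, 0.2, 0.1],   # p(x | M=0)
               [0.1, 0.3, 0.6]]   # p(x | M=1)
p_z_given_x = [[0.9, 0.1],        # p(z | x=0)
               [0.5, 0.5],        # p(z | x=1)
               [0.2, 0.8]]        # p(z | x=2)

def mutual_information(joint):
    """I(A;B) in nats for a joint probability table joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for a, row in enumerate(joint)
               for b, p in enumerate(row) if p > 0)

# p(x, z): marginalize M out of p(M) p(x|M) p(z|x).
joint_xz = [[sum(p_M[m] * p_x_given_M[m][x] for m in range(2)) * p_z_given_x[x][z]
             for z in range(2)] for x in range(3)]
# p(M, z): marginalize x out of the same joint.
joint_mz = [[sum(p_M[m] * p_x_given_M[m][x] * p_z_given_x[x][z] for x in range(3))
             for z in range(2)] for m in range(2)]

i_zx = mutual_information(joint_xz)
i_zm = mutual_information(joint_mz)
print(f"I(Z;M) = {i_zm:.4f} nats, I(Z;X) = {i_zx:.4f} nats")
```

Because $Z$ only sees $M$ through $X$, any choice of the three tables should give $I(Z;\ M) \le I(Z;\ X)$.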

Given these two bounds on $I(Z;\ M)$ , several existing COMRL algorithms can be interpreted as follows

Algorithm	Essential Optimization Objective	Description
FOCAL	$\max I(Z;\ X)$	Upper Bound
CORRO	$\max I(Z;\ X_{t} \mid X_{b})$	Lower Bound
CSRO	$\max \lambda I(Z;\ X_{t} \mid X_{b}) + (1 - \lambda) I(Z;\ X)$	Hybrid

FOCAL

FOCAL maximizes the upper bound $I(Z;\ X)$ , which is equivalent to minimizing its distance metric loss

\begin{aligned} I(Z;\ X) &= \mathcal{E}_{x,\ z} \bigg[ \log \frac{p(z,\ x)}{p(z) p(x)} \bigg] = \mathcal{E}_{x,\ z} \bigg[ \log \underset{h(x,\ z)}{\underbrace{\frac{p(z \mid x)}{p(z)}}} \bigg/ |\mathcal{M}| \underset{= 1}{\underbrace{\mathcal{E}_{x'} \left[ \frac{p(z \mid x')}{p(z)} \right]}} \bigg] + \underset{\text{const}}{\underbrace{\log |\mathcal{M}|}} \\[7mm] &\approx \sum_{M \in \mathcal{M}} \sum_{x \in \mathcal{D}(M)} \log \frac{h(x,\ z)}{\sum_{M' \in \mathcal{M}} \sum_{x' \in \mathcal{D}(M')} h(x',\ z)} \quad \Leftarrow \quad z \sim f_{\phi}(\cdot \mid x) \end{aligned}

However, such an objective may learn spurious correlations under a distribution shift of $X_{b}$ (in the extreme case, $Z$ depends solely on $X_{b}$ )
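
To make the link between the contrastive form of $I(Z;\ X)$ and a distance-based loss concrete, here is a minimal sketch under one explicit assumption that is not stated in the text: the encoder $f_{\phi}(\cdot \mid x)$ is an isotropic Gaussian with fixed `sigma`. In that case $p(z)$ cancels in the ratio $h(x,\ z) / \sum_{x'} h(x',\ z)$ and only squared distances between $z$ and the encoder means remain. The function name `info_nce` and the toy embeddings are hypothetical, with one context per task for brevity.

```python
import math

def info_nce(z_samples, means, sigma=1.0):
    """Average of log h(x_i, z_i) / sum_j h(x_j, z_i) for a Gaussian encoder.

    z_samples[i] plays the role of z_i ~ N(means[i], sigma^2 I) and means[j]
    is the encoder mean of context x_j. Additive constants are dropped.
    """
    total = 0.0
    for i, z in enumerate(z_samples):
        # log h(x_j, z_i) up to a term shared by all j: scaled negative squared distance
        logits = [-sum((a - b) ** 2 for a, b in zip(z, mu)) / (2 * sigma ** 2)
                  for mu in means]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += logits[i] - log_norm
    return total / len(z_samples)

# Hypothetical 2-D embeddings, using the means themselves as noise-free samples.
separated = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
collapsed = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
score_sep = info_nce(separated, separated)
score_col = info_nce(collapsed, collapsed)
print(f"separated: {score_sep:.4f}, collapsed: {score_col:.4f}")
```

Well-separated embeddings push the estimate toward its maximum of 0, while collapsed embeddings approach $-\log 3$ for three contexts, so maximizing the estimate pulls different contexts apart in $Z$ space.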

CORRO

To alleviate the degradation caused by distribution shift, CORRO proposes to maximize the lower bound $I(Z;\ X_{t} \mid X_{b})$

\begin{aligned} I(Z;\ X_{t} \mid X_{b}) &= \mathcal{E}_{x,\ z} \bigg[ \log \underset{h(x_{b},\ x_{t},\ z)}{\underbrace{\frac{p(z \mid x_{t},\ x_{b})}{p(z \mid x_{b})}}} \bigg/ |\mathcal{M}| \underset{= 1}{\underbrace{\mathcal{E}_{M^{*} \sim p(M)} \mathcal{E}_{x_{t}^{*} \sim p(x_{t} \mid M^{*},\ x_{b})} \left[ \frac{p(z \mid x_{t}^{*},\ x_{b})}{p(z \mid x_{b})} \right]}} \bigg] + \underset{\text{const}}{\underbrace{\log |\mathcal{M}|}} \\[7mm] &\approx \sum_{M \in \mathcal{M}} \sum_{x \in \mathcal{D}(M)} \log \frac{h(x_{b},\ x_{t},\ z)}{\sum_{M^{*} \sim \mathcal{D}} h(x_{b},\ x_{t}^{*},\ z)} \quad \Leftarrow \quad z \sim f_{\phi}(\cdot \mid x_{b},\ x_{t}) \quad x_{t}^{*} \sim g_{\psi}(\cdot \mid x_{b},\ M^{*}) \end{aligned}

CSRO

CSRO maximizes $I(Z;\ X)$ while minimizing the contrastive log-ratio upper bound (CLUB) of $I(Z;\ X_{b})$ to alleviate the distribution shift of the context

\begin{aligned} &I(Z;\ X) - \lambda I_{\text{CLUB}}(Z;\ X_{b}) = I(Z;\ X) - \lambda \Big[ \mathcal{E}_{(z,\ x_{b}) \sim p(z,\ x_{b})} \log p(z \mid x_{b}) - \mathcal{E}_{z \sim p(z)} \mathcal{E}_{x_{b} \sim p(x_{b})} \log p(z \mid x_{b}) \Big] \\[5mm] \le\ &I(Z;\ X) - \lambda I(Z;\ X_{b}) = I(Z;\ X_{t},\ X_{b}) - \lambda \Big[ I(Z;\ X_{t},\ X_{b}) - I(Z;\ X_{t} \mid X_{b}) \Big] = \lambda I(Z;\ X_{t} \mid X_{b}) + (1 - \lambda) I(Z;\ X) \end{aligned}
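
The last equality rests on the chain rule $I(Z;\ X_{b}) = I(Z;\ X_{t},\ X_{b}) - I(Z;\ X_{t} \mid X_{b})$. As a sanity check, the sketch below draws a random discrete joint $p(x_{b},\ x_{t},\ z)$ and evaluates both sides. The alphabet sizes, the seed, and the value of `lam` are arbitrary choices made for illustration only.

```python
import math
import random

random.seed(0)
# Hypothetical random joint p(x_b, x_t, z) over small discrete alphabets.
weights = {(b, t, z): random.random()
           for b in range(3) for t in range(3) for z in range(2)}
norm = sum(weights.values())
p = {k: w / norm for k, w in weights.items()}

def marginal(keep):
    """Marginal over the axes in keep (0: x_b, 1: x_t, 2: z)."""
    out = {}
    for k, v in p.items():
        key = tuple(k[i] for i in keep)
        out[key] = out.get(key, 0.0) + v
    return out

p_b, p_z = marginal((0,)), marginal((2,))
p_bz, p_bt = marginal((0, 2)), marginal((0, 1))

# I(Z; X) with X = (X_b, X_t)
i_zx = sum(v * math.log(v / (p_bt[(b, t)] * p_z[(z,)]))
           for (b, t, z), v in p.items())
# I(Z; X_b)
i_zxb = sum(v * math.log(v / (p_b[(b,)] * p_z[(z,)]))
            for (b, z), v in p_bz.items())
# I(Z; X_t | X_b), computed directly from its definition
i_zxt_b = sum(v * math.log(v * p_b[(b,)] / (p_bz[(b, z)] * p_bt[(b, t)]))
              for (b, t, z), v in p.items())

lam = 0.3  # hypothetical trade-off weight
lhs = i_zx - lam * i_zxb
rhs = lam * i_zxt_b + (1 - lam) * i_zx
print(f"lhs = {lhs:.6f}, rhs = {rhs:.6f}")
```

The two sides agree up to floating-point error for any joint distribution, since the identity does not depend on the model.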

General Implementation

With the derived theoretical framework, UNICORN formulates the optimization objective based on the information bottleneck

\min_{p(z \mid x)} I(Z;\ X) \quad \text{s.t.} \quad I(Z;\ M) \ge I_{c} \quad \Longrightarrow \quad \min_{p(z \mid x)} \mathcal{L}_{\text{IB}} = I(Z;\ X) - \beta I(Z;\ M)

The first term $I(Z;\ X)$ is implemented as the FOCAL objective and the second term is approximated as

I(Z;\ M) \approx \alpha I(Z;\ X) + (1 - \alpha) I(Z;\ X_{t} \mid X_{b}) \qquad \alpha \in [0,\ 1]

which is a convex combination of FOCAL and CORRO, as in CSRO. Substituting the approximation into $\mathcal{L}_{\text{IB}}$ and dividing by $(1 - \alpha) \beta > 0$ gives

\mathcal{L}_{\text{IB}} = I(Z;\ X) - \alpha \beta I(Z;\ X) - (1 - \alpha) \beta I(Z;\ X_{t} \mid X_{b}) \Rightarrow -\left[ I(Z;\ X_{t} \mid X_{b}) + \frac{\alpha \beta - 1}{(1 - \alpha) \beta} I(Z;\ X) \right]
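
The rescaling step is plain arithmetic and can be verified directly. In the sketch below the function names and all numeric values are hypothetical; it only checks that dividing the unscaled $\mathcal{L}_{\text{IB}}$ by $(1 - \alpha) \beta$ reproduces the bracketed form, which requires $\alpha < 1$ and $\beta > 0$.

```python
def focal_weight(alpha, beta):
    """Coefficient of I(Z;X) after dividing L_IB by (1 - alpha) * beta."""
    if not 0.0 <= alpha < 1.0 or beta <= 0.0:
        raise ValueError("need 0 <= alpha < 1 and beta > 0")
    return (alpha * beta - 1.0) / ((1.0 - alpha) * beta)

def loss_ib(i_zx, i_zxt_b, alpha, beta):
    """Unscaled L_IB with I(Z;M) replaced by the convex combination."""
    return i_zx - beta * (alpha * i_zx + (1.0 - alpha) * i_zxt_b)

def loss_scaled(i_zx, i_zxt_b, alpha, beta):
    """Rescaled objective -[I(Z;X_t|X_b) + w * I(Z;X)]."""
    return -(i_zxt_b + focal_weight(alpha, beta) * i_zx)

alpha, beta = 0.5, 4.0      # hypothetical hyperparameters
i_zx, i_zxt_b = 0.8, 0.3    # hypothetical mutual-information values
scale = (1.0 - alpha) * beta

unscaled = loss_ib(i_zx, i_zxt_b, alpha, beta)
scaled = loss_scaled(i_zx, i_zxt_b, alpha, beta)
print(unscaled / scale, scaled)
```

Note that the weight on $I(Z;\ X)$ is positive only when $\alpha \beta > 1$; for $\alpha \beta < 1$ the rescaled loss penalizes $I(Z;\ X)$ instead of rewarding it.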

Instead of approximating $I(Z;\ X_{t} \mid X_{b})$ with CORRO or with the CLUB term of CSRO, UNICORN proposes to rewrite it as

I(Z;\ X_{t} \mid X_{b}) = I(Z,\ X_{b};\ X_{t}) - \underset{\text{const}}{\underbrace{I(X_{t};\ X_{b})}} \approx \underset{-\mathcal{L}_{\text{recon}}}{\underbrace{\mathcal{E}_{(x_{t},\ x_{b}) \sim p(x_{t},\ x_{b})} \mathcal{E}_{z \sim q_{\phi}(z \mid x_{t},\ x_{b})} \log p_{\theta}(x_{t} \mid z,\ x_{b})}} + \text{const}

where $p_{\theta}(x_{t} \mid z,\ x_{b})$ is a learned decoder introduced to approximate $p(x_{t} \mid z,\ x_{b})$ ; it can also serve as the world model
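
As a rough illustration of $\mathcal{L}_{\text{recon}}$, the sketch below assumes a Gaussian decoder with fixed unit variance, so the negative log-likelihood reduces to a squared error plus a constant. This choice, the scalar $z$, and the hand-written `toy_decoder` are assumptions made for the example; the text does not specify the decoder family.

```python
import math

def gaussian_nll(target, mean, sigma=1.0):
    """-log N(target; mean, sigma^2 I) for one x_t = (r, s') vector."""
    d = len(target)
    sq = sum((t - m) ** 2 for t, m in zip(target, mean))
    return 0.5 * sq / sigma ** 2 + d * math.log(sigma) + 0.5 * d * math.log(2 * math.pi)

def recon_loss(batch, decoder):
    """Monte Carlo estimate of L_recon = -E[log p_theta(x_t | z, x_b)]."""
    return sum(gaussian_nll(x_t, decoder(z, x_b)) for z, x_b, x_t in batch) / len(batch)

def toy_decoder(z, x_b):
    """Hypothetical stand-in for a learned network: predicts (r, s') from z and (s, a)."""
    s, a = x_b
    return (z * a, s + a)

def biased_decoder(z, x_b):
    """Same as toy_decoder, but with the reward prediction shifted by 1."""
    r, s_next = toy_decoder(z, x_b)
    return (r + 1.0, s_next)

# Each entry is (z, x_b = (s, a), x_t = (r, s')); values are made up.
batch = [
    (2.0, (0.0, 1.0), (2.0, 1.0)),
    (2.0, (1.0, 0.5), (1.0, 1.5)),
]
loss = recon_loss(batch, toy_decoder)
loss_biased = recon_loss(batch, biased_decoder)
print(f"exact decoder: {loss:.4f}, biased decoder: {loss_biased:.4f}")
```

With `toy_decoder` the predictions match the targets exactly, so only the Gaussian normalization constant remains; the biased decoder adds half the squared error on top of it.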

http://example.com/2024/10/26/UNICORN/

Author

木辛

Posted on

October 26, 2024
