ELLM

Goal Generation

ELLM (Exploring with LLMs) uses an autoregressive LLM (GPT-3) to generate suggested goals for agent pretraining. The suggested goals satisfy three properties:

  1. diverse: the goal distribution can be as large as the space of natural language strings
  2. common-sense sensitive: generated goals are compatible with the human prior knowledge absorbed by the LLM
  3. context sensitive: goals are generated from the current environment configuration

At each timestep $t$, the LLM is queried for $k$ suggested goals $g_{t}^{1:k}$ through a constructed prompt made up of

  1. guidance towards desirable suggestions (few-shot prompting)
  2. a list of the agent's valid actions
  3. a text caption of the current observation, generated by a state captioner $C_{obs} : \Omega \mapsto \Sigma^{*}$

To impose a novelty bias, the suggestions that the agent has already achieved earlier are filtered out
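
A minimal sketch of this goal-suggestion step is shown below. The `build_prompt` layout, the `query_llm` helper, and the filtering by lowercase string match are hypothetical placeholders, not the paper's exact prompt templates or GPT-3 API calls:

```python
from typing import Callable, List, Set

def build_prompt(few_shot_examples: str, valid_actions: List[str], obs_caption: str) -> str:
    """Assemble the goal-suggestion prompt from its three components."""
    return (
        f"{few_shot_examples}\n"                        # 1. guidance via few-shot examples
        f"Valid actions: {', '.join(valid_actions)}\n"  # 2. the agent's valid actions
        f"{obs_caption}\n"                              # 3. caption of the current observation
        "What should you do next?"                      # hypothetical question suffix
    )

def suggest_goals(query_llm: Callable[[str], List[str]],
                  few_shot_examples: str,
                  valid_actions: List[str],
                  obs_caption: str,
                  achieved: Set[str],
                  k: int = 5) -> List[str]:
    """Query the LLM for goal suggestions g_t^{1:k}, dropping already-achieved ones."""
    prompt = build_prompt(few_shot_examples, valid_actions, obs_caption)
    suggestions = query_llm(prompt)  # e.g. k sampled completions from GPT-3
    novel = [g for g in suggestions if g.strip().lower() not in achieved]
    return novel[:k]
```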

Intrinsic Reward

ELLM pretrains the agent to adopt human-meaningful behavior in an intrinsically motivated manner

$$
\pi(a \mid o,\ g) = \argmax_{\pi} \mathbb{E}_{\pi} \sum_{t = 0}^{\infty} \gamma^{t}\, \mathcal{R}_{\mathrm{int}}(o_{t},\ a_{t},\ o_{t + 1}) = \argmax_{\pi} \mathbb{E}_{\pi} \sum_{t = 0}^{\infty} \gamma^{t}\, \mathbb{E}_{g \sim \mathcal{G}}\, \mathcal{R}_{\mathrm{int}}(o_{t},\ a_{t},\ o_{t + 1} \mid g_{t}^{1:k})
$$

The goal-conditioned intrinsic reward $\mathcal{R}_{\mathrm{int}}(o_{t},\ a_{t},\ o_{t + 1} \mid g_{t}^{1:k})$ is computed by measuring the semantic similarity (cosine similarity) between the generated goals $g_{t}^{1:k}$ and the caption of the agent's transition $(o_{t},\ a_{t},\ o_{t + 1})$

$$
\Delta_{\max} = \max_{i = 1..k} \Delta\!\left(C_{\mathrm{transition}}(o_{t},\ a_{t},\ o_{t + 1}),\ g_{t}^{i}\right) = \max_{i = 1..k} \frac{E\!\left[C_{\mathrm{transition}}(o_{t},\ a_{t},\ o_{t + 1})\right] \cdot E\!\left[g_{t}^{i}\right]}{\left\| E\!\left[C_{\mathrm{transition}}(o_{t},\ a_{t},\ o_{t + 1})\right]\right\| \left\| E\!\left[g_{t}^{i}\right]\right\|}
$$

$$
\mathcal{R}_{\mathrm{int}}(o_{t},\ a_{t},\ o_{t + 1} \mid g_{t}^{1:k}) =
\begin{cases}
\Delta_{\max} & \Delta_{\max} > T \\
0 & \Delta_{\max} \le T
\end{cases}
$$

where

  1. the caption of the transition is computed by a transition captioner $C_{\mathrm{transition}} : \Omega \times \mathcal{A} \times \Omega \mapsto \Sigma^{*}$
  2. the goals and captions are embedded into a semantic space by an LLM encoder (SentenceBERT) $E[\cdot]$
  3. the intrinsic reward $\mathcal{R}_{\mathrm{int}}$ is truncated by a similarity-threshold hyperparameter $T$
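
A minimal sketch of this reward computation, assuming a SentenceBERT encoder from the sentence-transformers library; the checkpoint name and the threshold value $T$ are placeholders rather than the paper's settings:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed embedding operator E[.]; the exact SentenceBERT checkpoint is a placeholder.
encoder = SentenceTransformer("paraphrase-MiniLM-L6-v2")

def intrinsic_reward(transition_caption: str, goals: list[str], T: float = 0.5) -> float:
    """R_int: max cosine similarity between the transition caption and the k goals,
    zeroed out when it falls below the similarity threshold T."""
    caption_emb = encoder.encode(transition_caption)   # E[C_transition(o_t, a_t, o_{t+1})]
    goal_embs = encoder.encode(goals)                   # E[g_t^i] for i = 1..k, shape (k, d)
    sims = goal_embs @ caption_emb / (
        np.linalg.norm(goal_embs, axis=1) * np.linalg.norm(caption_emb)
    )
    delta_max = float(sims.max())                       # Δ_max over the k suggestions
    return delta_max if delta_max > T else 0.0          # truncate below threshold T
```

During pretraining this intrinsic reward takes the place of the environment reward, so the policy maximizes the discounted sum of $\mathcal{R}_{\mathrm{int}}$ as in the objective above.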
