ELLM

Goal Generation

ELLM (Exploring with LLMs) uses an autoregressive LLM (GPT-3) to generate suggested goals for agent pretraining. The suggested goals satisfy three properties:

  1. diverse: the goal distribution can be as large as the space of natural language strings
  2. common-sense sensitive: generated goals are compatible with the human prior knowledge absorbed by the LLM
  3. context sensitive: goals are generated from the current environment configuration

At each timestep $t$, the LLM is queried for $k$ suggested goals $g_{t}^{1:k}$ through a constructed prompt made up of

  1. guidance towards desirable suggestions (few-shot prompting)
  2. a list of the agent's valid actions
  3. a text caption of the current observation, generated by a state captioner $C_{obs} : \Omega \mapsto \Sigma^{*}$

To impose a novelty bias, the suggestions that the agent has already achieved earlier are filtered out
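
A minimal sketch of this goal-suggestion step is shown below. The `build_prompt` layout, the `query_llm` helper, and the filtering by lowercase string match are hypothetical placeholders, not the paper's exact prompt templates or GPT-3 API calls:

```python
from typing import Callable, List, Set

def build_prompt(few_shot_examples: str, valid_actions: List[str], obs_caption: str) -> str:
    """Assemble the goal-suggestion prompt from its three components."""
    return (
        f"{few_shot_examples}\n"                        # 1. guidance via few-shot examples
        f"Valid actions: {', '.join(valid_actions)}\n"  # 2. the agent's valid actions
        f"{obs_caption}\n"                              # 3. caption of the current observation
        "What should you do next?"                      # hypothetical question suffix
    )

def suggest_goals(query_llm: Callable[[str], List[str]],
                  few_shot_examples: str,
                  valid_actions: List[str],
                  obs_caption: str,
                  achieved: Set[str],
                  k: int = 5) -> List[str]:
    """Query the LLM for goal suggestions g_t^{1:k}, dropping already-achieved ones."""
    prompt = build_prompt(few_shot_examples, valid_actions, obs_caption)
    suggestions = query_llm(prompt)  # e.g. k sampled completions from GPT-3
    novel = [g for g in suggestions if g.strip().lower() not in achieved]
    return novel[:k]
```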

Intrinsic Reward

ELLM pretrains the agent to adopt human-meaningful behavior in an intrinsically motivated manner

$$
\pi(a \mid o,\ g) = \argmax_{\pi} \mathbb{E}_{\pi} \sum_{t = 0}^{\infty} \gamma^{t}\, \mathcal{R}_{\mathrm{int}}(o_{t},\ a_{t},\ o_{t + 1}) = \argmax_{\pi} \mathbb{E}_{\pi} \sum_{t = 0}^{\infty} \gamma^{t}\, \mathbb{E}_{g \sim \mathcal{G}}\, \mathcal{R}_{\mathrm{int}}(o_{t},\ a_{t},\ o_{t + 1} \mid g_{t}^{1:k})
$$

The goal-conditioned intrinsic reward $\mathcal{R}_{\mathrm{int}}(o_{t},\ a_{t},\ o_{t + 1} \mid g_{t}^{1:k})$ is computed by measuring the semantic similarity (cosine similarity) between the generated goals $g_{t}^{1:k}$ and the caption of the agent's transition $(o_{t},\ a_{t},\ o_{t + 1})$

$$
\Delta_{\max} = \max_{i = 1..k} \Delta\!\left(C_{\mathrm{transition}}(o_{t},\ a_{t},\ o_{t + 1}),\ g_{t}^{i}\right) = \max_{i = 1..k} \frac{E\!\left[C_{\mathrm{transition}}(o_{t},\ a_{t},\ o_{t + 1})\right] \cdot E\!\left[g_{t}^{i}\right]}{\left\| E\!\left[C_{\mathrm{transition}}(o_{t},\ a_{t},\ o_{t + 1})\right]\right\| \left\| E\!\left[g_{t}^{i}\right]\right\|}
$$

$$
\mathcal{R}_{\mathrm{int}}(o_{t},\ a_{t},\ o_{t + 1} \mid g_{t}^{1:k}) =
\begin{cases}
\Delta_{\max} & \Delta_{\max} > T \\
0 & \Delta_{\max} \le T
\end{cases}
$$

where

  1. the caption of the transition is computed by a transition captioner $C_{\mathrm{transition}} : \Omega \times \mathcal{A} \times \Omega \mapsto \Sigma^{*}$
  2. the goals and captions are embedded into a semantic space by an LLM encoder (SentenceBERT) $E[\cdot]$
  3. the intrinsic reward $\mathcal{R}_{\mathrm{int}}$ is truncated by a similarity-threshold hyperparameter $T$
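
A minimal sketch of this reward computation, assuming a SentenceBERT encoder from the sentence-transformers library; the checkpoint name and the threshold value $T$ are placeholders rather than the paper's settings:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed embedding operator E[.]; the exact SentenceBERT checkpoint is a placeholder.
encoder = SentenceTransformer("paraphrase-MiniLM-L6-v2")

def intrinsic_reward(transition_caption: str, goals: list[str], T: float = 0.5) -> float:
    """R_int: max cosine similarity between the transition caption and the k goals,
    zeroed out when it falls below the similarity threshold T."""
    caption_emb = encoder.encode(transition_caption)   # E[C_transition(o_t, a_t, o_{t+1})]
    goal_embs = encoder.encode(goals)                   # E[g_t^i] for i = 1..k, shape (k, d)
    sims = goal_embs @ caption_emb / (
        np.linalg.norm(goal_embs, axis=1) * np.linalg.norm(caption_emb)
    )
    delta_max = float(sims.max())                       # Δ_max over the k suggestions
    return delta_max if delta_max > T else 0.0          # truncate below threshold T
```

During pretraining this intrinsic reward takes the place of the environment reward, so the policy maximizes the discounted sum of $\mathcal{R}_{\mathrm{int}}$ as in the objective above.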
