ELLM
Goal Generation
ELLM uses autoregressive LLMs (GPT-3) to generate suggested goals for agent pretraining. The suggested goals should be
- diverse: the space of possible goals is as large as the space of natural language strings
- common-sense sensitive: generated goals are compatible with the human prior knowledge captured by LLMs
- context sensitive: goals are generated from the current environment configuration
At each timestep $t$, the LLM is queried for suggested goals $g_t^{1:k}$ through a constructed prompt made up of
- guidance towards desirable suggestions (few-shot prompting)
- list of agent’s valid actions
- text caption of the current observation, generated by a state captioner $C_{\text{obs}}: \Omega \mapsto \Sigma^*$
To impose a novelty bias, the suggestions that the agent has already achieved earlier are filtered out
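A minimal sketch of this goal-suggestion step in Python. The prompt template, few-shot text, and the `llm_complete` callable are illustrative assumptions, not the paper's exact prompt or API:

```python
from typing import Callable, List, Set

# Illustrative few-shot guidance; the paper's actual examples and wording differ.
FEW_SHOT_GUIDANCE = (
    "Valid actions: sleep, eat, attack, chop, drink, place, make, mine.\n"
    "You see plant, tree, and skeleton. You are targeting skeleton. What do you do?\n"
    "- Eat plant\n- Chop tree\n- Attack skeleton\n"
)

def build_prompt(valid_actions: List[str], obs_caption: str) -> str:
    """Assemble the prompt from few-shot guidance, the agent's valid actions,
    and the state captioner's description of the current observation."""
    return (
        f"{FEW_SHOT_GUIDANCE}\n"
        f"Valid actions: {', '.join(valid_actions)}.\n"
        f"{obs_caption} What do you do?\n-"
    )

def suggest_goals(
    llm_complete: Callable[[str], str],   # wrapper around an autoregressive LLM completion call
    valid_actions: List[str],
    obs_caption: str,
    achieved_goals: Set[str],
    k: int = 5,
) -> List[str]:
    """Query the LLM for suggested goals g_t^{1:k}, filtering out goals the
    agent has already achieved (novelty bias)."""
    completion = llm_complete(build_prompt(valid_actions, obs_caption))
    goals = [line.strip("- ").strip().lower()
             for line in completion.splitlines() if line.strip()]
    return [g for g in goals if g not in achieved_goals][:k]
```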
Intrinsic Reward
ELLM pretrains the agent to adopt human-meaningful behavior in an intrinsically motivated manner:

$$\pi(a \mid o, g) = \arg\max_{\pi} \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t R_{\text{int}}(o_t, a_t, o_{t+1})\right] = \arg\max_{\pi} \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t\, \mathbb{E}_{g \sim G}\, R_{\text{int}}(o_t, a_t, o_{t+1} \mid g_t^{1:k})\right]$$
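A rough sketch of how the pieces fit together during pretraining. The `env`/`agent` interfaces are hypothetical, `suggest_goals` is the sketch above, and `reward_fn` stands in for the goal-conditioned intrinsic reward defined below:

```python
from typing import Callable, List, Set, Tuple

def ellm_pretrain(
    env,                                   # hypothetical env with .reset(), .step(), .valid_actions
    agent,                                 # hypothetical goal-conditioned agent with .act(), .update()
    llm_complete: Callable[[str], str],
    caption_obs: Callable,                 # state captioner C_obs
    caption_transition: Callable,          # transition captioner C_transition
    reward_fn: Callable[[str, List[str]], Tuple[float, int]],  # returns (R_int, index of best-matching goal)
    num_steps: int = 100_000,
    k: int = 5,
) -> None:
    """Intrinsically motivated pretraining on LLM-suggested goals (sketch)."""
    achieved: Set[str] = set()
    obs = env.reset()
    for _ in range(num_steps):
        goals = suggest_goals(llm_complete, env.valid_actions, caption_obs(obs), achieved, k=k)
        action = agent.act(obs, goals)     # goal-conditioned policy pi(a | o, g)
        next_obs, done = env.step(action)
        r_int, best = reward_fn(caption_transition(obs, action, next_obs), goals)
        agent.update(obs, goals, action, r_int, next_obs, done)
        if r_int > 0 and goals:            # goal judged achieved -> feed the novelty filter
            achieved.add(goals[best])
        obs = env.reset() if done else next_obs
```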
The goal-conditioned intrinsic reward $R_{\text{int}}(o_t, a_t, o_{t+1} \mid g_t^{1:k})$ is computed by measuring the semantic similarity (cosine similarity) between the generated goals $g_t^{1:k}$ and the caption of the agent's transition $(o_t, a_t, o_{t+1})$:
$$\Delta_{\max} = \max_{i=1..k} \Delta\!\left(C_{\text{transition}}(o_t, a_t, o_{t+1}),\, g_t^i\right) = \max_{i=1..k} \frac{E\!\left[C_{\text{transition}}(o_t, a_t, o_{t+1})\right] \cdot E\!\left[g_t^i\right]}{\left\|E\!\left[C_{\text{transition}}(o_t, a_t, o_{t+1})\right]\right\| \left\|E\!\left[g_t^i\right]\right\|}$$

$$R_{\text{int}}(o_t, a_t, o_{t+1} \mid g_t^{1:k}) = \begin{cases} \Delta_{\max} & \Delta_{\max} > T \\ 0 & \Delta_{\max} \le T \end{cases}$$
where
- the caption of the transition is computed by a transition captioner $C_{\text{transition}}: \Omega \times A \times \Omega \mapsto \Sigma^*$
- the goals and captions are embedded into a semantic space by an LLM encoder (SentenceBERT) $E[\cdot]$
- the intrinsic reward $R_{\text{int}}$ is zeroed out below a similarity threshold hyperparameter $T$
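A minimal sketch of the reward computation, assuming `embed` is a batched sentence encoder; the threshold value is a placeholder, and the function also returns the index of the best-matching goal so the caller can update the achieved-goal filter (an assumption for convenience, not part of the reward definition):

```python
from typing import Callable, List, Tuple
import numpy as np

def intrinsic_reward(
    embed: Callable[[List[str]], np.ndarray],   # semantic encoder E[.]
    transition_caption: str,                    # C_transition(o_t, a_t, o_{t+1})
    goals: List[str],                           # LLM-suggested goals g_t^{1:k}
    threshold: float = 0.5,                     # similarity threshold T (placeholder value)
) -> Tuple[float, int]:
    """Max cosine similarity between the transition caption and any suggested
    goal, zeroed out below the threshold T."""
    if not goals:
        return 0.0, -1
    vecs = embed([transition_caption] + goals)   # embed caption and goals in one batch
    caption_vec, goal_vecs = vecs[0], vecs[1:]
    sims = goal_vecs @ caption_vec / (
        np.linalg.norm(goal_vecs, axis=1) * np.linalg.norm(caption_vec) + 1e-8
    )
    best = int(np.argmax(sims))
    delta_max = float(sims[best])
    return (delta_max, best) if delta_max > threshold else (0.0, best)
```

With the sentence-transformers library, `embed` could be `SentenceTransformer("all-MiniLM-L6-v2").encode`; the exact SentenceBERT variant used in the paper may differ.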