EMMmA

MESSENGER Environment

The MESSENGER environment is designed as a multi-task grid navigation game in which the agent needs to retrieve a message from a specific entity and deliver it to a target entity while avoiding the enemy entities.

Each instance of the environment comes with a task-relevant, free-form textual manual in which each item describes:

  1. the role of an entity (message, target or enemy)
  2. the movement dynamics of an entity (stationary, chasing or fleeing)

Unlike other similar environments, MESSENGER does not give the agent the mapping between manual items and entities, $\mathcal{M} : \mathcal{Z} \mapsto \mathcal{E}$, so the agent has to work out which description refers to which entity.
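To make the setup concrete, here is a minimal sketch of how one game instance could be represented. The class names, the entity names (`plane`, `dog`, `sword`) and the sentence template are all hypothetical and only illustrate the structure described above. They are not taken from the MESSENGER code base, whose manuals are free-form text.

```python
import random
from dataclasses import dataclass
from enum import Enum

class Role(Enum):
    MESSAGE = "message"
    TARGET = "target"
    ENEMY = "enemy"

class Movement(Enum):
    STATIONARY = "stationary"
    CHASING = "chasing"
    FLEEING = "fleeing"

@dataclass(frozen=True)
class Entity:
    name: str            # hypothetical entity symbol shown on the grid
    role: Role
    movement: Movement

def describe(entity):
    # Hypothetical template; real manual items are free-form sentences.
    return f"The {entity.name} is the {entity.role.value} and it is {entity.movement.value}."

def make_manual(entities, rng):
    """Return the manual as an unordered list of strings.

    Shuffling drops the item-to-entity alignment, so the position of an
    item in the list says nothing about which entity it describes.
    """
    items = [describe(e) for e in entities]
    rng.shuffle(items)
    return items

entities = [
    Entity("plane", Role.MESSAGE, Movement.FLEEING),
    Entity("dog", Role.TARGET, Movement.STATIONARY),
    Entity("sword", Role.ENEMY, Movement.CHASING),
]
manual = make_manual(entities, random.Random(0))
```

The agent would only see `manual` and the grid observation, never the `Entity` records themselves.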

EMMmA Model

For a MESSENGER game $(\mathcal{S},\ \mathcal{A},\ \Omega,\ \mathcal{P},\ \mathcal{O},\ \mathcal{R} \mid \mathcal{E},\ \mathcal{Z},\ \mathcal{M})$, the policy network $\pi_{\theta}(a_{t} \mid o_{t - k : t},\ \mathcal{Z})$ consists of the following components.

Text Encoder

Each manual item $z \in \mathcal{Z}$ is encoded into a sequence of token representations $t_{1:n}$ by a pretrained BERT-based model. The key vector $k_{z}$ and value vector $v_{z}$ of manual item $z$ are then obtained by attention-weighted pooling over the tokens:

$$
\begin{matrix} k_{z} = \sum_{i = 1}^{n} \alpha_{i} W_{k} t_{i} + b_{k} & \alpha = \operatorname{softmax}_{i}(u_{k} \cdot t_{i}) \\[7mm] v_{z} = \sum_{i = 1}^{n} \beta_{i} W_{v} t_{i} + b_{v} & \beta = \operatorname{softmax}_{i}(u_{v} \cdot t_{i}) \end{matrix}
$$
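The pooling step can be sketched in plain Python as follows. This is a toy illustration, not the model's implementation: the token vectors stand in for BERT outputs, all weights are made-up numbers, and the sketch assumes that the scoring vector $u_{k}$ is paired with the key and $u_{v}$ with the value.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_pool(tokens, u, W, b):
    """Compute sum_i w_i * (W t_i) + b with w = softmax_i(u . t_i)."""
    weights = softmax([dot(u, t_i) for t_i in tokens])
    pooled = list(b)  # the bias is added once, outside the sum
    for w_i, t_i in zip(weights, tokens):
        projected = [dot(row, t_i) for row in W]
        pooled = [p + w_i * x for p, x in zip(pooled, projected)]
    return pooled, weights

# Toy stand-ins for the token representations t_1..t_n of one manual item.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Made-up parameters (assumptions for illustration only).
W_k, b_k, u_k = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [0.0, 0.0]
W_v, b_v, u_v = [[2.0, 0.0], [0.0, 2.0]], [1.0, 1.0], [5.0, -5.0]

k_z, alpha = attention_pool(tokens, u_k, W_k, b_k)  # key: uniform weights here
v_z, beta = attention_pool(tokens, u_v, W_v, b_v)   # value: first token dominates
```

Because the key and the value use separate scoring vectors, each can attend to different tokens of the same sentence.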

Entity Query

Each entity $e \in \mathcal{E}$ is assigned an embedding $q_{e}$, which serves as a query vector for obtaining an adaptive representation $x_{e}$:

$$
x_{e} = \sum_{z \in \mathcal{Z}} \gamma_{z} v_{z} \quad \gamma = \operatorname{softmax}_{z} \left( \frac{q_{e} \cdot k_{z}}{\sqrt{d}} \right)
$$

The representation $x_{e}$ is then placed into a tensor $X \in \mathbb{R}^{h \times w \times d}$ at the position of $e$ on the grid map.
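The query step is a scaled dot-product attention over manual items, followed by writing the result into the grid. A minimal sketch with made-up keys, values and query is shown below. Filling empty cells with zeros is an assumption of this sketch, and the helper names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def query_manual(q_e, keys, values):
    """Return x_e = sum_z gamma_z v_z with gamma = softmax_z(q_e . k_z / sqrt(d))."""
    d = len(q_e)
    gamma = softmax([dot(q_e, k_z) / math.sqrt(d) for k_z in keys])
    x_e = [sum(g * v_z[j] for g, v_z in zip(gamma, values))
           for j in range(len(values[0]))]
    return x_e, gamma

def place_on_grid(h, w, d, placed):
    """Build X of shape h x w x d; cells without an entity stay zero (assumption)."""
    X = [[[0.0] * d for _ in range(w)] for _ in range(h)]
    for (row, col), x_e in placed:
        X[row][col] = list(x_e)
    return X

# Two manual items with made-up keys and values (d = 2).
keys = [[4.0, 0.0], [0.0, 4.0]]
values = [[1.0, 0.0], [0.0, 1.0]]

# A made-up entity embedding that aligns with the first key.
q_e = [3.0, 0.0]
x_e, gamma = query_manual(q_e, keys, values)

# The entity sits at row 1, column 2 of a 3 x 4 grid.
X = place_on_grid(3, 4, 2, [((1, 2), x_e)])
```

Since $\gamma$ is a soft assignment, an entity whose query matches one key strongly ends up with almost exactly that item's value vector, which is how the missing mapping can be learned implicitly.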

Action Output

The categorical distribution over actions is computed from the queried representation tensors:

$$
\pi_{\theta}(a_{t} \mid o_{t - k : t},\ \mathcal{Z}) = \operatorname{softmax}(\operatorname{FFN}(\operatorname{Flatten}(\operatorname{Conv2D}(X_{k}))))
$$

where the representation tensors of the $k$ most recent observations are concatenated along the channel dimension into $X_{k} \in \mathbb{R}^{h \times w \times kd}$.
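The pipeline above can be traced end to end with a small pure-Python sketch. Everything concrete here is an assumption made for illustration: random weights, a single 2x2 convolution with valid padding, an FFN with one ReLU hidden layer, and five actions. The source does not specify kernel sizes, layer counts or the size of the action space.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear(W, b, x):
    return [dot(row, x) + b_i for row, b_i in zip(W, b)]

def concat_frames(frames):
    """Concatenate k tensors of shape h x w x d along channels -> h x w x (k*d)."""
    h, w = len(frames[0]), len(frames[0][0])
    return [[[v for f in frames for v in f[r][c]] for c in range(w)] for r in range(h)]

def conv2d(X, kernels, bias):
    """Stride-1, valid-padding cross-correlation; kernels[o][i][j] is a channel vector."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    out = []
    for r in range(len(X) - kh + 1):
        out_row = []
        for c in range(len(X[0]) - kw + 1):
            out_row.append([
                b_o + sum(dot(K[i][j], X[r + i][c + j])
                          for i in range(kh) for j in range(kw))
                for K, b_o in zip(kernels, bias)
            ])
        out.append(out_row)
    return out

def flatten(T):
    return [v for row in T for cell in row for v in cell]

def policy(frames, params):
    """softmax(FFN(Flatten(Conv2D(X_k)))) with a one-hidden-layer FFN (assumption)."""
    X_k = concat_frames(frames)
    features = flatten(conv2d(X_k, params["kernels"], params["conv_bias"]))
    hidden = [max(0.0, v) for v in linear(params["W1"], params["b1"], features)]
    return softmax(linear(params["W2"], params["b2"], hidden))

rng = random.Random(0)

def rand(*shape):
    """Nested lists of the given shape filled with small random numbers."""
    if len(shape) == 1:
        return [rng.uniform(-0.5, 0.5) for _ in range(shape[0])]
    return [rand(*shape[1:]) for _ in range(shape[0])]

h, w, d, k, n_actions = 3, 3, 2, 2, 5          # hypothetical sizes
frames = [rand(h, w, d) for _ in range(k)]      # stand-ins for the k tensors X
params = {
    "kernels": rand(3, 2, 2, k * d),            # 3 output channels, 2x2 window
    "conv_bias": rand(3),
    "W1": rand(8, 2 * 2 * 3),                   # conv output is 2 x 2 x 3
    "b1": rand(8),
    "W2": rand(n_actions, 8),
    "b2": rand(n_actions),
}
probs = policy(frames, params)                  # one probability per action
```

An action would then be sampled from `probs`, for example with `rng.choices(range(n_actions), weights=probs)`.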


EMMmA
http://example.com/2024/09/16/EMMmA/
Author: 木辛
Posted on: September 16, 2024