MuZero

Planning

MuZero makes decisions via MCTS with an upper confidence bound (UCB), using a learned representation model, dynamics model, and prediction (policy and value) model

Each internal state $s$ in the search tree maintains a set of statistics for each action

| Statistics | Definition | Initialized by |
| --- | --- | --- |
| visit count | $N(s,\ a)$ | 0 |
| mean value | $Q(s,\ a)$ | 0 |
| policy | $P(s,\ a)$ | prediction model |
| reward | $R(s,\ a)$ | dynamics model |
| state transition | $S(s,\ a)$ | dynamics model |
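
For concreteness, the statistics above can be kept in a small dict-based node structure, as in the sketch below; the layout and names are illustrative and not taken from the official pseudocode

```python
def new_node(hidden_state, priors):
    """Create a search-tree node for an internal state s.

    `hidden_state` is the latent state produced by the representation or
    dynamics model; `priors` maps each legal action a to P(s, a) from the
    prediction model. All other statistics start at their initial values.
    """
    return {
        "state": hidden_state,           # latent state s
        "N": {a: 0 for a in priors},     # visit counts N(s, a), initialized to 0
        "Q": {a: 0.0 for a in priors},   # mean values Q(s, a), initialized to 0
        "P": dict(priors),               # priors P(s, a) from the prediction model
        "R": {},                         # rewards R(s, a), filled in by the dynamics model
        "S": {},                         # state transitions S(s, a) -> child node
    }
```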

Each simulation starts from the current state $s^{0} = s_{t}$ and finishes at a leaf node $s^{l}$. The search is divided into three phases:

Selection

For each hypothetical step $k = 1,\ 2,\ \cdots,\ l$, an action is selected by maximizing the UCB score (pUCT)

$$
a^{k} = \argmax_{a} \left[ Q(s^{k - 1},\ a) + P(s^{k - 1},\ a) \frac{\sqrt{\sum_{b} N(s^{k - 1},\ b)}}{1 + N(s^{k - 1},\ a)} \left( c_{1} + \log \frac{\sum_{b} N(s^{k - 1},\ b) + c_{2} + 1}{c_{2}} \right) \right]
$$

To allow the combination of value and policy in the pUCT rule, the mean value is min-max normalized over the whole search tree as

$$
\bar{Q}(s^{k - 1},\ a) = \frac{Q(s^{k - 1},\ a) - \min_{s,\ a \in \mathrm{Tree}} Q(s,\ a)}{\max_{s,\ a \in \mathrm{Tree}} Q(s,\ a) - \min_{s,\ a \in \mathrm{Tree}} Q(s,\ a)}
$$

When $k < l$, the next state $s^{k}$ and reward $r^{k}$ are looked up in the state transition and reward tables of state $s^{k - 1}$, i.e. $s^{k} = S(s^{k - 1},\ a^{k})$ and $r^{k} = R(s^{k - 1},\ a^{k})$
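
A minimal sketch of this selection step, reusing the node layout from the earlier sketch; $c_1$ and $c_2$ are the exploration constants of the pUCT formula (the MuZero paper reports $c_1 = 1.25$ and $c_2 = 19652$), and the min/max used for normalization are tracked over the whole tree

```python
import math

def normalize_q(q, q_min, q_max):
    """Min-max normalize a mean value with the extremes observed in the tree."""
    return (q - q_min) / (q_max - q_min) if q_max > q_min else q

def select_action(node, q_min, q_max, c1=1.25, c2=19652.0):
    """One pUCT selection step at `node` (dict layout from the earlier sketch)."""
    total_visits = sum(node["N"].values())  # sum_b N(s, b)

    def ucb(a):
        prior_term = (node["P"][a] * math.sqrt(total_visits) / (1 + node["N"][a])
                      * (c1 + math.log((total_visits + c2 + 1) / c2)))
        return normalize_q(node["Q"][a], q_min, q_max) + prior_term

    return max(node["P"], key=ucb)

# Usage with a hand-built three-action node.
node = {"state": None, "N": {0: 4, 1: 1, 2: 0}, "Q": {0: 0.6, 1: 0.2, 2: 0.0},
        "P": {0: 0.5, 1: 0.3, 2: 0.2}, "R": {}, "S": {}}
print(select_action(node, q_min=0.0, q_max=0.6))  # picks the well-visited, high-value action 0
```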

Expansion

At the final time-step $l$ of the simulation, the state and reward are computed by the learned dynamics model

$$
s^{l},\ r^{l} = g_{\theta}(s^{l - 1},\ a^{l})
$$

The state transition and reward tables of state $s^{l - 1}$ are updated as $S(s^{l - 1},\ a^{l}) = s^{l}$ and $R(s^{l - 1},\ a^{l}) = r^{l}$. State $s^{l}$ is added to the search tree with its policy table initialized by the prediction model, $P(s^{l},\ a) = \boldsymbol{p}^{l}$, where $\boldsymbol{p}^{l},\ v^{l} = f_{\theta}(s^{l})$
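
A sketch of the expansion step under the same node layout; `dynamics_fn` and `prediction_fn` are placeholder callables standing in for $g_{\theta}$ and $f_{\theta}$

```python
def expand(parent, action, dynamics_fn, prediction_fn):
    """Expand the leaf edge (s^{l-1}, a^l) with one call to each learned model."""
    child_state, reward = dynamics_fn(parent["state"], action)  # s^l, r^l = g_theta(s^{l-1}, a^l)
    priors, value = prediction_fn(child_state)                  # p^l, v^l = f_theta(s^l)

    child = {
        "state": child_state,
        "N": {a: 0 for a in priors},     # fresh statistics for the new node
        "Q": {a: 0.0 for a in priors},
        "P": dict(priors),               # P(s^l, a) = p^l
        "R": {},
        "S": {},
    }
    parent["S"][action] = child          # S(s^{l-1}, a^l) = s^l
    parent["R"][action] = reward         # R(s^{l-1}, a^l) = r^l
    return child, value                  # v^l is needed in the backup phase
```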

Backup

For $k = l,\ l - 1,\ \cdots,\ 1$, the mean value and visit count of each edge $(s^{k - 1},\ a^{k})$ on the simulated trajectory are updated as

$$
\begin{gathered} N(s^{k - 1},\ a^{k}) \leftarrow N(s^{k - 1},\ a^{k}) + 1 \\[5mm] Q(s^{k - 1},\ a^{k}) \leftarrow Q(s^{k - 1},\ a^{k}) + \frac{1}{N(s^{k - 1},\ a^{k})} \Big[ G^{k} - Q(s^{k - 1},\ a^{k}) \Big] \end{gathered}
$$

where $G^{k}$ is the $(l - k)$-step estimate of the cumulative discounted reward, bootstrapped from the value $v^{l}$ of the prediction model

$$
G^{k} = \sum_{\tau = 0}^{l - k - 1} \gamma^{\tau} R(s^{k + \tau},\ a^{k + \tau + 1}) + \gamma^{l - k} v^{l}
$$
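
The formula above gives the recursion $G^{l} = v^{l}$ and $G^{k} = R(s^{k},\ a^{k + 1}) + \gamma G^{k + 1}$, so the backup can be written as a single backward pass over the simulated path; the sketch below assumes `path` holds the $(s^{k - 1},\ a^{k})$ pairs from the root to the leaf and `leaf_value` is $v^{l}$

```python
def backup(path, leaf_value, discount):
    """Update N and Q for every edge (s^{k-1}, a^k) on the simulated path.

    `path` is the list of (node, action) pairs visited from the root to the
    leaf (node layout as in the earlier sketches) and `discount` is gamma.
    """
    g = leaf_value            # G^l = v^l
    deeper_reward = None      # reward of the edge processed in the previous iteration
    for node, action in reversed(path):                   # k = l, l-1, ..., 1
        if deeper_reward is not None:
            g = deeper_reward + discount * g              # G^k = r^{k+1} + gamma * G^{k+1}
        node["N"][action] += 1                            # N(s^{k-1}, a^k) += 1
        node["Q"][action] += (g - node["Q"][action]) / node["N"][action]
        deeper_reward = node["R"][action]                 # r^k, used by the edge above
```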

After a certain number of simulations, MCTS outputs an estimated value $\nu_{t}$ and a recommended policy $\pi_{t}(\cdot)$ based on the visit counts of the root node

$$
\pi_{t}(a) = \frac{N(s^{0},\ a)^{1 / T}}{\sum_{b} N(s^{0},\ b)^{1 / T}}
$$

where the temperature parameter $T$ is used during training and decays from 1 with the number of training steps. This ensures that the action selection becomes greedier as training progresses
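
As a short sketch, the root visit counts can be turned into $\pi_{t}$ and an action sampled from it as follows, with `temperature` playing the role of $T$

```python
import random

def root_policy(visit_counts, temperature):
    """Compute pi_t(a) proportional to N(s^0, a)^(1/T)."""
    if temperature == 0:                              # greedy limit: argmax of visit counts
        best = max(visit_counts, key=visit_counts.get)
        return {a: float(a == best) for a in visit_counts}
    weights = {a: n ** (1.0 / temperature) for a, n in visit_counts.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

# T = 1 early in training (proportional to the counts), smaller T later (greedier).
pi = root_policy({0: 20, 1: 8, 2: 2}, temperature=1.0)
action = random.choices(list(pi), weights=list(pi.values()))[0]
```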

Training

The model of MuZero $\mu_{\theta}$ consists of a representation model, a dynamics model and a prediction model

| SubPart | Type | Description | Definition |
| --- | --- | --- | --- |
| representation | world model | encodes the past observations | $s^{0} = h_{\theta}(o_{1},\ o_{2},\ \cdots,\ o_{t})$ |
| dynamics | world model | dynamics and reward on internal state | $s^{k},\ r^{k} = g_{\theta}(s^{k - 1},\ a^{k})$ |
| prediction | policy | policy and value on internal state | $\boldsymbol{p}^{k},\ v^{k} = f_{\theta}(s^{k})$ |

A trajectory is sampled from the replay buffer for training and the model is unrolled recurrently for $K$ steps

$$
s^{0} = h_{\theta}(o_{1},\ o_{2},\ \cdots,\ o_{t}) \qquad s^{k},\ r^{k} = g_{\theta}(s^{k - 1},\ a^{k}) \qquad \boldsymbol{p}^{k},\ v^{k} = f_{\theta}(s^{k})
$$
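
This $K$-step unroll can be sketched as below; the three callables are placeholders for $h_{\theta}$, $g_{\theta}$ and $f_{\theta}$, and the returned per-step predictions are the quantities the loss below is applied to

```python
def unroll(representation_fn, dynamics_fn, prediction_fn, observations, actions):
    """Unroll the model for K = len(actions) steps along a sampled trajectory."""
    state = representation_fn(observations)          # s^0 = h_theta(o_1, ..., o_t)
    policy, value = prediction_fn(state)             # p^0, v^0 = f_theta(s^0)
    outputs = [(policy, value, None)]                # no reward prediction at k = 0
    for action in actions:                           # actions a_{t+1}, ..., a_{t+K} from the trajectory
        state, reward = dynamics_fn(state, action)   # s^k, r^k = g_theta(s^{k-1}, a^k)
        policy, value = prediction_fn(state)         # p^k, v^k = f_theta(s^k)
        outputs.append((policy, value, reward))      # (p^k, v^k, r^k)
    return outputs
```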

The model $\mu_{\theta} = \{ h_{\theta},\ g_{\theta},\ f_{\theta} \}$ is trained jointly to accurately match the policy, value, and reward on a trajectory

$$
\ell(\theta) = \sum_{k = 0}^{K} \ell^{r}(u_{t + k},\ r_{t}^{k}) + \ell^{v}(z_{t + k},\ v_{t}^{k}) + \ell^{p}(\pi_{t + k},\ \boldsymbol{p}_{t}^{k}) + c \| \theta \|_{2}^{2}
$$

where the value target $z_{t}$ is computed from intermediate rewards and $n$-step bootstrapping of the search value

$$
z_{t} = u_{t + 1} + \gamma u_{t + 2} + \cdots + \gamma^{n - 1} u_{t + n} + \gamma^{n} \nu_{t + n}
$$
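
A sketch of this target; the indexing convention (`rewards[i]` holds $u_{i + 1}$, `root_values[i]` holds $\nu_{i}$) and the end-of-episode truncation are illustrative assumptions

```python
def n_step_target(rewards, root_values, t, n, discount):
    """z_t = u_{t+1} + gamma*u_{t+2} + ... + gamma^{n-1}*u_{t+n} + gamma^n * nu_{t+n}."""
    horizon = min(n, len(rewards) - t)               # truncate near the end of the episode
    target = sum(discount ** k * rewards[t + k] for k in range(horizon))
    if t + horizon < len(root_values):               # bootstrap with nu_{t+n} if it exists
        target += discount ** horizon * root_values[t + horizon]
    return target
```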

The latest checkpoint of the network is used to play games with MCTS, generating the training data stored in the replay buffer
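
A minimal sketch of this data-generation loop, assuming a gym-style environment (`reset`/`step`) and a `run_mcts` callable that returns the search policy $\pi_{t}$ and root value $\nu_{t}$; both names are placeholders, not part of MuZero's published interface

```python
import random

def self_play_game(env, run_mcts, max_moves=500):
    """Play one game with the latest checkpoint and record replay-buffer data."""
    trajectory = []                                    # (o_t, a_t, pi_t, u_{t+1}, nu_t)
    observation = env.reset()
    for _ in range(max_moves):
        pi, nu = run_mcts(observation)                 # search at the current state
        action = random.choices(list(pi), weights=list(pi.values()))[0]
        next_observation, reward, done = env.step(action)
        trajectory.append((observation, action, pi, reward, nu))
        observation = next_observation
        if done:
            break
    return trajectory
```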

