RAP


LLM World Model

RAP repurposes the LLM as an internal world model, which enables a problem-specific definition of states and actions

Example tasks: Blocksworld planning, math reasoning, and logical reasoning

The default policy $\pi(a_{t} \mid s_{t}, c)$ and the dynamics function $p(s_{t+1} \mid s_{t}, a_{t}, c')$ are both modeled by a generative LLM, where $c$ and $c'$ are task-specific prompts that make the LLM behave as the policy and the dynamics model, respectively

Compared to previous reasoning methods such as CoT, augmenting the reasoning process with states predicted by the LLM acting as an internal world model yields more grounded and coherent inference
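A minimal sketch of this idea, assuming a generic `llm_sample(prompt, n)` interface (the prompt templates, function names, and interface are illustrative assumptions, not the paper's implementation): the same LLM is prompted once to act as the policy and once to act as the dynamics model.

```python
from typing import Callable, List

# Assumed interface (not from the paper): takes a prompt and a sample count,
# returns that many sampled completions from the LLM.
LLMSampler = Callable[[str, int], List[str]]

POLICY_PROMPT = "{task_prompt}\nCurrent state:\n{state}\nPropose the next action:"
DYNAMICS_PROMPT = (
    "{task_prompt}\nCurrent state:\n{state}\nAction taken:\n{action}\n"
    "Describe the resulting state:"
)

def propose_actions(llm_sample: LLMSampler, task_prompt: str,
                    state: str, d: int) -> List[str]:
    """Use the LLM as the policy pi(a | s, c): sample d candidate actions."""
    prompt = POLICY_PROMPT.format(task_prompt=task_prompt, state=state)
    return llm_sample(prompt, d)

def predict_next_state(llm_sample: LLMSampler, task_prompt: str,
                       state: str, action: str) -> str:
    """Use the LLM as the dynamics p(s' | s, a, c'): predict the next state."""
    prompt = DYNAMICS_PROMPT.format(task_prompt=task_prompt,
                                    state=state, action=action)
    return llm_sample(prompt, 1)[0]
```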

Reward Assessment

Similarly, the reward function $r(s_{t}, a_{t})$ can be specified in different ways depending on the reasoning problem (a code sketch of these options follows the list below)

  1. likelihood of action
    1. incorporate the log probability of the action as the reward $r(s_{t}, a_{t}) = \log \pi(a_{t} \mid s_{t})$
    2. the probability of the specific action reflects the LLM’s preference
  2. confidence of state
    1. draw multiple predicted states $s_{t+1}$ from the world model, $s_{t+1} \sim p(\cdot \mid s_{t}, a_{t})$
    2. use the proportion of the most frequent result (confidence) as the reward
    3. higher confidence indicates that the state prediction is more consistent with the knowledge of LLMs
  3. self-evaluation by the LLM
    1. use the LLM to criticize itself with the question "Is this reasoning step correct?"
    2. use the next-word probability of the token "Yes" as the reward
    3. this evaluates the LLM's own estimation of the correctness of its reasoning
  4. task-specific heuristics
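The first three options can be sketched as follows, assuming hypothetical `logprob` and `sample` interfaces for the LLM (the names, prompt wording, and sample count are illustrative assumptions, not the paper's implementation).

```python
import math
from collections import Counter
from typing import Callable, List

# Assumed interfaces (placeholders, not from the paper):
#   logprob(prompt, continuation) -> total log-probability of `continuation`
#   sample(prompt, n)             -> n sampled completions
LogProbFn = Callable[[str, str], float]
SampleFn = Callable[[str, int], List[str]]

def action_likelihood_reward(logprob: LogProbFn, state: str, action: str) -> float:
    # Option 1: r(s_t, a_t) = log pi(a_t | s_t), the LLM's preference for the action
    return logprob(state, action)

def state_confidence_reward(sample: SampleFn, state_action_prompt: str,
                            n: int = 8) -> float:
    # Option 2: draw n next-state predictions and use the share of the most
    # frequent prediction as a confidence score in (0, 1]
    predictions = sample(state_action_prompt, n)
    most_common_count = Counter(predictions).most_common(1)[0][1]
    return most_common_count / n

def self_eval_reward(logprob: LogProbFn, reasoning_step: str) -> float:
    # Option 3: ask the LLM to judge its own step and use the probability of "Yes"
    question = reasoning_step + "\nIs this reasoning step correct? Answer Yes or No: "
    return math.exp(logprob(question, "Yes"))
```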

MCTS Planning

RAP adopts MCTS to strategically explore the reasoning space while balancing exploration and exploitation

Each internal node of the search tree maintains statistics such as the state-value function $Q(s, a)$ and the visit count $N(s)$
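For concreteness, a node could be represented roughly like this (a hypothetical layout with assumed field names; the paper only requires that Q values and visit counts be tracked per node).

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """One possible node record for the reasoning tree (field names assumed)."""
    state: str                       # state s_t predicted by the LLM world model
    action: Optional[str] = None     # action a_{t-1} that led to this state
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    reward: float = 0.0              # immediate reward r(s_{t-1}, a_{t-1})
    q_value: float = 0.0             # running estimate of Q(s, a)
    visit_count: int = 0             # N, used by the UCB exploration term
```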

The reasoning process iterates over the following four phases (selection, expansion, simulation, backup) until a specified computational budget is exhausted; the selection and backup phases are also sketched in code after the list

  1. selection
    1. an action is selected at each level of the reasoning tree according to its UCB value until a leaf node is encountered

    $$a^{\star} = \argmax_{a \in A(s)} \left[ Q(s, a) + w \sqrt{\frac{\ln N(s)}{N(c(s, a))}} \right]$$

    2. the exploration weight $w$ controls the balance between exploration and exploitation
  2. expansion
    1. sample $d$ possible actions $a^{(1:d)}$ from the LLM policy $\pi(a \mid s, c)$ rather than enumerating all actions
    2. use the LLM world model $p(s' \mid s, a)$ to predict the respective next state for each sampled action
  3. simulation
    1. use a lightweight rollout policy and reward assessment to perform quick simulations
    2. the reasoning tree is recursively expanded at each level until a terminal state is reached
  4. backup
    1. a reasoning path $\{ s_{0:T}, a_{0:T-1} \}$ from the root node to a terminal node is obtained from the previous phases
    2. the state-value function $Q(s_{t}, a_{t})$ of each node on the reasoning path is updated
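Continuing the hypothetical `Node` record above, the selection and backup phases might look like the sketch below; the UCB rule matches the formula above, while the Q update uses a running mean of accumulated returns, which is one reasonable aggregation but not necessarily the paper's exact choice.

```python
import math
from typing import List

def ucb_select(node, w: float = 1.0):
    """Selection: descend from the root, picking the child that maximizes
    Q(s, a) + w * sqrt(ln N(s) / N(c(s, a))), until a leaf is reached."""
    while node.children:
        node = max(
            node.children,
            key=lambda c: c.q_value
            + w * math.sqrt(math.log(max(node.visit_count, 1)) / max(c.visit_count, 1)),
        )
    return node

def backup(path: List, terminal_reward: float) -> None:
    """Backup: walk the root-to-leaf path in reverse, updating visit counts and
    Q estimates with the accumulated return."""
    ret = terminal_reward
    for node in reversed(path):
        node.visit_count += 1
        ret += node.reward                                        # accumulate rewards
        node.q_value += (ret - node.q_value) / node.visit_count   # running mean of returns
```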

The final reasoning trace is then selected from the constructed tree, which can be implemented in several ways (the first option is sketched after the list)

  1. choose the action with the highest $Q$ value iteratively until reaching a terminal node
  2. select the path from the iterations that yielded the highest reward
  3. choose the leaf node and the respective root-to-leaf path that has been visited the most
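For example, the first option could look like this, again assuming the hypothetical `Node` record sketched earlier.

```python
from typing import List

def extract_trace_by_q(root) -> List:
    """Option 1: from the root, repeatedly follow the child with the highest
    Q value until a terminal (childless) node is reached."""
    path = [root]
    node = root
    while node.children:
        node = max(node.children, key=lambda c: c.q_value)
        path.append(node)
    return path
```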
