RAP
LLM World Model
RAP repurposes the LLM as an internal world model, which enables a problem-specific definition of states and actions (e.g., in Blocksworld a state is the current configuration of blocks and an action is a block-moving instruction, while in math reasoning a state holds intermediate variable values and an action poses a sub-question)
Figure: RAP applied to Blocksworld Planning, Math Reasoning, and Logical Reasoning.
The default policy $p(a_t \mid s_t, c)$ and the dynamics function $p(s_{t+1} \mid s_t, a_t, c')$ are both modeled by the generative LLM, where $c$ and $c'$ are task-specific prompts that instruct the LLM to behave as the policy and the dynamics, respectively

Compared to previous reasoning methods like CoT, augmenting the reasoning process with states predicted by the LLM acting as an internal world model makes the inference more grounded and coherent
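As a rough illustration of how a single LLM can play both roles, here is a minimal sketch assuming a generic `generate(prompt) -> str` completion interface; the prompt templates and function names are illustrative placeholders, not the actual RAP prompts.

```python
# Minimal sketch of the LLM-as-world-model idea (illustrative, not the RAP code).
# `generate` is an assumed text-completion interface: prompt in, continuation out.
from typing import Callable, List

Generate = Callable[[str], str]

POLICY_PROMPT = (
    "You are solving a step-by-step reasoning problem.\n"
    "Current state:\n{state}\n"
    "Propose the next action (one short step):"
)

DYNAMICS_PROMPT = (
    "You are simulating the effect of an action on the problem state.\n"
    "Current state:\n{state}\n"
    "Action:\n{action}\n"
    "Describe the resulting state:"
)

def sample_actions(generate: Generate, state: str, n: int = 4) -> List[str]:
    """Default policy: sample n candidate actions from the LLM."""
    return [generate(POLICY_PROMPT.format(state=state)) for _ in range(n)]

def predict_next_state(generate: Generate, state: str, action: str) -> str:
    """Dynamics: the same LLM, with a different prompt, predicts the next state."""
    return generate(DYNAMICS_PROMPT.format(state=state, action=action))
```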
Reward Assessment
Similarly, the reward function can be specified in different ways depending on the reasoning problem (sketches of these rewards follow the list below)
- likelihood of action
  - incorporate the log probability of the action as the reward
  - the probability of the specific action reflects the LLM's preference for it
- confidence of state
  - draw multiple predicted states from the world model
  - use the proportion of the most frequent result (the confidence) as the reward
  - higher confidence indicates that the state prediction is more consistent with the LLM's knowledge
- self-evaluation by the LLM
  - use the LLM to criticize itself with the question "Is this reasoning step correct?"
  - use the next-word probability of the token "Yes" as the reward - this evaluates the LLM's own estimation of the correctness of the reasoning step
- task-specific heuristics
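The sketch below illustrates the first three reward choices under assumed scoring interfaces (`sequence_logprob`, `token_prob`, and a `predict_next_state` world-model call); these names are placeholders, not part of any specific RAP codebase.

```python
# Illustrative sketches of three LLM-based rewards; the scoring interfaces
# (`sequence_logprob`, `token_prob`, `predict_next_state`) are assumed.
from collections import Counter
from typing import Callable, List

def action_likelihood_reward(sequence_logprob: Callable[[str, str], float],
                             state: str, action: str) -> float:
    """Reward = log p(action | state) under the LLM (its preference for the action)."""
    return sequence_logprob(state, action)

def state_confidence_reward(predict_next_state: Callable[[str, str], str],
                            state: str, action: str, n_samples: int = 8) -> float:
    """Reward = fraction of sampled next-state predictions that agree with the mode."""
    samples: List[str] = [predict_next_state(state, action) for _ in range(n_samples)]
    _, count = Counter(samples).most_common(1)[0]
    return count / n_samples

def self_evaluation_reward(token_prob: Callable[[str, str], float],
                           state: str, action: str) -> float:
    """Reward = p('Yes') when the LLM is asked whether the reasoning step is correct."""
    question = f"{state}\n{action}\nIs this reasoning step correct? Answer Yes or No:"
    return token_prob(question, " Yes")
```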
MCTS Planning
RAP adopts MCTS to strategically explore the reasoning space and balance exploration and exploitation

Each node of the search tree maintains statistics such as the state-action value function $Q(s, a)$ and the visit count $N(s)$
Figure: the four phases of MCTS - Selection, Expansion, Simulation, Backup.
The reasoning process repeats the following phases until a specified computational budget is exhausted (a compact sketch of the full loop follows this list)
- selection
  - an action is selected at each level of the reasoning tree via its UCB value until a leaf node is encountered
  - the exploration weight $w$ controls the balance between exploration and exploitation
- expansion
  - sample possible actions from the LLM policy rather than enumerating all actions
  - use the LLM world model to predict the respective next states for the sampled actions
- simulation
  - use a lightweight rollout policy and reward assessment to perform a quick simulation
  - the reasoning tree is recursively expanded at each level until a terminal state is reached
- backup
  - a reasoning path from the root node to a terminal node is obtained from the previous phases
  - the value $Q$ and visit count $N$ of each node on the reasoning path are updated
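Selection typically uses the standard UCT rule, $a^{*} = \arg\max_{a} \big[\, Q(s, a) + w \sqrt{\ln N(s) / N(c(s, a))} \,\big]$, where $w$ is the exploration weight and $c(s, a)$ is the child reached by taking $a$ in $s$. Below is a compact, self-contained sketch of one MCTS iteration over the four phases; the reward aggregation, rollout policy, and constants are simplifying assumptions rather than the exact RAP implementation.

```python
# Compact sketch of one MCTS iteration (illustrative; RAP differs in details
# such as reward aggregation during backup and the rollout policy).
import math
from typing import List, Optional

class Node:
    def __init__(self, state: str, parent: Optional["Node"] = None, reward: float = 0.0):
        self.state = state
        self.parent = parent
        self.reward = reward          # immediate reward for reaching this node
        self.children: List["Node"] = []
        self.N = 0                    # visit count
        self.Q = 0.0                  # value estimate

def uct_select(node: Node, w: float = 1.0) -> Node:
    """Selection: pick the child maximizing Q + w * sqrt(ln N(parent) / N(child))."""
    return max(
        node.children,
        key=lambda c: c.Q + w * math.sqrt(math.log(node.N + 1) / (c.N + 1e-8)),
    )

def mcts_iteration(root: Node, sample_actions, predict_next_state, reward_fn,
                   is_terminal, max_depth: int = 10, n_actions: int = 4) -> None:
    # 1) Selection: descend via UCT until a leaf is reached.
    node = root
    path = [node]
    while node.children:
        node = uct_select(node)
        path.append(node)

    # 2) Expansion: sample actions from the LLM policy and predict next states.
    if not is_terminal(node.state):
        for action in sample_actions(node.state, n=n_actions):
            next_state = predict_next_state(node.state, action)
            node.children.append(Node(next_state, parent=node,
                                      reward=reward_fn(node.state, action)))

    # 3) Simulation: quick rollout with a lightweight policy and the same rewards.
    rollout_return, state, depth = 0.0, node.state, len(path)
    while not is_terminal(state) and depth < max_depth:
        action = sample_actions(state, n=1)[0]
        rollout_return += reward_fn(state, action)
        state = predict_next_state(state, action)
        depth += 1

    # 4) Backup: update Q and N of every node on the selected path.
    ret = rollout_return
    for nd in reversed(path):
        ret += nd.reward
        nd.N += 1
        nd.Q += (ret - nd.Q) / nd.N   # running mean of returns
```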
The final reasoning trace is selected from the constructed tree; this can be implemented in several ways (a small extraction example follows this list)
- choose the action with the highest value iteratively until reaching a terminal node
- select the path from the iterations that yielded the highest reward
- choose the leaf node and the respective root-to-leaf path that has been visited the most
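For instance, the first option (greedily following the highest-value child) could look like the snippet below, reusing the `Node` class from the MCTS sketch above.

```python
# Extract the final trace by greedily following the highest-value child from the root.
def extract_trace(root: "Node") -> list:
    path = [root]
    node = root
    while node.children:
        node = max(node.children, key=lambda c: c.Q)
        path.append(node)
    return [n.state for n in path]
```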