Many complex problems cannot be solved in a single turn. Agents need to interact with the environment to obtain useful feedback from other models or tools involved in the system, enabling iterative refinement and exploration of multiple solution paths.
Multi-Turn MAGRPO#
MAGRPO in the multi-turn setting (MAGRPO-MT) expands rollouts into a tree structure in which branches represent different joint responses (TreeRPO).
In each episode, a task is sampled from the dataset to construct initial observations \( \mathbf{o}_0=\{o_{1, 0}, \cdots, o_{n, 0}\} \) and histories \( \mathbf{h}_0=\{h_{1, 0}, \cdots, h_{n, 0}\} \) for all agents. At each turn, agents generate a group of joint responses \( \mathbf{a}^{\mathcal{G}}_t\gets\boldsymbol{\pi}^{\mathcal{G}}(\cdot|\mathbf{h}_t) \) from their current observation-action history \( \mathbf{h}_t \), with each joint response initiating a distinct rollout. Agents receive a joint reward \( r^{(g)}_{t} \) for each joint response \( \mathbf{a}^{(g)}_{t} \in \mathbf{a}^{\mathcal{G}}_{t} \), computed from the accumulated history and the current action. Each rollout then evolves independently, producing new joint observations \( \mathbf{o}^{\mathcal{G}}_{t+1} \) as the environment dynamics unfold and spawning further rollouts at the next turn \( t+1 \). This process continues until the terminal turn \( H \) is reached.
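A minimal sketch of this tree-structured expansion is shown below. The callables generate_joint_responses, reward_fn, and env_step are hypothetical stand-ins for the joint policy, reward function, and environment dynamics, not part of any released API:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Rollout:
    histories: List[List[str]]                        # one observation-action history per agent
    rewards: List[float] = field(default_factory=list)

def expand_rollouts(rollouts, generate_joint_responses, reward_fn, env_step, num_generations):
    """Expand each rollout into `num_generations` child rollouts for one turn."""
    children = []
    for rollout in rollouts:
        # Each joint response a^(g)_t sampled from the joint policy starts its own branch.
        for joint_response in generate_joint_responses(rollout.histories, num_generations):
            reward = reward_fn(rollout.histories, joint_response)    # joint reward r^(g)_t
            next_obs = env_step(rollout.histories, joint_response)   # joint observations o_{t+1}
            new_histories = [h + [a, o] for h, a, o in zip(rollout.histories, joint_response, next_obs)]
            children.append(Rollout(new_histories, rollout.rewards + [reward]))
    return children

def run_episode(initial_histories, generate_joint_responses, reward_fn, env_step,
                num_generations, horizon):
    """Grow the rollout tree from a single root until the terminal turn H."""
    rollouts = [Rollout([list(h) for h in initial_histories])]
    for _ in range(horizon):
        rollouts = expand_rollouts(rollouts, generate_joint_responses,
                                   reward_fn, env_step, num_generations)
    return rollouts  # leaves of the tree, each carrying its per-turn joint rewards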
Joint Mode#
MAGRPO supports two modes for forming joint responses at each turn:
- Align: Allows any number of generations per turn, providing flexibility in how many joint responses are formed. However, generations are not fully utilized, since only index-aligned responses across agents are combined. Over \( T \) turns with \( N \) agents, the total number of leaves grows as \( G^T \), where \( G \) is the number of generations per turn.
- Cross: Forms the Cartesian product of all agents' responses, fully utilizing the generations and yielding more samples for more accurate value estimation. Over \( T \) turns with \( N \) agents, the total number of leaves grows as \( G^{N \cdot T} \), with each node having \( G^N \) sibling joint actions. The sketch below illustrates both modes.
Note that only responses originating from the same rollout can be combined, as rollouts evolve independently.
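A minimal illustration of the two joint modes within a single rollout; the agent responses are placeholder strings, not real generations:

from itertools import product

# Two agents (N = 2), G = 3 generations each, all from the same parent rollout.
agent1_responses = ["a1_g1", "a1_g2", "a1_g3"]
agent2_responses = ["a2_g1", "a2_g2", "a2_g3"]

# Align: combine generations index-by-index -> G joint responses per node.
align_joint = list(zip(agent1_responses, agent2_responses))       # 3 joint responses

# Cross: Cartesian product of all agents' generations -> G^N joint responses per node.
cross_joint = list(product(agent1_responses, agent2_responses))   # 9 joint responses

print(len(align_joint), len(cross_joint))  # 3 9
# Over T turns the leaf count grows as G^T (Align) versus G^(N*T) (Cross).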
Environment Transition#
External feedback mechanisms control how environment observations are incorporated into prompts for subsequent turns.
Custom External Feedback#
Users can implement custom external feedback by defining a function with the following interface:
Custom External Feedback Interface:
- prompt: Original task prompt/problem description (required)
- agent_completions: List or tuple of completions from the previous turn, one per agent (required)
- num_agents: Number of agents in the system (required)
- prompt_history_per_agent: List of prompt histories for each agent, where each history is a list of prompts from previous turns (optional)
- response_history_per_agent: List of response histories for each agent, where each history is a list of responses from previous turns (optional)
- **kwargs: Additional mode-specific parameters (optional)

The function should return a list or tuple of formatted prompts for the next turn, one for each agent.
For example:
from typing import List, Optional

def custom_external(
    prompt: str,
    agent_completions: List[str],
    num_agents: int,
    prompt_history_per_agent: Optional[List[List[str]]] = None,
    response_history_per_agent: Optional[List[List[str]]] = None,
    **kwargs,
) -> List[str]:
    # Custom logic to format next-turn prompts.
    # Access environment feedback, tool outputs, etc.
    next_turn_prompts = []
    for i in range(num_agents):
        # Format the prompt for agent i based on its history and feedback.
        next_prompt = f"{prompt}\nPrevious attempt: {agent_completions[i]}\nPlease revise."
        next_turn_prompts.append(next_prompt)
    return next_turn_prompts

This interface allows full flexibility in how environment feedback, tool outputs, or other contextual information is integrated into the multi-turn training loop.
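For instance, the function above could be called directly to preview the next-turn prompts; the task text and completions here are purely illustrative:

next_prompts = custom_external(
    prompt="Implement a function that merges two sorted lists.",
    agent_completions=["def merge(a, b): ...", "def merge(xs, ys): ..."],
    num_agents=2,
)
for p in next_prompts:
    print(p)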
Example Modes (Expert, Diagnosis, and Self-Improvement)#
The code generation environment includes three example external transition modes:
- external.mode=expert_edits: Uses an external LLM (default: DeepSeek-Coder) to propose code edits. Follow-up prompts include the edit suggestions together with context from previous turns. The expert can be configured via expert_model (e.g., Claude, GPT) when API keys are available.
- external.mode=level_feedback: Runs static AST checks and dynamically executes the code to provide a diagnosis. By default the sandbox runs only the first test; this is configurable via sandbox_slice, which can include all tests (0, None, or 'all') or a specific number of tests (negative values are also accepted).
- external.mode=plain: Self-improvement mode that simply includes the prompts and responses from previous turns together with a revision instruction.
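As a rough sketch (not the shipped implementation), a level_feedback-style transition could pair a static AST check with a run of the first test. The names diagnose, level_feedback_like, and the first_test kwarg are illustrative assumptions:

import ast
import traceback

def diagnose(code: str, first_test: str) -> str:
    """Return a short diagnosis for one agent's completion."""
    try:
        ast.parse(code)                      # static check: does the code parse?
    except SyntaxError as exc:
        return f"Syntax error: {exc}"
    try:
        namespace = {}
        exec(code, namespace)                # dynamic check: define the solution ...
        exec(first_test, namespace)          # ... then run the first test
    except Exception:
        return "Test failure:\n" + traceback.format_exc(limit=1)
    return "All checks passed."

def level_feedback_like(prompt, agent_completions, num_agents, **kwargs):
    first_test = kwargs.get("first_test", "")  # hypothetical kwarg for illustration
    return [
        f"{prompt}\nYour previous code:\n{agent_completions[i]}\n"
        f"Diagnosis: {diagnose(agent_completions[i], first_test)}\nPlease revise."
        for i in range(num_agents)
    ]

A real environment would execute the code in an isolated sandbox rather than directly in the training process.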