REINFORCE is a class of policy gradient methods that optimize the policy directly using sampled returns. Because of their simplicity and efficiency, REINFORCE-style methods such as GRPO, Dr. GRPO, RLOO, ReMax, TreeRPO, and REINFORCE++ have been widely used to fine-tune LLMs. REINFORCE can be extended to multi-agent settings, where multiple LLM agents respond synchronously and their joint responses form a solution at each turn, receiving a shared reward.

MA-REINFORCE#

The naive Multi‑Agent REINFORCE (MA-REINFORCE) can be expressed as:

\[ J(\theta_i) = \mathbb{E}_{\mathbf{o}_0 \sim \mathcal{D}, \mathbf{h}_t \sim \boldsymbol{\pi}_{\boldsymbol{\theta}}} \Bigg[\sum_{t=0}^{H-1} R_t \cdot \log \pi_{\theta_i}(a_{i,t}\mid h_{i,t})\Bigg], \]
where \( R_t \) is the return at turn \( t \) and \( H \) is the horizon (i.e., the number of dialog turns). The expectation is taken over initial observations drawn from the dataset \( \mathcal{D} \) and joint histories generated by following the joint policy \( \boldsymbol{\pi}_{\boldsymbol{\theta}} \).
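As a concrete reading of this objective, the minimal PyTorch-style sketch below (illustrative only, not part of the CoMLRL API) forms the per-agent loss for one sampled episode from per-turn log-probabilities and returns:

```python
import torch

def ma_reinforce_loss(logprobs_i: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    """Illustrative per-agent loss for one sampled episode.

    logprobs_i: shape (H,), log pi_{theta_i}(a_{i,t} | h_{i,t}) for each turn t
    returns:    shape (H,), shared return R_t at each turn (treated as constants)
    """
    # Maximizing J(theta_i) corresponds to minimizing this return-weighted
    # negative log-likelihood; returns are detached so gradients flow only
    # through the log-probabilities.
    return -(returns.detach() * logprobs_i).sum()
```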

REINFORCE methods do not use a critic model for value estimation, so their policy gradient estimates can have high variance due to environment stochasticity and long-horizon credit assignment. Two common ways to reduce this variance are subtracting an action-independent baseline and updating with more samples, e.g., using \( K \) samples for value estimation of each joint history \( \mathbf{h}_t \):

\[ J(\theta_i) = \mathbb{E}_{\mathbf{o}_0 \sim \mathcal{D}, \mathbf{h}_t \sim \boldsymbol{\pi}_{\boldsymbol{\theta}}} \Bigg[\frac{1}{K}\sum_{k=1}^{K} \sum_{t=0}^{H-1} \left(R^{k}_t - b(\mathbf{h}_t)\right) \cdot \log \pi_{\theta_i}(a^{k}_{i,t}\mid h_{i,t})\Bigg], \]
where the baseline \( b(\mathbf{h}_t) \) is action-independent.
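To see why an action-independent baseline helps, the following self-contained toy (not part of CoMLRL) compares single-sample score-function gradient estimates with and without a baseline on a one-step Bernoulli bandit; both estimators target the same gradient, but the baseline-subtracted one has lower variance.

```python
import numpy as np

# Toy one-step example: a Bernoulli policy pi(a=1) = p with deterministic rewards R(a).
rng = np.random.default_rng(0)
p = 0.3
rewards = np.array([0.0, 1.0])          # R(a=0), R(a=1)

def score(a):
    """d/dp log pi(a) for the Bernoulli policy parameterized directly by p."""
    return a / p - (1 - a) / (1 - p)

a = rng.binomial(1, p, size=100_000)    # sampled actions
r = rewards[a]
b = p * rewards[1] + (1 - p) * rewards[0]   # action-independent baseline: E[R]

g_raw = r * score(a)                    # vanilla REINFORCE estimates
g_base = (r - b) * score(a)             # baseline-subtracted estimates

print(g_raw.mean(), g_base.mean())      # both ~1.0: the true gradient d/dp E[R]
print(g_raw.var(), g_base.var())        # ~2.3 vs ~0.76: lower variance with the baseline
```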

MAGRPO#

Multi‑Agent Group‑Relative Policy Optimization (MAGRPO) is an instantiation of MA-REINFORCE inspired by GRPO, where the group-average baseline is the mean return of the \( K \) samples:

\[ J(\theta_i) = \mathbb{E}_{\mathbf{o}_0 \sim \mathcal{D}, \mathbf{h}_t \sim \boldsymbol{\pi}_{\boldsymbol{\theta}}} \Bigg[\frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{H-1} \left(R^{k}_t - \frac{1}{K}\sum_{l=1}^{K}R^{l}_t\right) \cdot \log \pi_{\theta_i}(a^{k}_{i,t}\mid h_{i,t})\Bigg]. \]
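As a concrete illustration of the group-average baseline, the sketch below (illustrative values, not CoMLRL internals) computes the group-relative advantages for \( K \) sampled returns at one turn:

```python
import numpy as np

returns_t = np.array([0.2, 0.8, 0.5, 0.5])   # R^k_t for K = 4 samples at turn t

baseline = returns_t.mean()                   # group-average baseline, (1/K) * sum_l R^l_t
advantages = returns_t - baseline             # weights for log pi_{theta_i}(a^k_{i,t} | h_{i,t})

print(advantages)                             # ~ [-0.3, 0.3, 0.0, 0.0]
# If advantage_normalization is enabled, these would additionally be
# standardized (an assumption of this sketch, not a confirmed detail).
```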

MAGRPOConfig parameters (a configuration sketch follows the list):

  • num_agents: Number of agents
  • num_turns: Number of turns per episode
  • num_train_epochs: Number of training epochs
  • agent_learning_rate: Learning rate
  • logging_steps: Log every N steps
  • num_generations: Number of generations to sample per prompt for each agent
  • max_new_tokens: Maximum number of new tokens to generate
  • temperature: Temperature for sampling
  • top_p: Top-p for sampling
  • top_k: Top-k for sampling
  • discount: Discount factor gamma over turns for returns
  • joint_mode: Joint action composition (aligned for index-aligned, cross for Cartesian product)
  • early_termination_threshold: Stop rollouts when the mean reward exceeds this threshold
  • rollout_buffer_size: Number of node samples to buffer before update
  • train_batch_size: Mini-batch size within each update
  • advantage_normalization: Whether to normalize advantages
  • eval_interval: Run evaluation every N training batches
  • eval_num_samples: Number of samples to evaluate per evaluation run
  • eval_batch_size: Eval dataloader batch size
  • external_prompt_passthrough: Use external prompts directly in multi-turn
  • advantage_mode: Baseline mode (mean, max, rloo, raw)
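For orientation, a MAGRPOConfig might be constructed as in the sketch below. The field names follow the list above, but the shown values, the string forms for joint_mode and advantage_mode, and the import path for MAGRPOConfig are assumptions to verify against the CoMLRL source.

```python
from comlrl.trainers.reinforce import MAGRPOConfig  # import path assumed to match MAGRPOTrainer's module

config = MAGRPOConfig(
    num_agents=2,
    num_turns=2,
    num_train_epochs=1,
    agent_learning_rate=1e-6,
    num_generations=4,          # K samples per prompt for each agent (at least 2)
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    discount=1.0,               # no discounting over turns
    joint_mode="aligned",       # index-aligned joint actions
    advantage_mode="mean",      # group-average baseline (MAGRPO)
    advantage_normalization=False,
    logging_steps=10,
)
```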

MAGRPOTrainer setup (a construction sketch follows the list):

  • agent_model or agents: Model identifier string for homogeneous agents, or a list of per-agent models via agents (with multiple agents, agent_model must still be a single string)
  • num_agents: Number of agents
  • tokenizer: The tokenizer (required)
  • reward_func: Callable that returns a list of floats (required)
  • reward_processor: Optional processor to apply to rewards (e.g., scaling)
  • formatters: Single callable or list of callables for each agent to format prompts
  • args: Instance of MAGRPOConfig (optional)
  • train_dataset: Training dataset (required)
  • eval_dataset: Evaluation dataset (optional)
  • model_config: Model configuration dict (optional)
  • wandb_config: Configuration for Weights & Biases logging (optional)
  • external_transition: Function providing transitions between turns
  • eval_logger: Evaluation logger function (optional)
  • eval_aggregator: Evaluation aggregator function (optional)
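The sketch below wires these arguments together. The dataset schema, the reward function's call signature, the model identifier, and the train() entry point are assumptions for illustration, not confirmed CoMLRL API details.

```python
from comlrl.trainers.reinforce import MAGRPOTrainer
from datasets import Dataset
from transformers import AutoTokenizer

# Column names and the reward function's arguments are assumptions for this
# sketch; check them against the CoMLRL examples for your task.
train_dataset = Dataset.from_dict({"prompt": ["Implement a stack.", "Implement a queue."]})
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

def reward_func(completions, **kwargs):
    # Placeholder shared reward: one float per joint sample.
    return [0.0 for _ in completions]

trainer = MAGRPOTrainer(
    agent_model="Qwen/Qwen2.5-0.5B-Instruct",  # homogeneous agents share one model id
    num_agents=2,
    tokenizer=tokenizer,
    reward_func=reward_func,
    train_dataset=train_dataset,
    args=config,                               # the MAGRPOConfig sketched above
)
trainer.train()                                # assumes a Trainer-style entry point
```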

For simplicity, MAGRPO computes the policy gradient from the current policy's samples, without importance sampling or ratio clipping. Since it does not use a critic model, no value_clip_range applies.

The trainer uses a fixed training DataLoader batch size of 1 and requires num_generations of at least 2 for the group baseline computation. Training uses batch gradient descent by default, with train_batch_size equal to rollout_buffer_size.

Other Variants#

CoMLRL also provides other MA-REINFORCE variants with different baselines; a short baseline comparison sketch follows the list:

  • MARLOO: Multi‑Agent REINFORCE Leave‑One‑Out. The baseline for each sample is the mean return of the other samples in the group (leave‑one‑out) at the same turn.
\[ J(\theta_i) = \mathbb{E}_{\mathbf{o}_0 \sim \mathcal{D}, \mathbf{h}_t \sim \boldsymbol{\pi}_{\boldsymbol{\theta}}} \Bigg[\frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{H-1} \left(R^{k}_t - \frac{1}{K-1}\sum_{l=1, l\neq k}^{K}R^{l}_t\right) \cdot \log \pi_{\theta_i}(a^{k}_{i,t}\mid h_{i,t})\Bigg]. \]
  • MAREMAX: Multi‑Agent REINFORCE with Group Max. The baseline is the maximum return in the group at the same turn.
\[ J(\theta_i) = \mathbb{E}_{\mathbf{o}_0 \sim \mathcal{D}, \mathbf{h}_t \sim \boldsymbol{\pi}_{\boldsymbol{\theta}}} \Bigg[\frac{1}{K}\sum_{k=1}^{K}\sum_{t=0}^{H-1} \left(R^{k}_t - \mathrm{max}_l\, R^l_t \right) \cdot \log \pi_{\theta_i}(a^{k}_{i,t}\mid h_{i,t})\Bigg]. \]
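For comparison, the sketch below computes the leave-one-out and group-max baselines on the same illustrative returns used earlier; it mirrors the formulas above rather than CoMLRL internals.

```python
import numpy as np

returns_t = np.array([0.2, 0.8, 0.5, 0.5])                      # R^k_t for K = 4 samples at turn t
K = len(returns_t)

adv_rloo = returns_t - (returns_t.sum() - returns_t) / (K - 1)  # leave-one-out baseline (MARLOO)
adv_max = returns_t - returns_t.max()                           # group-max baseline (MAREMAX), always <= 0

print(adv_rloo)   # ~ [-0.4, 0.4, 0.0, 0.0]
print(adv_max)    # ~ [-0.6, 0.0, -0.3, -0.3]
```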

These variant classes, along with MA-REINFORCE, are derived from comlrl.trainers.reinforce.MAGRPOTrainer. Their trainer and configuration interfaces are the same as those of MAGRPOTrainer and MAGRPOConfig.