REINFORCE is a class of policy gradient methods that optimize the policy directly using sampled returns. REINFORCE-style methods have been widely used to fine-tune LLMs because of their simplicity and efficiency, e.g., GRPO, Dr. GRPO, RLOO, ReMax, TreeRPO, and REINFORCE++. REINFORCE can be extended to multi-agent settings, where multiple LLM agents respond synchronously and their joint responses at each turn form a solution that receives a shared reward.
## MA-REINFORCE
The naive Multi‑Agent REINFORCE (MA-REINFORCE) can be expressed as:
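(The following is a standard sketch of the gradient; CoMLRL's exact notation may differ.) For agent \( i \) with parameters \( \theta_i \),

\[
\nabla_{\theta_i} J(\boldsymbol{\theta}) = \mathbb{E}\!\left[ \sum_{t} \nabla_{\theta_i} \log \pi_{\theta_i}\!\left(a_t^{i} \,\middle|\, h_t^{i}\right) G_t \right],
\qquad
G_t = \sum_{t' \ge t} \gamma^{\,t'-t}\, r_{t'},
\]

where the expectation is over joint trajectories sampled from the current joint policy, \( h_t^i \) is agent \( i \)'s component of the joint history \( \mathbf{h}_t \), and \( r_t \) is the shared reward at turn \( t \).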
REINFORCE methods do not use a critic model for value estimation, so their policy gradient estimates can have high variance due to the stochasticity of the environment and long-horizon credit assignment. There are two common approaches to reduce this variance: subtracting an action-independent baseline, or updating with more samples, e.g., using \( K \) samples to estimate the value of each joint history \( \mathbf{h}_t \).
## MAGRPO
Multi‑Agent Group‑Relative Policy Optimization (MAGRPO) is an instantiation of MA-REINFORCE inspired by GRPO, where the group-average baseline is the mean return of \( K \) samples:
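(Again a sketch with assumed notation, consistent with the gradient above.) For the \( k \)-th of \( K \) sampled joint responses at turn \( t \),

\[
A_t^{(k)} = G_t^{(k)} - \frac{1}{K} \sum_{j=1}^{K} G_t^{(j)},
\]

and each agent's policy gradient uses \( A_t^{(k)} \) in place of the raw return \( G_t^{(k)} \).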
`MAGRPOConfig` parameters:
- `num_agents`: Number of agents
- `num_turns`: Number of turns per episode
- `num_train_epochs`: Number of training epochs
- `agent_learning_rate`: Learning rate
- `logging_steps`: Log every N steps
- `num_generations`: Number of generations to sample per prompt for each agent
- `max_new_tokens`: Maximum number of new tokens to generate
- `temperature`: Temperature for sampling
- `top_p`: Top-p for sampling
- `top_k`: Top-k for sampling
- `discount`: Discount factor gamma over turns for returns
- `joint_mode`: Joint action composition (`aligned` for index-aligned, `cross` for Cartesian product)
- `early_termination_threshold`: Stop rollouts when the mean reward exceeds a threshold
- `rollout_buffer_size`: Number of node samples to buffer before update
- `train_batch_size`: Mini-batch size within each update
- `advantage_normalization`: Whether to normalize advantages
- `eval_interval`: Run evaluation every N training batches
- `eval_num_samples`: Number of samples to evaluate per evaluation run
- `eval_batch_size`: Eval dataloader batch size
- `external_prompt_passthrough`: Use external prompts directly in multi-turn
- `advantage_mode`: Baseline mode (`mean`, `max`, `rloo`, `raw`)
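As a usage sketch, a configuration might be constructed as below. The import path and the particular values are assumptions for illustration; only the parameter names come from the list above.

```python
# Sketch of a MAGRPOConfig; the import path and values shown are
# assumptions and may differ from the released CoMLRL package.
from comlrl.trainers.reinforce import MAGRPOConfig

config = MAGRPOConfig(
    num_agents=2,                 # two LLM agents act at each turn
    num_turns=1,                  # single-turn episodes
    num_generations=4,            # K >= 2 samples per prompt for the group baseline
    agent_learning_rate=1e-6,
    max_new_tokens=256,
    temperature=0.8,
    top_p=0.95,
    discount=1.0,                 # no discounting across turns
    joint_mode="aligned",         # index-aligned joint responses
    advantage_mode="mean",        # group-mean baseline (MAGRPO)
    advantage_normalization=True,
)
```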
`MAGRPOTrainer` setup:
- `agent_model` or `agents`: Model identifier string for homogeneous agents, or list of agent models (multi-agent `agent_model` must be a string)
- `num_agents`: Number of agents
- `tokenizer`: The tokenizer (required)
- `reward_func`: Callable that returns a list of floats (required)
- `reward_processor`: Optional processor to apply to rewards (e.g., scaling)
- `formatters`: Single callable or list of callables for each agent to format prompts
- `args`: Instance of `MAGRPOConfig` (optional)
- `train_dataset`: Training dataset (required)
- `eval_dataset`: Evaluation dataset (optional)
- `model_config`: Model configuration dict (optional)
- `wandb_config`: Configuration for Weights & Biases logging (optional)
- `external_transition`: Function providing transitions between turns
- `eval_logger`: Evaluation logger function (optional)
- `eval_aggregator`: Evaluation aggregator function (optional)
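A minimal end-to-end setup sketch follows. The model identifier, the reward-function signature, and the `train()` call are illustrative assumptions; only the constructor argument names come from the list above.

```python
# Illustrative MAGRPOTrainer setup; anything beyond the argument names
# listed above (e.g., reward function inputs, train()) is an assumption,
# not confirmed CoMLRL API.
from datasets import Dataset
from transformers import AutoTokenizer
from comlrl.trainers.reinforce import MAGRPOTrainer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"   # any causal LM identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)

def reward_func(completions, **kwargs):
    # Toy shared reward per joint sample: longer (up to 50 words) is better.
    return [min(len(c.split()), 50) / 50.0 for c in completions]

train_dataset = Dataset.from_dict(
    {"prompt": ["Write a function that adds two numbers."]}
)

trainer = MAGRPOTrainer(
    agent_model=model_id,        # homogeneous agents share one base model
    num_agents=2,
    tokenizer=tokenizer,
    reward_func=reward_func,
    args=config,                 # the MAGRPOConfig sketched above
    train_dataset=train_dataset,
)
trainer.train()
```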
For simplicity, MAGRPO computes the policy gradient using the current policy's samples, without importance sampling or ratio clipping. Since it does not use a critic model, no `value_clip_range` parameter applies.
The trainer uses a fixed training DataLoader batch size of 1 and requires at least `num_generations=2` for group baseline computation. Training uses batch gradient descent by default, where `train_batch_size = rollout_buffer_size`.
## Other Variants
CoMLRL also provides other MA-REINFORCE variants with different baselines:
- MARLOO: Multi‑Agent REINFORCE Leave‑One‑Out. Baseline is the leave-one-out mean return of the other group samples at the same step.
- MAREMAX: Multi‑Agent REINFORCE with Group Max. Baseline is the maximum group return at the step.
These classes and MA-REINFORCE are derived from `comlrl.trainers.reinforce.MAGRPOTrainer`. Interfaces for the trainer and configuration classes are the same as `MAGRPOTrainer` and `MAGRPOConfig`.
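The sketch below illustrates how the baseline modes exposed via `advantage_mode` differ on a group of sampled returns. It is illustrative rather than the library's implementation; in particular, the function name is hypothetical and leave-one-out is interpreted here as averaging over the other \( K-1 \) group samples.

```python
import numpy as np

def group_advantages(returns, mode="mean"):
    """Illustrative group-baseline advantages for K sampled returns at one step.

    This is a sketch, not CoMLRL's code; `returns` holds the shared return
    of each of the K sampled joint responses.
    """
    G = np.asarray(returns, dtype=float)
    K = G.shape[0]
    if mode == "mean":        # MAGRPO: subtract the group mean
        baseline = G.mean()
    elif mode == "max":       # MAREMAX: subtract the group maximum
        baseline = G.max()
    elif mode == "rloo":      # MARLOO: leave-one-out mean of the other samples
        baseline = (G.sum() - G) / (K - 1)
    else:                     # "raw": no baseline
        baseline = 0.0
    return G - baseline

print(group_advantages([1.0, 0.5, 0.0], mode="rloo"))  # [ 0.75  0.   -0.75]
```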