REINFORCE optimizes the policy directly using sampled returns, and an action-independent baseline can be subtracted from the return to reduce the variance of the gradient estimate without introducing bias. REINFORCE-style methods have been widely used to fine-tune LLMs because of their simplicity and effectiveness, e.g., GRPO, Dr. GRPO, RLOO, ReMax, TreeRPO, and REINFORCE++.
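As a reference point for the variants below, the standard single-policy REINFORCE gradient with an action-independent baseline b can be written as follows (textbook notation, not taken from the CoMLRL source):

$$
\nabla_\theta J(\theta)
= \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}
\left[ \big(R(s, a) - b\big)\, \nabla_\theta \log \pi_\theta(a \mid s) \right]
$$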
MAREINFORCE#
In the LLM collaboration setting, REINFORCE can be extended to optimize each agent’s policy with joint returns from multiple agents.
- MAREINFORCE: The naive Multi‑Agent REINFORCE without a baseline can be expressed by:
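A sketch of the corresponding per-agent gradient, assuming $n$ agents with policies $\pi_{\theta_i}$, a shared prompt $s$, and a joint return $R(s, a_1, \dots, a_n)$ (the notation here is ours, not copied from the CoMLRL source):

$$
\nabla_{\theta_i} J(\theta_i)
= \mathbb{E}_{a_1 \sim \pi_{\theta_1},\, \dots,\, a_n \sim \pi_{\theta_n}}
\left[ R(s, a_1, \dots, a_n)\, \nabla_{\theta_i} \log \pi_{\theta_i}(a_i \mid s) \right]
$$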
This class is derived from comlrl.trainers.magrpo.MAGRPOTrainer. Interfaces for the trainer and configuration classes are the same as MAGRPOTrainer and MAGRPOConfig.
MAGRPO#
Multi‑Agent Group‑Relative Policy Optimization optimizes each agent with a group‑relative baseline computed among sibling joint actions at the same node.
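With $G$ sampled joint actions at a node and joint returns $R^{(1)}, \dots, R^{(G)}$, the group-relative advantage for sample $j$ can be sketched as follows (whether the implementation also divides by the group standard deviation is not specified here):

$$
A^{(j)} = R^{(j)} - \frac{1}{G} \sum_{k=1}^{G} R^{(k)}
$$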
MAGRPOConfig inherits from TrainingArguments and provides parameters for both single-turn and multi-turn training:
- num_agents: Number of agents (default: 2)
- num_generations: Number of generations to sample per prompt for each agent (default: 4)
- max_new_tokens: Maximum number of new tokens to generate (default: 256)
- temperature: Temperature for sampling (default: 0.7)
- top_p: Top-p for sampling (default: 0.9)
- num_turns: Number of turns per episode; set > 1 for multi-turn training (default: 1)
- discount: Discount factor gamma over turns for returns (default: 0.9)
- joint_mode: Joint action composition, either 'aligned' (index-aligned, default) or 'cross' (Cartesian product)
- termination_threshold: Early stop a branch if the mean reward exceeds this threshold (default: None)
- eval_interval: Run evaluation every N training batches (default: 4)
- eval_num_samples: Number of samples to evaluate per evaluation run (default: 4)
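A minimal configuration sketch using the parameters above; the import path for MAGRPOConfig is an assumption based on the comlrl.trainers.magrpo module referenced later and may differ in your install:

```python
# Assumed import path; adjust if MAGRPOConfig lives elsewhere in comlrl.
from comlrl.trainers.magrpo import MAGRPOConfig

config = MAGRPOConfig(
    output_dir="./magrpo_runs",  # inherited from transformers.TrainingArguments
    num_agents=2,                # two collaborating agents
    num_generations=4,           # joint samples per prompt (>= 2 for the group baseline)
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    num_turns=2,                 # > 1 enables multi-turn training
    discount=0.9,                # gamma applied over turns when computing returns
    joint_mode="aligned",        # or "cross" for the Cartesian product of generations
    eval_interval=4,
    eval_num_samples=4,
)
```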
MAGRPOTrainer accepts either a model string/object for homogeneous agents or a list of agents for heterogeneous setups:
- model or agents: Model string/object for homogeneous agents, or a list of agent models for heterogeneous setups
- num_agents: Number of agents (default: 2)
- tokenizer: The tokenizer (required)
- train_dataset: Training dataset (required)
- reward_func: Callable that returns a list of floats (required)
- reward_processor: Optional processor applied to rewards (e.g., scaling)
- formatters: Single callable, or a list of callables (one per agent), to format dataset items into prompts
- external_transition: Function providing transitions between turns (required for multi-turn training)
- eval_dataset: Evaluation dataset (optional)
- eval_logger: Evaluation logger function (optional)
- eval_aggregator: Evaluation aggregator function (optional)
- wandb_config: Configuration for Weights & Biases logging (optional)
- model_config: Model configuration dict (optional)
- args: Instance of MAGRPOConfig (optional)
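A sketch of wiring the trainer for homogeneous agents. The constructor arguments follow the list above, but the reward function and formatter signatures, the model name, and the call to train() are illustrative assumptions rather than excerpts from the CoMLRL API reference:

```python
from transformers import AutoTokenizer
from comlrl.trainers.magrpo import MAGRPOTrainer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder model choice
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def reward_func(prompts, completions, **kwargs):
    # Toy joint reward (assumed signature): one float per sampled joint action,
    # here rewarding agents whose outputs have similar lengths.
    return [-abs(len(c[0]) - len(c[1])) / 100.0 for c in completions]

def formatter(item):
    # Assumed dataset field name; maps a dataset item to the prompt each agent sees.
    return f"Solve the following task collaboratively:\n{item['question']}"

trainer = MAGRPOTrainer(
    model=MODEL_NAME,             # homogeneous agents share one base model
    num_agents=2,
    tokenizer=tokenizer,
    train_dataset=train_dataset,  # assumed to be loaded elsewhere
    reward_func=reward_func,
    formatters=formatter,         # a single callable applied to every agent
    args=config,                  # the MAGRPOConfig sketched above
)
trainer.train()                   # assuming the usual Trainer-style entry point
```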
CoMLRL implements on-policy GRPO, which computes the policy gradient using the current policy’s samples without importance sampling or ratio clipping.
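In other words, the update is the plain score-function estimator over the current batch of $G$ samples, with the group-relative advantage in place of a clipped importance ratio; a sketch in the notation used above:

$$
\nabla_{\theta_i} J \approx \frac{1}{G} \sum_{j=1}^{G} A^{(j)}\, \nabla_{\theta_i} \log \pi_{\theta_i}\!\left(a_i^{(j)} \mid s\right)
$$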
The trainer enforces per_device_train_batch_size=1 and requires at least 2 generations for group baseline computation.
Other Variants#
CoMLRL also implements other Multi-Agent REINFORCE variants with different baselines:
- MARLOO: Multi‑Agent REINFORCE Leave‑One‑Out. Baseline is the mean return of other agents (leave‑one‑out) at the same step.
- MAREMAX: Multi‑Agent REINFORCE with Group Max. Baseline is the maximum group return at the step.
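Read literally, the two baselines above can be sketched as follows, where $R_j$ denotes the return attributed to agent $j$ among $n$ agents and $R^{(k)}$ the return of the $k$-th of $G$ joint actions in the group; the exact indexing is an assumption about the implementation:

$$
b_i^{\text{MARLOO}} = \frac{1}{n-1} \sum_{j \neq i} R_j,
\qquad
b^{\text{MAREMAX}} = \max_{1 \le k \le G} R^{(k)}
$$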
These classes are derived from comlrl.trainers.magrpo.MAGRPOTrainer. Interfaces for the trainer and configuration classes are the same as MAGRPOTrainer and MAGRPOConfig.