REINFORCE optimizes the policy directly using sampled returns; an action-independent baseline can be subtracted to reduce the variance of the gradient estimate. REINFORCE-style methods have been widely used to fine-tune LLMs because of their simplicity and effectiveness, e.g., GRPO, Dr. GRPO, RLOO, ReMax, TreeRPO, and REINFORCE++.

MAREINFORCE#

In the LLM collaboration setting, REINFORCE can be extended to optimize each agent’s policy with joint returns from multiple agents.

  • MAREINFORCE: The naive Multi‑Agent REINFORCE objective, without a baseline, can be written as:
\[ J(\theta_i) = \mathbb{E}_{\mathbf{o}_0 \sim \mathcal{D}, \mathbf{h}^\mathcal{G} \sim \mathbf{\pi}_{\mathbf{\theta}}} \Bigg[\frac{1}{|\mathcal{G}|}\sum_{g \in \mathcal{G}} R^{(g)}_t \cdot \log \pi_{\theta_i}(a^{(g)}_{i,t}\mid h_{i,t})\Bigg]. \]
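Treating the sampled trajectories and returns as fixed, as is standard for this surrogate form, differentiating with respect to \(\theta_i\) recovers the score-function (REINFORCE) gradient estimator:

\[ \nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{\mathbf{o}_0 \sim \mathcal{D}, \mathbf{h}^\mathcal{G} \sim \mathbf{\pi}_{\mathbf{\theta}}} \Bigg[\frac{1}{|\mathcal{G}|}\sum_{g \in \mathcal{G}} R^{(g)}_t \cdot \nabla_{\theta_i}\log \pi_{\theta_i}(a^{(g)}_{i,t}\mid h_{i,t})\Bigg], \]

so each agent weights the gradients of its own action log-probabilities by the shared group returns, with no variance-reducing baseline subtracted.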

The MAREINFORCE trainer is derived from comlrl.trainers.magrpo.MAGRPOTrainer; its trainer and configuration interfaces are the same as MAGRPOTrainer and MAGRPOConfig.

MAGRPO#

Multi‑Agent Group‑Relative Policy Optimization (MAGRPO) optimizes each agent's policy with a group‑relative baseline computed over the sibling joint actions sampled at the same node:

\[ J(\theta_i) = \mathbb{E}_{\mathbf{o}_0 \sim \mathcal{D}, \mathbf{h}^\mathcal{G} \sim \mathbf{\pi}_{\mathbf{\theta}}}\left[ \frac{1}{|\mathcal{G}|}\sum_{g \in \mathcal{G}} \Big(R^{(g)}_t - \operatorname{mean}(R^{\mathcal{G}}_t)\Big) \cdot \log \pi_{\theta_i}\big(a^{(g)}_{i,t} \mid h_{i,t}\big) \right]. \]
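For illustration only (this is not the library's internal implementation), the group-relative weighting above can be sketched with PyTorch, assuming the per-generation returns and the summed log-probabilities of one agent's action tokens under the current policy are already available:

```python
import torch

def magrpo_surrogate_loss(returns: torch.Tensor, logprobs: torch.Tensor) -> torch.Tensor:
    """Sketch of the group-relative surrogate for one agent at one node.

    returns:  shape (G,) - return R^(g) of each of the G sibling joint actions
    logprobs: shape (G,) - summed log pi_theta_i over the tokens of agent i's
              action in generation g, computed under the current policy
    """
    # Group-relative advantage: subtract the mean return over the sibling group.
    advantages = returns - returns.mean()
    # Negative surrogate objective; minimizing it ascends the objective above.
    return -(advantages * logprobs).mean()

# Toy usage with G = 4 sibling generations.
returns = torch.tensor([1.0, 0.0, 0.5, 0.25])
logprobs = torch.randn(4, requires_grad=True)
loss = magrpo_surrogate_loss(returns, logprobs)
loss.backward()
```

Because the log-probabilities come from the same policy that produced the samples, no importance ratio or clipping appears in the loss.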

MAGRPOConfig inherits from TrainingArguments and provides parameters for both single-turn and multi-turn training (a construction sketch follows the list):

  • num_agents: Number of agents (default: 2)
  • num_generations: Number of generations to sample per prompt for each agent (default: 4)
  • max_new_tokens: Maximum number of new tokens to generate (default: 256)
  • temperature: Temperature for sampling (default: 0.7)
  • top_p: Top-p for sampling (default: 0.9)
  • num_turns: Number of turns per episode; set >1 for multi-turn (default: 1)
  • discount: Discount factor gamma over turns for returns (default: 0.9)
  • joint_mode: Joint action composition - 'aligned' (index-aligned, default) or 'cross' (Cartesian product)
  • termination_threshold: Early stop a branch if mean reward exceeds this threshold (default: None)
  • eval_interval: Run evaluation every N training batches (default: 4)
  • eval_num_samples: Number of samples to evaluate per evaluation run (default: 4)
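A minimal construction sketch, assuming MAGRPOConfig is importable from comlrl.trainers.magrpo (the exact import path is not stated above) and using only the parameters listed:

```python
from comlrl.trainers.magrpo import MAGRPOConfig  # import path assumed

config = MAGRPOConfig(
    output_dir="./magrpo_runs",   # inherited from TrainingArguments
    num_agents=2,
    num_generations=4,            # at least 2 are required for the group baseline
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    num_turns=2,                  # > 1 enables multi-turn training
    discount=0.9,
    joint_mode="aligned",
    eval_interval=4,
    eval_num_samples=4,
)
```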

MAGRPOTrainer accepts either a model string/object for homogeneous agents or a list of agents for heterogeneous setups (a construction sketch follows the list):

  • model or agents: Model string/object for homogeneous agents, or list of agent models
  • num_agents: Number of agents (default: 2)
  • tokenizer: The tokenizer (required)
  • train_dataset: Training dataset (required)
  • reward_func: Callable that returns a list of floats (required)
  • reward_processor: Optional processor to apply to rewards (e.g., scaling)
  • formatters: Single callable or list of callables for each agent to format dataset items into prompts
  • external_transition: Function providing transitions between turns (required for multi-turn training)
  • eval_dataset: Evaluation dataset (optional)
  • eval_logger: Evaluation logger function (optional)
  • eval_aggregator: Evaluation aggregator function (optional)
  • wandb_config: Configuration for Weights & Biases logging (optional)
  • model_config: Model configuration dict (optional)
  • args: Instance of MAGRPOConfig (optional)
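A construction sketch under the same assumptions; the toy dataset, formatter, and reward function below are defined here purely for illustration and are not part of CoMLRL:

```python
from datasets import Dataset
from transformers import AutoTokenizer
from comlrl.trainers.magrpo import MAGRPOTrainer, MAGRPOConfig  # import path assumed

# Toy single-prompt dataset; any model/tokenizer pair could be substituted.
train_dataset = Dataset.from_list([{"task": "Write a two-line poem about the sea."}])
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

def formatter(item):
    # Formats a dataset item into a prompt; a single callable is shared by both agents.
    return f"Collaborate on the following task:\n{item['task']}"

def reward_func(completions, **kwargs):
    # Toy joint reward returning one float per sampled joint completion;
    # the argument structure is an assumption for illustration only.
    return [1.0 for _ in completions]

trainer = MAGRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # homogeneous agents from one model string
    num_agents=2,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    reward_func=reward_func,
    formatters=formatter,
    args=MAGRPOConfig(output_dir="./magrpo_runs", num_generations=4),
)
trainer.train()  # assumes the usual HF-style training entry point
```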

CoMLRL implements on-policy GRPO, which computes the policy gradient using the current policy’s samples without importance sampling or ratio clipping.

The trainer enforces per_device_train_batch_size=1 and requires at least 2 generations for group baseline computation.

Other Variants#

CoMLRL also implements other Multi-Agent REINFORCE variants with different baselines (a comparison sketch follows the list):

  • MARLOO: Multi‑Agent REINFORCE Leave‑One‑Out. The baseline is the leave‑one‑out mean: the average return of the other joint generations in the group at the same step.
\[ J(\theta_i) = \mathbb{E}_{\mathbf{o}_0 \sim \mathcal{D}, \mathbf{h}^\mathcal{G} \sim \mathbf{\pi}_{\mathbf{\theta}}} \Bigg[\frac{1}{|\mathcal{G}|}\sum_{g \in \mathcal{G}} \Big( R^{(g)}_t - \sum_{k\in \mathcal{G},\, k\neq g}\tfrac{R^{(k)}_t}{|\mathcal{G}|-1} \Big) \cdot \log \pi_{\theta_i}(a^{(g)}_{i,t}\mid h_{i,t}) \Bigg]; \]
  • MAREMAX: Multi‑Agent REINFORCE with Group Max. The baseline is the maximum return in the group at the same step.
\[ J(\theta_i) = \mathbb{E}_{\mathbf{o}_0 \sim \mathcal{D}, \mathbf{h}^\mathcal{G} \sim \mathbf{\pi}_{\mathbf{\theta}}} \Bigg[\frac{1}{|\mathcal{G}|}\sum_{g \in \mathcal{G}} \Big( R^{(g)}_t - \max(R_t^{\mathcal{G}}) \Big) \cdot \log \pi_{\theta_i}(a^{(g)}_{i,t}\mid h_{i,t}) \Bigg]. \]
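For illustration only (plain Python, not the library's internals), the three baselines can be compared on the same group of returns:

```python
def advantages(returns, baseline="mean"):
    """Per-generation advantages R^(g) - b for a group of returns.

    baseline: 'mean' (MAGRPO), 'loo' (MARLOO), or 'max' (MAREMAX).
    """
    n = len(returns)
    if baseline == "mean":
        b = [sum(returns) / n] * n
    elif baseline == "loo":
        # Leave-one-out: mean of the *other* generations in the group.
        b = [(sum(returns) - r) / (n - 1) for r in returns]
    elif baseline == "max":
        b = [max(returns)] * n
    else:
        raise ValueError(baseline)
    return [r - bi for r, bi in zip(returns, b)]

returns = [1.0, 0.0, 0.5, 0.25]
print(advantages(returns, "mean"))  # [0.5625, -0.4375, 0.0625, -0.1875]
print(advantages(returns, "loo"))   # [0.75, -0.5833..., 0.0833..., -0.25]
print(advantages(returns, "max"))   # [0.0, -1.0, -0.5, -0.75]
```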

These classes derive from comlrl.trainers.magrpo.MAGRPOTrainer; their trainer and configuration interfaces are the same as MAGRPOTrainer and MAGRPOConfig.