Actor-Critic (AC) methods are widely used policy gradient methods that employ critics to facilitate training. AC methods can achieve lower variance and better sample efficiency than REINFORCE, but this requires careful design and tuning of the critic to ensure stable training. In Multi-Agent Reinforcement Learning (MARL), Actor-Critic methods can be instantiated as Multi-Agent Actor-Critic (MAAC) and Independent Actor-Critic (IAC).

MAAC#

Multi-Agent Actor-Critic (MAAC) uses a Centralized Critic (CC) across agents to evaluate the values of joint histories \( V_{\boldsymbol{\phi}}(\mathbf{h}_t) \) or joint history-action pairs \( Q_{\boldsymbol{\psi}}(\mathbf{h}_t, \mathbf{a}_t) \). The policy gradient of each agent is:

\[ \nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{\boldsymbol{\pi}}\left[\sum_{t=0}^{H-1} \nabla_{\theta_i} \log \pi_{\theta_i}(a_{i,t}|h_{i,t})\,\boldsymbol{\delta}_t\right] \]

where \( \boldsymbol{\delta}_t = r_t + \gamma V_{\boldsymbol{\phi}}(\mathbf{h}_{t+1}) - V_{\boldsymbol{\phi}}(\mathbf{h}_{t}) \), and the critic is updated by:

\[ \mathcal{L}(\boldsymbol{\phi}) = \mathbb{E}_{\boldsymbol{\pi}}\left[\sum_{t=0}^{H-1} \big(r_t + \gamma V_{\boldsymbol{\phi}}(\mathbf{h}_{t+1}) - V_{\boldsymbol{\phi}}(\mathbf{h}_{t})\big)^2\right]. \]
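
As a concrete illustration, the sketch below computes the TD-error advantage \( \boldsymbol{\delta}_t \) and the squared-TD critic loss from the equations above using plain PyTorch; the tensor names, shapes, and terminal-value handling are assumptions for illustration, not CoMLRL internals.

```python
import torch

def maac_td_and_critic_loss(rewards, values, gamma=0.99):
    """Illustrative only: rewards[t] = r_t and values[t] = V(h_t) for t = 0..H.

    Returns the TD-error advantages delta_t and the squared-TD critic loss.
    """
    # delta_t = r_t + gamma * V(h_{t+1}) - V(h_t)
    deltas = rewards + gamma * values[1:] - values[:-1]
    # Critic loss: sum of squared TD errors (gradients flow into the critic values)
    critic_loss = (deltas ** 2).sum()
    # Advantages are treated as constants in the policy-gradient term
    advantages = deltas.detach()
    return advantages, critic_loss

# Example with horizon H = 3 (values holds H + 1 entries, ending at V(h_H))
rewards = torch.tensor([0.1, 0.0, 1.0])
values = torch.tensor([0.5, 0.4, 0.6, 0.0], requires_grad=True)
advantages, critic_loss = maac_td_and_critic_loss(rewards, values)
```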

MAACConfig parameters (a construction sketch follows the list):

  • num_agents: Number of agents
  • num_turns: Number of turns
  • critic_type: Critic target type (v for V(h), q for Q(h,a))
  • num_train_epochs: Number of training epochs
  • agent_learning_rate: Learning rate for agents
  • critic_learning_rate: Learning rate for shared critic
  • value_loss_coef: Weight on critic loss
  • advantage_normalization: Whether to normalize advantages before updates
  • rollout_buffer_size: Number of samples to collect per agent before an update
  • train_batch_size: Mini-batch size within each update
  • max_new_tokens: Maximum tokens to generate per completion
  • temperature: Temperature for sampling
  • top_p: Top-p for nucleus sampling
  • top_k: Top-k for sampling
  • num_generations: Number of generations per prompt per agent
  • external_prompt_passthrough: Use external prompts directly in multi-turn
  • discount: Discount factor for multi-turn returns
  • early_termination_threshold: Optional early-stop threshold for multi-turn
  • eval_interval: Evaluation interval (in training batches)
  • eval_num_samples: Number of evaluation samples per interval
  • eval_batch_size: Eval dataloader batch size
  • logging_steps: Log every N training batches
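
A minimal construction sketch, assuming MAACConfig is importable as shown and accepts the fields above as keyword arguments (the import path and the specific values are illustrative assumptions):

```python
# Assumed import path; adjust to match your CoMLRL installation.
from comlrl.trainers.maac import MAACConfig

config = MAACConfig(
    num_agents=2,
    num_turns=1,
    critic_type="v",              # "v" for V(h), "q" for Q(h, a)
    num_train_epochs=3,
    agent_learning_rate=1e-5,
    critic_learning_rate=5e-5,
    value_loss_coef=0.5,
    advantage_normalization=True,
    rollout_buffer_size=32,
    train_batch_size=32,          # batch gradient descent: equal to rollout_buffer_size
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    num_generations=1,
    discount=0.99,
    eval_interval=10,
    logging_steps=1,
)
```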

MAACTrainer setup (an end-to-end sketch follows the list):

  • agent_model or agents: Either agent_model, a model identifier string shared by homogeneous agents (with multiple agents, agent_model must be a string), or agents, a list of per-agent models
  • critic_model or critics: Required single shared critic (either one identifier or a 1-element list)
  • tokenizer: Tokenizer (required)
  • reward_func: Callable returning rewards (required)
  • reward_processor: Optional reward post-processor
  • formatters: Single callable or list for per-agent prompt formatting
  • args: Instance of MAACConfig (optional)
  • train_dataset: Training dataset (required)
  • eval_dataset: Optional evaluation dataset
  • model_config: Extra model kwargs (optional)
  • wandb_config: Weights & Biases logging config (optional)
  • metrics_callback: Optional callback for custom metrics
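
An end-to-end construction sketch; the import paths, the reward-function signature, the formatter, the toy dataset, and the train() call are assumptions based on the argument list above rather than a verified CoMLRL example:

```python
from datasets import Dataset
from transformers import AutoTokenizer

# Assumed import path; adjust to match your CoMLRL installation.
from comlrl.trainers.maac import MAACTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)

def reward_func(completions, **kwargs):
    # Placeholder: the docs only require a callable returning rewards;
    # the exact signature expected by the trainer is an assumption here.
    return [1.0 for _ in completions]

def formatter(example):
    # Placeholder formatter turning a dataset item into a prompt.
    return f"Solve the following task:\n{example['question']}"

train_dataset = Dataset.from_list([{"question": "What is 2 + 2?"}])

trainer = MAACTrainer(
    agent_model=model_name,    # homogeneous agents from a single identifier
    critic_model=model_name,   # single shared centralized critic
    tokenizer=tokenizer,
    reward_func=reward_func,
    formatters=formatter,
    args=config,               # MAACConfig from the previous sketch
    train_dataset=train_dataset,
)
trainer.train()                # assuming a standard train() entry point
```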

For simplicity, MAAC computes the policy gradient using the current policy’s samples without importance sampling or ratio clipping. The value_clip_range parameter is not applicable to MAAC.

The trainer uses a fixed training DataLoader batch size of 1. For num_turns > 1, provide an external_transition and set num_generations=1. Training uses batch gradient descent by default, with train_batch_size equal to rollout_buffer_size.

IAC#

Independent Actor-Critic (IAC) optimizes each agent’s policy independently while using joint returns from multiple agents. Each agent maintains its own actor and critic, while the other agents are treated as part of the environment. The policy gradient for each agent is:

\[ \nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{\boldsymbol{\pi}}\left[\sum_{t=0}^{H-1} \nabla_{\theta_i} \log \pi_{\theta_i}(a_{i,t}|h_{i,t})\,\delta_{i,t}\right] \]

where \( \delta_{i,t} = r_t + \gamma V_{\phi_i}(h_{i,t+1}) - V_{\phi_i}(h_{i,t}) \).

CoMLRL supports two IAC variants:

  • Separate Critic: Uses an independent model for value estimation, separate from the actor. This typically provides more stable training but requires additional storage and VRAM.

  • Shared Model: Attaches a small value head to the transformer backbone, sharing the actor model’s history (or history-action) representations to reduce memory costs (illustrated in the sketch below).
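
For intuition, the shared-model variant can be pictured as a small value head on top of the actor’s hidden states. The sketch below is illustrative only, not CoMLRL’s implementation; the pooling choice and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Illustrative value head mapping a pooled hidden state to a scalar V(h)."""

    def __init__(self, hidden_size: int, value_head_hidden_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, value_head_hidden_dim),
            nn.Tanh(),
            nn.Linear(value_head_hidden_dim, 1),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the shared backbone.
        # Pool the last token's representation as the history summary (an assumption).
        pooled = hidden_states[:, -1, :]
        return self.mlp(pooled).squeeze(-1)   # (batch,) value estimates
```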

The critics are updated by minimizing the TD error:

\[ \mathcal{L}(\phi_i) = \mathbb{E}_{\boldsymbol{\pi}}\left[\sum_{t=0}^{H-1} \big(r_t + \gamma V_{\phi_i}(h_{i,t+1}) - V_{\phi_i}(h_{i,t})\big)^2\right]. \]

When using the shared model (use_separate_critic=false), value clipping via value_clip_range can be applied to improve training stability:

\[ \mathcal{L}^{\text{clip}}(\phi_i) = \max\Big( \big(V_{\phi_i}(h_t) - \hat{V}_t\big)^2,\ \big(V_{\phi_i}^{\text{clip}}(h_t) - \hat{V}_t\big)^2 \Big), \qquad V_{\phi_i}^{\text{clip}}(h_t) = V_{\phi_i}^{\text{old}}(h_t) + \mathrm{clip}\big(V_{\phi_i}(h_t) - V_{\phi_i}^{\text{old}}(h_t),\ -\epsilon,\ \epsilon\big), \]

where \( \hat{V}_t \) is the value target, \( V_{\phi_i}^{\text{old}} \) is the value prediction from rollout time, and \( \epsilon \) is value_clip_range.
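
A minimal sketch of this clipped value loss, assuming the old value predictions were cached at rollout time (tensor names and the final mean reduction are illustrative):

```python
import torch

def clipped_value_loss(values, old_values, targets, clip_range=0.2):
    """Element-wise max of clipped and unclipped squared errors, averaged over the batch."""
    # V_clip = V_old + clip(V - V_old, -eps, eps)
    values_clipped = old_values + torch.clamp(values - old_values, -clip_range, clip_range)
    loss_unclipped = (values - targets) ** 2
    loss_clipped = (values_clipped - targets) ** 2
    return torch.max(loss_unclipped, loss_clipped).mean()
```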

IACConfig provides parameters for configuring Independent Actor-Critic training (a construction sketch follows the list):

  • num_agents: Number of agents
  • num_turns: Number of turns
  • num_train_epochs: Number of training epochs
  • agent_learning_rate: Learning rate for agents
  • critic_learning_rate: Learning rate for critic
  • value_loss_coef: Coefficient for value loss
  • value_clip_range: Clipping range for value function
  • advantage_normalization: Whether to normalize advantages
  • rollout_buffer_size: Number of samples to collect before update
  • train_batch_size: Mini-batch size for policy updates
  • max_new_tokens: Maximum new tokens to generate
  • temperature: Temperature for sampling
  • top_p: Top-p for nucleus sampling
  • top_k: Top-k for sampling
  • num_generations: Number of generations per prompt per agent
  • use_separate_critic: Whether to use separate critic model
  • critic_type: Critic target type (v for V(h), q for Q(h,a))
  • critic_value_head_hidden_dim: Hidden dimension for critic value head
  • value_head_hidden_dim: Hidden dimension for value head in shared-critic mode
  • external_prompt_passthrough: Use external prompts directly in multi-turn
  • discount: Discount factor for multi-turn returns
  • early_termination_threshold: Optional early-stop threshold for multi-turn
  • eval_interval: Evaluation interval (in training batches)
  • eval_num_samples: Number of evaluation samples per interval
  • eval_batch_size: Eval dataloader batch size
  • logging_steps: Log every N training batches
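
A configuration sketch contrasting the two critic modes, assuming IACConfig accepts the fields above as keyword arguments (the import path and values are illustrative assumptions):

```python
# Assumed import path; adjust to match your CoMLRL installation.
from comlrl.trainers.iac import IACConfig

# Separate-critic mode: an independent critic per agent; value_clip_range is not used.
separate_config = IACConfig(
    num_agents=2,
    use_separate_critic=True,
    critic_type="v",
    agent_learning_rate=1e-5,
    critic_learning_rate=5e-5,
    rollout_buffer_size=32,
    train_batch_size=32,
)

# Shared-model mode: value heads attached to the actor backbones, with value clipping.
shared_config = IACConfig(
    num_agents=2,
    use_separate_critic=False,
    value_head_hidden_dim=512,
    value_clip_range=0.2,
    value_loss_coef=0.5,
    advantage_normalization=True,
)
```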

IACTrainer trains agents using Independent Actor-Critic (a construction sketch follows the list):

  • agent_model or agents: Either agent_model, a model identifier string shared by homogeneous agents (with multiple agents, agent_model must be a string), or agents, a list of per-agent models
  • critic_model or critics: Critic identifier or list of critic models when use_separate_critic=true
  • tokenizer: The tokenizer (required)
  • reward_func: Callable that returns a list of floats (required)
  • reward_processor: Optional processor to apply to rewards
  • formatters: Single callable or list of callables for each agent to format dataset items into prompts
  • args: Instance of IACConfig (optional)
  • train_dataset: Training dataset (required)
  • eval_dataset: Evaluation dataset (optional)
  • model_config: Model configuration dict (optional)
  • wandb_config: Configuration for Weights & Biases logging (optional)
  • metrics_callback: Optional callback for custom metrics
  • external_transition: Optional transition function required for multi-turn training
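
A construction sketch for separate-critic mode; the import paths, reward-function signature, toy dataset, and train() call are assumptions based on the argument list above:

```python
from datasets import Dataset
from transformers import AutoTokenizer

# Assumed import path; adjust to match your CoMLRL installation.
from comlrl.trainers.iac import IACTrainer

actor_name = "Qwen/Qwen2.5-0.5B-Instruct"    # placeholder checkpoint
critic_name = "Qwen/Qwen2.5-0.5B-Instruct"   # critics may differ from actors
tokenizer = AutoTokenizer.from_pretrained(actor_name)

def reward_func(completions, **kwargs):
    # Placeholder: the docs only require a callable returning a list of floats;
    # the exact signature expected by the trainer is an assumption here.
    return [1.0 for _ in completions]

train_dataset = Dataset.from_list([{"prompt": "Write a haiku about the sea."}])

trainer = IACTrainer(
    agent_model=actor_name,                   # homogeneous agents
    critics=[critic_name, critic_name],       # one critic per agent (num_agents=2)
    tokenizer=tokenizer,
    reward_func=reward_func,
    formatters=lambda item: item["prompt"],   # placeholder formatter
    args=separate_config,                     # IACConfig from the previous sketch
    train_dataset=train_dataset,
)
trainer.train()                               # assuming a standard train() entry point
```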

For simplicity, IAC computes the policy gradient using the current policy’s samples without importance sampling or ratio clipping. In shared-critic mode (use_separate_critic=false), value heads are attached to the actor models; do not pass critic_model or critics, as doing so raises an error. Agents in this mode may be homogeneous or heterogeneous; training can be less stable, and value_clip_range applies only here. In separate-critic mode (use_separate_critic=true), pass a critics list with length equal to num_agents, or a single critic_model to be broadcast; critic models may differ from actor models.

The trainer uses a fixed training DataLoader batch size of 1. For num_turns > 1, provide an external_transition and set num_generations=1. Training uses batch gradient descent by default, with train_batch_size equal to rollout_buffer_size.