Actor-Critic (AC) methods are widely used policy gradient methods that employ critics to facilitate training. AC methods can achieve lower variance and better sample efficiency than REINFORCE, but this requires careful design and tuning of the critic to ensure stable training. In Multi-Agent Reinforcement Learning (MARL), Actor-Critic methods can be instantiated as Multi-Agent Actor-Critic (MAAC) and Independent Actor-Critic (IAC).
## MAAC
Multi-Agent Actor-Critic (MAAC) uses a Centralized Critic (CC) shared across agents to evaluate the values of joint histories \( V_{\boldsymbol{\phi}}(\mathbf{h}_t) \) or joint history-action pairs \( Q_{\boldsymbol{\psi}}(\mathbf{h}_t, \mathbf{a}_t) \). The policy gradient of each agent is:

\[
\nabla_{\theta_i} J(\theta_i) = \mathbb{E}\left[ \boldsymbol{\delta}_t \, \nabla_{\theta_i} \log \pi_{\theta_i}(a_{i,t} \mid h_{i,t}) \right]
\]

where \( \boldsymbol{\delta}_t = r_t + \gamma V_{\boldsymbol{\phi}}(\mathbf{h}_{t+1}) - V_{\boldsymbol{\phi}}(\mathbf{h}_{t}) \), and the critic is updated by minimizing the squared TD error:

\[
\mathcal{L}(\boldsymbol{\phi}) = \mathbb{E}\left[ \left( r_t + \gamma V_{\boldsymbol{\phi}}(\mathbf{h}_{t+1}) - V_{\boldsymbol{\phi}}(\mathbf{h}_{t}) \right)^2 \right]
\]
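A minimal PyTorch sketch of these two updates, assuming the centralized critic exposes scalar values for the joint history (tensor names and shapes are illustrative, not CoMLRL internals):

```python
import torch

def maac_losses(log_probs, values, next_values, rewards, gamma=0.99):
    """Actor and critic losses for one batch of transitions.

    log_probs:   log-probability of each agent's sampled action, shape (B,)
    values:      centralized critic value V(h_t), shape (B,)
    next_values: centralized critic value V(h_{t+1}), shape (B,)
    rewards:     shared reward r_t, shape (B,)
    """
    # TD error delta_t = r_t + gamma * V(h_{t+1}) - V(h_t); the bootstrap target
    # is detached so only the critic loss propagates gradients into the critic.
    td_target = rewards + gamma * next_values.detach()
    delta = td_target - values

    actor_loss = -(delta.detach() * log_probs).mean()  # policy gradient term
    critic_loss = delta.pow(2).mean()                   # squared TD error
    return actor_loss, critic_loss
```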
`MAACConfig` parameters:

- `num_agents`: Number of agents
- `num_turns`: Number of turns
- `critic_type`: Critic target type (`v` for V(h), `q` for Q(h, a))
- `num_train_epochs`: Number of training epochs
- `agent_learning_rate`: Learning rate for agents
- `critic_learning_rate`: Learning rate for the shared critic
- `value_loss_coef`: Weight on the critic loss
- `advantage_normalization`: Whether to normalize advantages before updates
- `rollout_buffer_size`: Number of samples to collect per agent before an update
- `train_batch_size`: Mini-batch size within each update
- `max_new_tokens`: Maximum tokens to generate per completion
- `temperature`: Temperature for sampling
- `top_p`: Top-p for nucleus sampling
- `top_k`: Top-k for sampling
- `num_generations`: Number of generations per prompt per agent
- `external_prompt_passthrough`: Use external prompts directly in multi-turn
- `discount`: Discount factor for multi-turn returns
- `early_termination_threshold`: Optional early-stop threshold for multi-turn
- `eval_interval`: Evaluation interval (in training batches)
- `eval_num_samples`: Number of evaluation samples per interval
- `eval_batch_size`: Eval dataloader batch size
- `logging_steps`: Log every N training batches
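A configuration sketch using only the parameters listed above; the import path and the specific values are assumptions, not library defaults:

```python
# Import path is an assumption; adjust to your CoMLRL installation.
from comlrl.trainers.maac import MAACConfig

config = MAACConfig(
    num_agents=2,
    num_turns=1,
    critic_type="v",              # "v" for V(h), "q" for Q(h, a)
    num_train_epochs=3,
    agent_learning_rate=1e-5,
    critic_learning_rate=5e-5,
    value_loss_coef=0.5,
    advantage_normalization=True,
    rollout_buffer_size=16,
    train_batch_size=16,          # batch gradient descent: equal to rollout_buffer_size
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    num_generations=1,
    discount=0.99,
    eval_interval=10,
    logging_steps=1,
)
```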
`MAACTrainer` setup:

- `agent_model` or `agents`: Actor model identifier string for homogeneous agents, or a list of agent models (multi-agent `agent_model` must be a string)
- `critic_model` or `critics`: Required single shared critic (either one identifier or a 1-element list)
- `tokenizer`: Tokenizer (required)
- `reward_func`: Callable returning rewards (required)
- `reward_processor`: Optional reward post-processor
- `formatters`: Single callable or list for per-agent prompt formatting
- `args`: Instance of `MAACConfig` (optional)
- `train_dataset`: Training dataset (required)
- `eval_dataset`: Optional evaluation dataset
- `model_config`: Extra model kwargs (optional)
- `wandb_config`: Weights & Biases logging config (optional)
- `metrics_callback`: Optional callback for custom metrics
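A setup sketch that reuses the `config` from the previous example. The import path, model names, reward-function argument structure, and the `train()` entry point are all assumptions:

```python
from datasets import Dataset
from transformers import AutoTokenizer
from comlrl.trainers.maac import MAACTrainer  # assumed import path

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
train_dataset = Dataset.from_dict({"prompt": ["Draft a plan.", "Summarize the task."]})

def reward_func(prompts, completions, **kwargs):
    # Toy shared reward: 1.0 when every agent produced a non-empty completion.
    # Replace with a task-specific scorer; the exact arguments CoMLRL passes
    # to reward_func are an assumption here.
    return [float(all(len(c) > 0 for c in joint)) for joint in completions]

trainer = MAACTrainer(
    agent_model="Qwen/Qwen2.5-1.5B-Instruct",   # homogeneous actors
    critic_model="Qwen/Qwen2.5-1.5B-Instruct",  # single shared centralized critic
    tokenizer=tokenizer,
    reward_func=reward_func,
    formatters=lambda item: item["prompt"],
    args=config,
    train_dataset=train_dataset,
)
trainer.train()  # assumed entry point
```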
For simplicity, MAAC computes the policy gradient using the current policy's samples, without importance sampling or ratio clipping. `value_clip_range` is not applicable in MAAC.
The trainer uses a fixed training DataLoader batch size of 1. For `num_turns > 1`, provide an `external_transition` and set `num_generations=1`. Training uses batch gradient descent by default, where `train_batch_size = rollout_buffer_size`.
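A multi-turn sketch. The transition function below is hypothetical: the exact signature CoMLRL expects for `external_transition` may differ, so treat it as an illustration of the idea (the next turn's prompts are built from the previous turn's outputs):

```python
# Hypothetical transition: maps the previous turn's prompts and joint completions
# to the next turn's prompts.
def external_transition(prompts, completions, turn):
    return [f"{p}\n\nPrevious round: {joint}" for p, joint in zip(prompts, completions)]

multi_turn_args = MAACConfig(
    num_agents=2,
    num_turns=3,        # multi-turn rollout
    num_generations=1,  # required when num_turns > 1
    discount=0.99,      # discount applied across turns
)
# Pass external_transition=external_transition when constructing the trainer.
```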
## IAC
Independent Actor-Critic (IAC) optimizes each agent's policy independently while using joint returns from multiple agents. Each agent maintains its own actor and critic; from each agent's perspective, the other agents are part of the environment. The policy gradient for each agent is:

\[
\nabla_{\theta_i} J(\theta_i) = \mathbb{E}\left[ \delta_{i,t} \, \nabla_{\theta_i} \log \pi_{\theta_i}(a_{i,t} \mid h_{i,t}) \right]
\]

where \( \delta_{i,t} = r_t + \gamma V_{\phi_i}(h_{i,t+1}) - V_{\phi_i}(h_{i,t}) \).
CoMLRL supports two IAC variants:
- Separate Critic: Uses an independent model for value estimation, separate from the actor. This is more stable to train but requires additional storage and VRAM.
- Shared Model: Attaches a small value head to the actor's transformer backbone, reusing the actor's history (or history-action) representations to reduce memory costs (see the sketch below).
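A minimal sketch of the shared-model idea: a small MLP value head on top of the backbone's last hidden state. This is illustrative only, not CoMLRL's internal implementation:

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

class ActorWithValueHead(nn.Module):
    """Causal LM actor with a small value head on the final hidden state."""

    def __init__(self, model_name, value_head_hidden_dim=256):
        super().__init__()
        self.backbone = AutoModelForCausalLM.from_pretrained(model_name)
        hidden_size = self.backbone.config.hidden_size
        self.value_head = nn.Sequential(
            nn.Linear(hidden_size, value_head_hidden_dim),
            nn.Tanh(),
            nn.Linear(value_head_hidden_dim, 1),
        )

    def forward(self, input_ids, attention_mask=None):
        out = self.backbone(input_ids, attention_mask=attention_mask,
                            output_hidden_states=True)
        last_hidden = out.hidden_states[-1]          # (B, T, H)
        value = self.value_head(last_hidden[:, -1])  # value from the final token
        return out.logits, value.squeeze(-1)
```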
The critics are updated by minimizing the TD error:

\[
\mathcal{L}(\phi_i) = \mathbb{E}\left[ \left( r_t + \gamma V_{\phi_i}(h_{i,t+1}) - V_{\phi_i}(h_{i,t}) \right)^2 \right]
\]
When using the shared model (`use_separate_critic=false`), value clipping via `value_clip_range` can be applied to improve training stability.
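A hedged sketch of how a clipped value loss typically works (PPO-style clipping; whether CoMLRL uses exactly this form is an assumption):

```python
import torch

def clipped_value_loss(values, old_values, returns, value_clip_range=0.2):
    # Keep the new value prediction within value_clip_range of the old one,
    # then take the elementwise maximum of the clipped and unclipped squared errors.
    values_clipped = old_values + (values - old_values).clamp(
        -value_clip_range, value_clip_range
    )
    loss_unclipped = (values - returns).pow(2)
    loss_clipped = (values_clipped - returns).pow(2)
    return torch.max(loss_unclipped, loss_clipped).mean()
```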
`IACConfig` provides parameters for configuring Independent Actor-Critic training:

- `num_agents`: Number of agents
- `num_turns`: Number of turns
- `num_train_epochs`: Number of training epochs
- `agent_learning_rate`: Learning rate for agents
- `critic_learning_rate`: Learning rate for critics
- `value_loss_coef`: Coefficient for the value loss
- `value_clip_range`: Clipping range for the value function
- `advantage_normalization`: Whether to normalize advantages
- `rollout_buffer_size`: Number of samples to collect before an update
- `train_batch_size`: Mini-batch size for policy updates
- `max_new_tokens`: Maximum new tokens to generate
- `temperature`: Temperature for sampling
- `top_p`: Top-p for nucleus sampling
- `top_k`: Top-k for sampling
- `num_generations`: Number of generations per prompt per agent
- `use_separate_critic`: Whether to use a separate critic model
- `critic_type`: Critic target type (`v` for V(h), `q` for Q(h, a))
- `critic_value_head_hidden_dim`: Hidden dimension for the critic value head
- `value_head_hidden_dim`: Hidden dimension for the value head in shared-critic mode
- `external_prompt_passthrough`: Use external prompts directly in multi-turn
- `discount`: Discount factor for multi-turn returns
- `early_termination_threshold`: Optional early-stop threshold for multi-turn
- `eval_interval`: Evaluation interval (in training batches)
- `eval_num_samples`: Number of evaluation samples per interval
- `eval_batch_size`: Eval dataloader batch size
- `logging_steps`: Log every N training batches
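A configuration sketch for shared-critic mode; the import path and values are assumptions:

```python
# Import path is an assumption; adjust to your CoMLRL installation.
from comlrl.trainers.iac import IACConfig

iac_config = IACConfig(
    num_agents=2,
    critic_type="v",
    use_separate_critic=False,   # shared model: value head attached to each actor
    value_head_hidden_dim=256,
    value_clip_range=0.2,        # only used in shared-critic mode
    agent_learning_rate=1e-5,
    critic_learning_rate=5e-5,
    value_loss_coef=0.5,
    advantage_normalization=True,
    rollout_buffer_size=16,
    train_batch_size=16,
    max_new_tokens=256,
    temperature=0.7,
    num_generations=1,
)
```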
`IACTrainer` trains agents using Independent Actor-Critic:

- `agent_model` or `agents`: Model identifier string for homogeneous agents, or a list of agent models (multi-agent `agent_model` must be a string)
- `critic_model` or `critics`: Critic identifier or list of critic models when `use_separate_critic=true`
- `tokenizer`: The tokenizer (required)
- `reward_func`: Callable that returns a list of floats (required)
- `reward_processor`: Optional processor to apply to rewards
- `formatters`: Single callable or list of callables, one per agent, to format dataset items into prompts
- `args`: Instance of `IACConfig` (optional)
- `train_dataset`: Training dataset (required)
- `eval_dataset`: Evaluation dataset (optional)
- `model_config`: Model configuration dict (optional)
- `wandb_config`: Configuration for Weights & Biases logging (optional)
- `metrics_callback`: Optional callback for custom metrics
- `external_transition`: Optional transition function, required for multi-turn training
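A separate-critic setup sketch, reusing the `tokenizer`, `reward_func`, and `train_dataset` from the MAAC example above. Import path, model names, and the `train()` entry point are assumptions:

```python
from comlrl.trainers.iac import IACTrainer  # assumed import path

iac_trainer = IACTrainer(
    agent_model="Qwen/Qwen2.5-1.5B-Instruct",           # homogeneous actors
    critics=["Qwen/Qwen2.5-0.5B-Instruct"] * 2,         # one separate critic per agent
    tokenizer=tokenizer,
    reward_func=reward_func,
    formatters=[lambda item: "Plan: " + item["prompt"],   # per-agent prompt formatting
                lambda item: "Solve: " + item["prompt"]],
    args=IACConfig(num_agents=2, use_separate_critic=True, critic_type="v"),
    train_dataset=train_dataset,
)
iac_trainer.train()  # assumed entry point
```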
For simplicity, IAC computes the policy gradient using the current policy's samples, without importance sampling or ratio clipping. In shared-critic mode (`use_separate_critic=false`), value heads are attached to the actor models (do not pass `critic_model`/`critics`; passing them raises an error), and agents may be homogeneous or heterogeneous. This mode can be less stable, and `value_clip_range` only applies here. In separate-critic mode (`use_separate_critic=true`), pass a `critics` list with length equal to `num_agents`, or a single `critic_model` to be broadcast; critic models may differ from the actor models.
The trainer uses a fixed training DataLoader batch size of 1. For `num_turns > 1`, provide an `external_transition` and set `num_generations=1`. Training uses batch gradient descent by default, where `train_batch_size = rollout_buffer_size`.