Actor-Critic methods are widely used policy-gradient approaches that employ generalized advantage estimation (GAE) to compute advantages, reducing the high variance and long rollout times of Monte Carlo methods such as REINFORCE. Many LLM fine-tuning frameworks implement actor-critic training (e.g., trl, verl, LLaMA Factory).
IAC
Independent Actor-Critic (IAC) optimizes each agent's policy independently while using joint returns from multiple agents. Each agent maintains its own actor and critic; the other agents are treated as part of the environment. The policy objective for agent \( i \) is:

\[
J(\theta_i) = \mathbb{E}_t\left[ \delta_{i,t} \, \log \pi_{\theta_i}(a_{i,t} \mid h_{i,t}) \right] + \beta \, \mathcal{H}(\pi_{\theta_i}),
\]

where \( \delta_{i,t} = r_{i,t} + \gamma V_{\phi_i}(h_{i,t+1}) - V_{\phi_i}(h_{i,t}) \) is the (single-step) temporal-difference error, \( \gamma \) is the discount factor, and \( \mathcal{H}(\pi_{\theta_i}) \) is the entropy bonus with coefficient \( \beta \).
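To make the update concrete, here is a minimal PyTorch-style sketch of the per-agent loss, assuming `log_probs`, `values`, `next_values`, `rewards`, and `entropy` tensors have already been gathered from a rollout; the function and argument names are illustrative, not CoMLRL's internals:

```python
import torch

def iac_agent_loss(log_probs, values, next_values, rewards,
                   gamma=0.99, value_loss_coef=0.5, entropy_coef=0.01,
                   entropy=None):
    """Single actor-critic loss for one agent (illustrative sketch).

    log_probs:   log pi_theta_i(a_{i,t} | h_{i,t}) of the sampled completions
    values:      V_phi_i(h_{i,t}) from this agent's critic
    next_values: V_phi_i(h_{i,t+1}); zero at terminal steps
    rewards:     r_{i,t} returned by the reward function
    """
    # Single-step TD error used as the advantage estimate.
    td_error = rewards + gamma * next_values - values

    # Policy term: current-policy samples, no importance ratio or clipping.
    policy_loss = -(td_error.detach() * log_probs).mean()

    # Critic regression toward the TD target.
    value_loss = td_error.pow(2).mean()

    # Optional entropy bonus to encourage exploration.
    entropy_bonus = entropy.mean() if entropy is not None else torch.tensor(0.0)

    return policy_loss + value_loss_coef * value_loss - entropy_coef * entropy_bonus
```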
CoMLRL supports two critic architectures for IAC:

- Separate Critic: an independent model dedicated to value estimation, completely separate from the actor. This is more stable to train but costs additional training time and VRAM.
- Shared Model: a small value-prediction head attached directly to the actor's transformer backbone, sharing its representations to reduce time and memory costs (see the sketch below).
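As an illustration of the shared-model variant, the sketch below attaches a small value head to a Hugging Face causal LM and reads the value from the final token's hidden state; the class name and head architecture are assumptions, not CoMLRL's actual implementation:

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

class ActorWithValueHead(nn.Module):
    """Causal LM actor with a value head that shares the backbone (illustrative)."""

    def __init__(self, model_name: str, value_head_hidden_dim: int = 512):
        super().__init__()
        self.lm = AutoModelForCausalLM.from_pretrained(model_name)
        hidden_size = self.lm.config.hidden_size
        # Small MLP mapping the backbone's hidden state to a scalar value.
        self.value_head = nn.Sequential(
            nn.Linear(hidden_size, value_head_hidden_dim),
            nn.Tanh(),
            nn.Linear(value_head_hidden_dim, 1),
        )

    def forward(self, input_ids, attention_mask=None):
        outputs = self.lm(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        last_hidden = outputs.hidden_states[-1]          # (batch, seq, hidden)
        values = self.value_head(last_hidden[:, -1, :])  # value of each sequence
        return outputs.logits, values.squeeze(-1)
```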
IACConfig provides parameters for configuring Independent Actor-Critic training:
- `output_dir`: Directory to save outputs
- `actor_learning_rate`: Learning rate for actor
- `critic_learning_rate`: Learning rate for critic
- `weight_decay`: Weight decay for AdamW optimizer
- `adam_beta1`, `adam_beta2`, `adam_epsilon`: Adam optimizer parameters
- `max_grad_norm`: Maximum gradient norm for clipping
- `rollout_buffer_size`: Number of samples to collect before update
- `mini_batch_size`: Mini-batch size for policy updates
- `value_clip_range`: Clipping range for value function
- `value_loss_coef`: Coefficient for value loss
- `entropy_coef`: Coefficient for entropy bonus
- `advantage_normalization`: Whether to normalize advantages
- `max_new_tokens`: Maximum new tokens to generate
- `temperature`: Temperature for sampling
- `top_p`: Top-p for nucleus sampling
- `top_k`: Top-k for sampling
- `do_sample`: Whether to use sampling
- `num_train_epochs`: Number of training epochs
- `per_device_train_batch_size`: Batch size per device, must be 1
- `use_separate_critic`: Whether to use a separate critic model
- `critic_model_name_or_path`: Model identifier for separate critic
- `critic_value_head_hidden_dim`: Hidden dimension for critic value head
- `value_head_hidden_dim`: Hidden dimension for actor value head
- `num_agents`: Number of agents
- `num_turns`: Number of turns
- `discount`: Discount factor for multi-turn returns
- `early_termination_threshold`: Optional early-stop threshold for multi-turn
- `eval_interval`: Evaluation interval (in training batches)
- `eval_num_samples`: Number of evaluation samples per interval
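A hedged instantiation example: only the field names come from the list above, while the import path and the particular values are assumptions made for illustration:

```python
# Import path is an assumption; adjust to where IACConfig lives in your install.
from comlrl.trainers.iac import IACConfig

config = IACConfig(
    output_dir="./iac_outputs",
    actor_learning_rate=1e-5,
    critic_learning_rate=5e-5,
    rollout_buffer_size=16,
    mini_batch_size=4,
    value_loss_coef=0.5,
    entropy_coef=0.01,
    advantage_normalization=True,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
    per_device_train_batch_size=1,  # the trainer enforces a batch size of 1
    use_separate_critic=False,      # use the shared value head variant
    value_head_hidden_dim=512,
    num_agents=2,
    num_turns=1,
)
```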
IACTrainer trains agents using Independent Actor-Critic:
- `model`: Model string or PreTrainedModel instance (required for single-agent; must be a string for multi-agent)
- `tokenizer`: The tokenizer (required)
- `reward_func`: Callable that returns a list of floats (required)
- `reward_processor`: Optional processor to apply to rewards
- `formatters`: Single callable, or a list of callables (one per agent), to format dataset items into prompts
- `args`: Instance of `IACConfig` (optional)
- `train_dataset`: Training dataset (required)
- `eval_dataset`: Evaluation dataset (optional)
- `model_config`: Model configuration dict (optional)
- `wandb_config`: Configuration for Weights & Biases logging (optional)
- `metrics_callback`: Optional callback for custom metrics
- `external_transition`: Optional transition function required for multi-turn training
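A minimal wiring sketch under the same assumptions (the import path, the reward function's argument names, and the `train()` call are illustrative, not confirmed API details):

```python
from datasets import Dataset
from transformers import AutoTokenizer
# Import path is an assumption; adjust to your CoMLRL installation.
from comlrl.trainers.iac import IACTrainer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
train_dataset = Dataset.from_list([
    {"problem": "Write a function that adds two numbers."},
    {"problem": "Write a function that reverses a string."},
])

def formatter(item):
    # Turns one dataset item into the agent's prompt.
    return f"Solve the following task:\n{item['problem']}"

def reward_func(completions, **kwargs):
    # Placeholder reward: one float per completion (argument names are assumed).
    return [1.0 if "def " in c else 0.0 for c in completions]

trainer = IACTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # string identifier
    tokenizer=tokenizer,
    reward_func=reward_func,
    formatters=[formatter, formatter],   # one formatter per agent
    args=config,                         # the IACConfig instance from above
    train_dataset=train_dataset,
)
trainer.train()  # method name assumed to follow the usual Trainer convention
```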
For simplicity, IAC computes the policy gradient using the current policy’s samples without importance sampling or ratio clipping.
The trainer enforces `per_device_train_batch_size=1`. For `num_turns > 1`, provide an `external_transition` and set `num_return_sequences=1`.
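This excerpt does not specify the `external_transition` signature, so the following is purely a hypothetical illustration of a transition that feeds each agent's previous completion into the next turn's prompt; the argument names and return type are assumptions:

```python
def external_transition(prompts, completions, turn):
    """Hypothetical multi-turn transition: build the next turn's prompts.

    prompts:     prompts used at the current turn (one per agent)
    completions: completions produced at the current turn (one per agent)
    turn:        zero-based index of the turn that just finished
    """
    return [
        f"{p}\n\nPrevious attempt:\n{c}\n\nRevise your answer."
        for p, c in zip(prompts, completions)
    ]
```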
MAAC
Multi-Agent Actor-Critic (MAAC) shares a centralized critic across agents. The policy objective mirrors IAC, but with a joint value baseline:

\[
J(\theta_i) = \mathbb{E}_t\left[ \delta_t \, \log \pi_{\theta_i}(a_{i,t} \mid h_{i,t}) \right] + \beta \, \mathcal{H}(\pi_{\theta_i}),
\]

where \( \delta_t = r_t + \gamma V_{\phi}(\mathbf{h}_{t+1}) - V_{\phi}(\mathbf{h}_{t}) \) is the temporal-difference error computed by the shared critic on the joint prompt/history \( \mathbf{h}_t \), and \( \beta \) is the entropy coefficient.
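The key difference from IAC is that a single TD error, computed by the shared critic on the joint history, drives every agent's policy update. A minimal sketch, with illustrative names that are not CoMLRL's internals:

```python
def maac_policy_losses(agent_log_probs, shared_value, shared_next_value, reward,
                       gamma=0.99):
    """Policy losses for all agents from one shared (centralized) critic.

    agent_log_probs:   list of log pi_theta_i(a_{i,t} | h_{i,t}), one tensor per agent
    shared_value:      V_phi(h_t) evaluated on the joint prompt/history
    shared_next_value: V_phi(h_{t+1}); zero at terminal steps
    reward:            joint reward r_t shared by all agents
    """
    # One TD error from the centralized critic, reused by every agent.
    delta_t = reward + gamma * shared_next_value - shared_value
    return [-(delta_t.detach() * lp).mean() for lp in agent_log_probs]
```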
MAACConfig parameters:
- `output_dir`: Directory to save outputs
- `actor_learning_rate`: Learning rate for actors
- `critic_learning_rate`: Learning rate for shared critic
- `weight_decay`: Weight decay for AdamW
- `adam_beta1`, `adam_beta2`, `adam_epsilon`: Adam optimizer parameters
- `max_grad_norm`: Gradient clipping norm
- `rollout_buffer_size`: Number of samples to collect per agent before an update
- `mini_batch_size`: Mini-batch size within each update
- `value_loss_coef`: Weight on critic loss
- `entropy_coef`: Entropy bonus coefficient
- `advantage_normalization`: Whether to normalize advantages before updates
- `max_new_tokens`: Maximum tokens to generate per completion
- `temperature`, `top_p`, `top_k`, `do_sample`: Sampling parameters
- `num_train_epochs`: Number of training epochs
- `per_device_train_batch_size`: Must be 1
- `pad_token_id`: Padding token id
- `num_agents`: Number of actors
- `num_return_sequences`: Number of generations per prompt per agent
- `critic_model_name_or_path`: Required identifier for the shared critic
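A hedged MAACConfig example mirroring the IAC one above; the import path and values are assumptions, and only the field names come from the list:

```python
# Import path is an assumption; adjust to your CoMLRL installation.
from comlrl.trainers.maac import MAACConfig

maac_config = MAACConfig(
    output_dir="./maac_outputs",
    actor_learning_rate=1e-5,
    critic_learning_rate=5e-5,
    rollout_buffer_size=16,
    mini_batch_size=4,
    value_loss_coef=0.5,
    entropy_coef=0.01,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
    per_device_train_batch_size=1,  # must be 1
    num_agents=2,
    num_return_sequences=2,
    critic_model_name_or_path="Qwen/Qwen2.5-0.5B",  # shared critic is required
)
```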
MAACTrainer setup:
- `model`: Actor model identifier/string (required)
- `tokenizer`: Tokenizer (required)
- `reward_func`: Callable returning rewards (required)
- `reward_processor`: Optional reward post-processor
- `formatters`: Single callable or list for per-agent prompt formatting
- `args`: Instance of `MAACConfig` (optional)
- `train_dataset`: Training dataset (required)
- `eval_dataset`: Optional evaluation dataset
- `model_config`: Extra model kwargs (optional)
- `wandb_config`: Weights & Biases logging config (optional)
- `metrics_callback`: Optional callback for custom metrics
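A corresponding wiring sketch, reusing the `reward_func` and `formatter` defined in the IAC example above; the import path and the `train()` call are assumptions:

```python
from transformers import AutoTokenizer
# Import path is an assumption; adjust to your CoMLRL installation.
from comlrl.trainers.maac import MAACTrainer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

trainer = MAACTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    tokenizer=tokenizer,
    reward_func=reward_func,            # callable sketched in the IAC example
    formatters=[formatter, formatter],  # one formatter per agent
    args=maac_config,                   # the MAACConfig instance from above
    train_dataset=train_dataset,
)
trainer.train()  # method name assumed to follow the usual Trainer convention
```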