Actor-Critic methods are widely used policy gradient approaches that estimate advantages with a learned critic (e.g., via generalized advantage estimation), reducing the high variance and long rollout times of Monte Carlo methods such as REINFORCE. Many LLM fine-tuning frameworks implement actor-critic training (e.g., trl, verl, LLaMA Factory).

IAC#

Independent Actor-Critic (IAC) optimizes each agent’s policy independently while using joint returns from multiple agents. Each agent maintains its own actor and critic; the other agents are treated as part of the environment. The policy objective is:

\[ J(\theta_i) = \mathbb{E}_{o_{i,0} \sim \mathcal{D}, h_i \sim \pi_{\theta_i}}\left[\log \pi_{\theta_i}(a_{i,t}|h_{i,t}) \cdot \delta_{i,t} + \beta \mathcal{H}(\pi_{\theta_i})\right] \]

where \( \delta_{i,t} = r_{i,t} + \gamma V_{\phi_i}(h_{i,t+1}) - V_{\phi_i}(h_{i,t}) \) is the (single-step) temporal difference error, \( \gamma \) is the discount factor, and \( \mathcal{H}(\pi_{\theta_i}) \) is the entropy bonus with coefficient \( \beta \).
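
To make the update concrete, below is a minimal PyTorch-style sketch of the per-transition losses implied by this objective; the function and variable names are illustrative, not CoMLRL’s internal API.

```python
import torch

# Illustrative per-transition IAC losses for one agent (an assumption
# about the implementation, not CoMLRL internals). `logprob` is the
# summed token log-probability of the sampled completion, `v_t`/`v_next`
# come from the agent's critic V_phi_i, and `entropy` is H(pi_theta_i).
def iac_losses(logprob: torch.Tensor, entropy: torch.Tensor,
               v_t: torch.Tensor, v_next: torch.Tensor,
               reward: float, gamma: float = 0.99, beta: float = 0.01):
    # Single-step TD error: delta = r + gamma * V(h_{t+1}) - V(h_t).
    delta = reward + gamma * v_next.detach() - v_t.detach()
    # The objective is maximized, so negate it to get a descent loss.
    policy_loss = -(logprob * delta + beta * entropy)
    # The critic regresses V(h_t) toward the one-step TD target.
    value_loss = (reward + gamma * v_next.detach() - v_t).pow(2)
    return policy_loss, value_loss
```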

CoMLRL supports two IAC architectures for critic implementation:

  • Separate Critic: Uses an independent model dedicated to value estimation, completely separate from the actor. It provides more stable training but requires more training time and VRAM.

  • Shared Model: Attaches a small value prediction head directly to the transformer backbone, sharing the actor model’s representations to reduce the time and space costs (see the sketch below).
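
As a sketch, the shared-model variant can be pictured as a small MLP head over the backbone’s hidden states. The module below is an assumption about the architecture, not CoMLRL’s exact implementation; its width corresponds to value_head_hidden_dim in the configuration.

```python
import torch
import torch.nn as nn

# Sketch of a value head attached to a shared transformer backbone
# (illustrative; not CoMLRL's exact module).
class ValueHead(nn.Module):
    def __init__(self, hidden_size: int, head_hidden_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, head_hidden_dim),
            nn.Tanh(),
            nn.Linear(head_hidden_dim, 1),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Score the last token's hidden state as V(h_t).
        return self.mlp(hidden_states[:, -1, :]).squeeze(-1)
```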

IACConfig provides parameters for configuring Independent Actor-Critic training (a usage sketch follows the list):

  • output_dir: Directory to save outputs
  • actor_learning_rate: Learning rate for actor
  • critic_learning_rate: Learning rate for critic
  • weight_decay: Weight decay for AdamW optimizer
  • adam_beta1, adam_beta2, adam_epsilon: Adam optimizer parameters
  • max_grad_norm: Maximum gradient norm for clipping
  • rollout_buffer_size: Number of samples to collect before update
  • mini_batch_size: Mini-batch size for policy updates
  • value_clip_range: Clipping range for value function
  • value_loss_coef: Coefficient for value loss
  • entropy_coef: Coefficient for entropy bonus
  • advantage_normalization: Whether to normalize advantages
  • max_new_tokens: Maximum new tokens to generate
  • temperature: Temperature for sampling
  • top_p: Top-p for nucleus sampling
  • top_k: Top-k for sampling
  • do_sample: Whether to use sampling
  • num_train_epochs: Number of training epochs
  • per_device_train_batch_size: Batch size per device, must be 1
  • use_separate_critic: Whether to use separate critic model
  • critic_model_name_or_path: Model identifier for separate critic
  • critic_value_head_hidden_dim: Hidden dimension for critic value head
  • value_head_hidden_dim: Hidden dimension for actor value head
  • num_agents: Number of agents
  • num_turns: Number of turns
  • discount: Discount factor for multi-turn returns
  • early_termination_threshold: Optional early-stop threshold for multi-turn
  • eval_interval: Evaluation interval (in training batches)
  • eval_num_samples: Number of evaluation samples per interval
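
A minimal configuration sketch: the field names come from the list above, while the import path and hyperparameter values are assumptions.

```python
from comlrl.trainers.iac import IACConfig  # import path is an assumption

config = IACConfig(
    output_dir="./iac_outputs",
    actor_learning_rate=1e-5,
    critic_learning_rate=5e-5,
    rollout_buffer_size=16,
    mini_batch_size=4,
    entropy_coef=0.01,
    max_new_tokens=256,
    per_device_train_batch_size=1,  # the trainer enforces a batch size of 1
    use_separate_critic=False,      # share the actor backbone via a value head
    value_head_hidden_dim=256,
    num_agents=2,
    num_turns=1,
)
```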

IACTrainer trains agents using Independent Actor-Critic (a construction sketch follows the notes below):

  • model: Model identifier string or PreTrainedModel instance (required; must be a string for multi-agent training)
  • tokenizer: The tokenizer (required)
  • reward_func: Callable that returns a list of floats (required)
  • reward_processor: Optional processor to apply to rewards
  • formatters: Single callable or a list of per-agent callables that format dataset items into prompts
  • args: Instance of IACConfig (optional)
  • train_dataset: Training dataset (required)
  • eval_dataset: Evaluation dataset (optional)
  • model_config: Model configuration dict (optional)
  • wandb_config: Configuration for Weights & Biases logging (optional)
  • metrics_callback: Optional callback for custom metrics
  • external_transition: Optional transition function required for multi-turn training

For simplicity, IAC computes the policy gradient using the current policy’s samples without importance sampling or ratio clipping.

The trainer enforces per_device_train_batch_size=1. For num_turns > 1, provide an external_transition and set num_return_sequences=1.
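
Putting it together, here is a construction sketch. The import path, reward-function signature, dataset, and the .train() entry point are assumptions for illustration; the keyword arguments come from the list above.

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from comlrl.trainers.iac import IACTrainer  # import path is an assumption

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

def reward_func(completions, **kwargs):
    # Toy reward returning one float per completion; replace with a real scorer.
    return [float("def " in c) for c in completions]

def formatter(item):
    # Map a dataset item to a prompt string; the "prompt" key is illustrative.
    return item["prompt"]

trainer = IACTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # a string is required for multi-agent
    tokenizer=tokenizer,
    reward_func=reward_func,
    formatters=[formatter, formatter],   # one formatter per agent
    args=config,
    train_dataset=load_dataset("openai/openai_humaneval", split="test"),
)
trainer.train()  # assuming a Trainer-style entry point
```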

MAAC#

Multi-Agent Actor-Critic (MAAC) shares a centralized critic across agents. The policy objective mirrors IAC with a joint value baseline:

\[ J(\theta_i) = \mathbb{E}_{\mathbf{h}_t \sim \mathcal{D},\, \mathbf{a}_t \sim \pi_{\theta}}\left[\log \pi_{\theta_i}(a_{i,t}|h_{i,t}) \cdot \delta_t + \beta \mathcal{H}(\pi_{\theta_i})\right] \]

where \( \delta_t = r_t + \gamma V_{\phi}(\mathbf{h}_{t+1}) - V_{\phi}(\mathbf{h}_{t}) \) uses the shared critic on the joint prompt/history, and \( \beta \) is the entropy coefficient.
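
The centralized baseline can be sketched as scoring a concatenation of all agents’ histories with the shared critic; the joining scheme and helper below are assumptions, not CoMLRL’s exact formatting.

```python
# Sketch of V_phi over the joint history h_t = (h_{1,t}, ..., h_{n,t});
# the separator and the critic's value-head call are assumptions.
def joint_value(critic, tokenizer, agent_histories):
    joint = "\n\n".join(agent_histories)           # build the joint prompt/history
    inputs = tokenizer(joint, return_tensors="pt")
    return critic(**inputs)                        # scalar V_phi(h_t)
```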

MAACConfig parameters (a usage sketch follows the list):

  • output_dir: Directory to save outputs
  • actor_learning_rate: Learning rate for actors
  • critic_learning_rate: Learning rate for shared critic
  • weight_decay: Weight decay for AdamW
  • adam_beta1, adam_beta2, adam_epsilon: Adam optimizer parameters
  • max_grad_norm: Gradient clipping norm
  • rollout_buffer_size: Number of samples to collect per agent before an update
  • mini_batch_size: Mini-batch size within each update
  • value_loss_coef: Weight on critic loss
  • entropy_coef: Entropy bonus coefficient
  • advantage_normalization: Whether to normalize advantages before updates
  • max_new_tokens: Maximum tokens to generate per completion
  • temperature, top_p, top_k, do_sample: Sampling parameters
  • num_train_epochs: Number of training epochs
  • per_device_train_batch_size: Must be 1
  • pad_token_id: Padding token id
  • num_agents: Number of actors
  • num_return_sequences: Number of generations per prompt per agent
  • critic_model_name_or_path: Required identifier for the shared critic
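
A configuration sketch mirroring the IAC example; the import path and values are assumptions, the field names come from the list above.

```python
from comlrl.trainers.maac import MAACConfig  # import path is an assumption

config = MAACConfig(
    output_dir="./maac_outputs",
    actor_learning_rate=1e-5,
    critic_learning_rate=5e-5,
    num_agents=2,
    num_return_sequences=2,                         # generations per prompt per agent
    critic_model_name_or_path="Qwen/Qwen2.5-0.5B",  # the shared critic is required
    per_device_train_batch_size=1,
)
```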

MAACTrainer setup (a construction sketch follows the list):

  • model: Actor model identifier/string (required)
  • tokenizer: Tokenizer (required)
  • reward_func: Callable returning rewards (required)
  • reward_processor: Optional reward post-processor
  • formatters: Single callable or list for per-agent prompt formatting
  • args: Instance of MAACConfig (optional)
  • train_dataset: Training dataset (required)
  • eval_dataset: Optional evaluation dataset
  • model_config: Extra model kwargs (optional)
  • wandb_config: Weights & Biases logging config (optional)
  • metrics_callback: Optional callback for custom metrics
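
A matching construction sketch, reusing the tokenizer, reward_func, and formatter from the IAC example above and carrying the same hedges (import path and .train() entry point are assumptions):

```python
from comlrl.trainers.maac import MAACTrainer  # import path is an assumption

trainer = MAACTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    tokenizer=tokenizer,
    reward_func=reward_func,            # callable returning a list of floats
    formatters=[formatter, formatter],  # per-agent prompt formatting
    args=config,
    train_dataset=train_dataset,
)
trainer.train()  # assuming a Trainer-style entry point
```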