PPO is a widely used policy gradient method that employs generalized advantage estimation (GAE) to estimate advantages, reducing the high variance and long rollout times of Monte Carlo methods such as REINFORCE. PPO is also widely used for LLM fine-tuning, e.g., in trl, verl, and LLaMA Factory.
IPPO
Independent PPO (IPPO) optimizes each agent's policy independently while using joint returns from multiple agents. Each agent maintains its own actor and critic, and the other agents are treated as part of the environment. The policy objective for agent \( i \) is:

\[
J(\theta_i) = \mathbb{E}_t\!\left[ \hat{A}_{i,t} \, \log \pi_{\theta_i}(a_{i,t} \mid h_{i,t}) \right] + \beta \, \mathcal{H}(\pi_{\theta_i}),
\qquad
\hat{A}_{i,t} = \sum_{l \ge 0} (\gamma \lambda)^{l} \, \delta_{i,t+l},
\]
where \( \delta_{i,t} = r_{i,t} + \gamma V_{\phi_i}(h_{i,t+1}) - V_{\phi_i}(h_{i,t}) \) is the temporal difference error, \( \gamma \) is the discount factor, \( \lambda \) is the GAE parameter that balances bias and variance, and \( \mathcal{H}(\pi_{\theta_i}) \) is the entropy bonus with coefficient \( \beta \).
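As a concrete reference, the sketch below computes GAE advantages for one agent's trajectory directly from the TD-error recursion above. The function and tensor names are illustrative, not CoMLRL's internal API.

```python
import torch

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one agent's trajectory.

    rewards: tensor of shape (T,)   -- r_{i,t}
    values:  tensor of shape (T+1,) -- V_{phi_i}(h_{i,t}), with a bootstrap value at index T
    Returns advantages of shape (T,) and value targets (returns) of shape (T,).
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # delta_{i,t} = r_{i,t} + gamma * V(h_{i,t+1}) - V(h_{i,t})
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_{i,t} = delta_{i,t} + gamma * lambda * A_{i,t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + values[:-1]
    return advantages, returns
```

The returned `returns` tensor would serve as the regression target for the critic, while `advantages` weights the policy gradient.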
CoMLRL supports two critic architectures for IPPO:
Separate Critic: Uses an independent model dedicated to value estimation, completely separate from the actor. It provides more stable training at the cost of longer training time and higher VRAM usage.
Value Head: Attaches a small value prediction head directly to the actor model, sharing the base model's representations. This reduces VRAM usage, but since the actor and critic share the same parameters, gradient errors from either objective can be amplified during training.
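The value-head variant can be pictured as a small MLP that maps the actor's last hidden states to scalar values. The class below is only an illustrative sketch of that idea under assumed names, not CoMLRL's actual implementation.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

class ActorWithValueHead(nn.Module):
    """Illustrative actor + value-head wrapper (not CoMLRL's internal class)."""

    def __init__(self, model_name: str, value_head_hidden_dim: int = 1024):
        super().__init__()
        self.actor = AutoModelForCausalLM.from_pretrained(model_name)
        hidden_size = self.actor.config.hidden_size
        # Small MLP head that shares the actor's representations.
        self.value_head = nn.Sequential(
            nn.Linear(hidden_size, value_head_hidden_dim),
            nn.Tanh(),
            nn.Linear(value_head_hidden_dim, 1),
        )

    def forward(self, input_ids, attention_mask=None):
        out = self.actor(
            input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        last_hidden = out.hidden_states[-1]                 # (batch, seq, hidden)
        values = self.value_head(last_hidden).squeeze(-1)   # (batch, seq)
        return out.logits, values
```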
IPPOConfig provides the following parameters for configuring PPO training:
- output_dir: Directory to save outputs
- actor_learning_rate: Learning rate for actor
- critic_learning_rate: Learning rate for critic
- weight_decay: Weight decay for AdamW optimizer
- adam_beta1, adam_beta2, adam_epsilon: Adam optimizer parameters
- max_grad_norm: Maximum gradient norm for clipping
- rollout_buffer_size: Number of samples to collect before update
- mini_batch_size: Mini-batch size for PPO updates
- ppo_epochs: Number of optimization epochs per rollout
- value_clip_range: Clipping range for value function
- value_loss_coef: Coefficient for value loss
- entropy_coef: Coefficient for entropy bonus
- advantage_normalization: Whether to normalize advantages
- max_new_tokens: Maximum new tokens to generate
- temperature: Temperature for sampling
- top_p: Top-p for nucleus sampling
- top_k: Top-k for sampling
- do_sample: Whether to use sampling
- num_train_epochs: Number of training epochs
- per_device_train_batch_size: Batch size per device, must be 1
- use_separate_critic: Whether to use separate critic model
- critic_model_name_or_path: Model identifier for separate critic
- critic_value_head_hidden_dim: Hidden dimension for critic value head
- value_head_hidden_dim: Hidden dimension for actor value head
- num_agents: Number of agents
- num_turns: Number of turns, currently only supports 1
- reward_norm_eps: Epsilon for reward normalization
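A typical configuration might look like the sketch below. The import path is an assumption based on the package name, and all hyperparameter values are placeholders, not recommendations.

```python
# Assumed import path; adjust to match your CoMLRL installation.
from comlrl.trainers.ippo import IPPOConfig

config = IPPOConfig(
    output_dir="./ippo_outputs",
    actor_learning_rate=1e-6,
    critic_learning_rate=5e-6,
    rollout_buffer_size=16,
    mini_batch_size=4,
    ppo_epochs=2,
    value_loss_coef=0.5,
    entropy_coef=0.01,
    advantage_normalization=True,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    num_train_epochs=1,
    per_device_train_batch_size=1,   # required by the trainer
    use_separate_critic=False,       # use a value head on the actor instead
    value_head_hidden_dim=1024,
    num_agents=2,
    num_turns=1,                     # only single-turn training is supported
)
```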
IPPOTrainer trains agents using Independent PPO and accepts the following arguments:
- model: Model string or PreTrainedModel instance (required; for single-agent either works, for multi-agent it must be a string)
- tokenizer: The tokenizer (required)
- reward_func: Callable that returns a list of floats (required)
- reward_processor: Optional processor to apply to rewards
- formatters: Single callable or list of callables, one per agent, to format dataset items into prompts
- args: Instance of IPPOConfig (optional)
- train_dataset: Training dataset (required)
- eval_dataset: Evaluation dataset (optional)
- model_config: Model configuration dict (optional)
- wandb_config: Configuration for Weights & Biases logging (optional)
- metrics_callback: Optional callback for custom metrics
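Putting the pieces together, a minimal training script might look like the sketch below. The import path, the reward function signature, the placeholder model and dataset, and the trainer.train() call are assumptions for illustration rather than documented API details.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed import path; adjust to match your CoMLRL installation.
from comlrl.trainers.ippo import IPPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"   # placeholder model identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)

def reward_func(prompts, completions, **kwargs):
    # Toy reward that prefers shorter joint outputs; the exact signature
    # expected by CoMLRL is an assumption here.
    return [-float(len(c)) for c in completions]

def formatter(example):
    # Map a dataset item to the prompt shown to an agent.
    return f"Solve the following task:\n{example['question']}"

train_dataset = load_dataset("gsm8k", "main", split="train[:100]")  # placeholder dataset

trainer = IPPOTrainer(
    model=model_name,                    # a string, so it also works for multi-agent setups
    tokenizer=tokenizer,
    reward_func=reward_func,
    formatters=[formatter, formatter],   # one formatter per agent (num_agents=2)
    args=config,                         # IPPOConfig instance from the previous sketch
    train_dataset=train_dataset,
)
trainer.train()                          # assumed training entry point
```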
CoMLRL implements on-policy IPPO, which computes the policy gradient using the current policy’s samples without importance sampling or ratio clipping.
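To make the distinction concrete, the snippet below contrasts a plain on-policy policy-gradient loss with the clipped surrogate of standard PPO; the tensor names are illustrative and this is not CoMLRL's internal code.

```python
import torch

def on_policy_pg_loss(log_probs, advantages):
    # Plain policy gradient on current-policy samples: -E[A_t * log pi(a_t | h_t)]
    return -(advantages * log_probs).mean()

def clipped_ppo_loss(log_probs, old_log_probs, advantages, clip_range=0.2):
    # Standard PPO clipped surrogate, shown only for contrast.
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_range, 1 + clip_range) * advantages
    return -torch.min(unclipped, clipped).mean()
```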
The trainer enforces per_device_train_batch_size=1 and currently supports only single-turn training (num_turns=1).