Welcome to CoMLRL's documentation 👋
Cooperative Multi-LLM Reinforcement Learning (CoMLRL) is an open-source library for training multiple LLMs to collaborate using Multi-Agent Reinforcement Learning (MARL). It provides implementations of various MARL algorithms for LLM collaboration and supports different environments and benchmarks.
## About
“What is LLM collaboration?”
LLM collaboration refers to problems in which LLM agents cooperatively solve tasks in a multi-agent system. Tasks are specified in language and provided to each agent as a prompt, and each agent synchronously generates a response based on its instructions. The set of all agents’ responses jointly forms a solution. Users and external systems may validate the solutions and provide additional requirements or suggestions to the LLMs. Together, these components form the environment for LLM collaboration, whose state may be updated based on the agents’ outputs; the updates are embedded into the prompts for subsequent turns. This process iterates until the task is completed or a turn limit is reached.
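
This interaction can be summarized as a simple loop. The sketch below is illustrative only; the `env` and `agent` interfaces (`reset`, `step`, `generate`) are assumptions made for exposition, not CoMLRL classes.

```python
# Illustrative LLM-collaboration loop; `env` and `agents` are hypothetical
# objects used for exposition, not part of the CoMLRL API.
def run_episode(env, agents, max_turns=4):
    prompts = env.reset()  # one task prompt per agent
    reward = 0.0
    for _ in range(max_turns):
        # Each agent answers its own prompt; the responses jointly form
        # the solution for this turn.
        responses = [agent.generate(p) for agent, p in zip(agents, prompts)]
        # The environment validates the joint solution, updates its state,
        # and embeds the feedback into the next turn's prompts.
        prompts, reward, done = env.step(responses)
        if done:  # task completed or turn limit logic handled by the env
            break
    return reward
```
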
“Why should we fine-tune multi-LLM systems with MARL?”
Many studies have explored LLM-based multi-agent systems that complete tasks with multiple interacting agents. However, the underlying models are typically pretrained separately and are not explicitly optimized for coordination, which limits their performance. In addition, designing effective prompts for such systems remains difficult and poorly understood. Cooperative MARL methods, which optimize a team of agents toward a shared objective, have been studied extensively for years. They are a natural fit for LLM collaboration and motivate us to bring advances from the well-established MARL community to LLM-based MAS.
“What are the benefits of decentralized reasoning?”
Cooperative MARL methods are grounded in the theory of decentralized partially observable Markov decision processes (Dec-POMDPs). Agents execute in a decentralized manner, which has several advantages. Unlike knowledge distillation, pruning, or quantization, decentralized execution accelerates LLM inference without incurring information loss. Moreover, decentralization reduces the computational and memory burden of maintaining long-context dependencies and conducting joint decision-making within a single model. By assigning specific subtasks to individual agents, the system achieves more modular, efficient, and lightweight reasoning. In addition, effective cooperation among small local language models offers a safe and cost-efficient solution for offline and edge intelligence.
“Does CoMLRL support single-agent fine-tuning?”
Yes! The simplest way is to set `num_agents=1` in your trainer. However, because we omit single-agent-specific optimizations to keep multi-agent training simple, the trainers may not be optimal for the single-agent case.
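
A minimal sketch of what this could look like is shown below; the trainer name, import path, and every argument other than `num_agents` are assumptions for illustration rather than the exact CoMLRL API, so check the trainer reference for the actual signature.

```python
# Hypothetical single-agent setup -- the import path and all arguments
# except `num_agents` are assumptions, not the documented CoMLRL API.
from comlrl.trainers import MAGRPOTrainer  # assumed import path


def reward_fn(prompt, response):
    # Placeholder task reward; replace with your own scoring logic.
    return float(len(response.strip()) > 0)


trainer = MAGRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # any small instruct model (assumption)
    num_agents=1,                        # degenerate single-agent case
    reward_fn=reward_fn,
)
trainer.train()
```
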
“Does CoMLRL support advanced multi-agent methods at test time?”
No. This library focuses primarily on optimizing LLM collaboration with MARL; designing multi-agent test-time interactions is outside its scope. Users can refer to AutoGen, langroid, or MARTI for that purpose.
“Does CoMLRL support self-play/self-improvement/self-evolution via MARL?”
Yes! Although we focus on LLM collaboration formalized as a Dec-POMDP, users can still customize the interactions with the environment to implement pipelines such as self-play (Spiral) and self-improvement (MAFT). Refer to our multi-turn training documentation for more details.
“Does CoMLRL support distributed training?”
Not yet. We are currently focusing on CTDE (centralized training with decentralized execution) to prove the concept of training small-scale LLMs with cooperative MARL. Resource-intensive distributed training, with its slower and more complex gradient accumulation, will be open-sourced in the near future.
## Features
- MARL trainers to optimize LLM collaboration (see the configuration sketch after this list):
    - Multi-Agent REINFORCE: Critic-free policy gradient methods, including MAREINFORCE, MAGRPO, MARLOO, MAREMAX.
        - Aligned individual response joint with `joint_mode='align'`.
        - Memory-efficient cross joint with `joint_mode='cross'`.
    - Multi-Agent PPO: Critic-based policy gradient methods, including IPPO.
        - Canonical IPPO with a separate critic with `use_separate_critic=True`.
        - Memory-efficient critic with a value head over the actor with `use_separate_critic=False`.
- Environments that simulate real-world tasks for training and evaluating LLM collaboration:
    - Writing Collaboration: Multiple LLM agents collaborate on processing articles.
    - Code Generation: Generate code solutions for programming problems.
        - MBPP - Mostly Basic Python Problems.
        - HumanEval - Handwritten evaluation problems.
        - CoopHumanEval - HumanEval with a cooperative nature.
    - Code Completion: Complete code snippets based on given contexts.
        - ClassEval - Complete class-level code based on method stubs and docstrings.
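
The sketch below illustrates how the trainer options above might be configured. The class names, import paths, and the `model` argument are assumptions for illustration; only `num_agents`, `joint_mode`, and `use_separate_critic` come from the feature list, so consult the trainer reference for the actual API.

```python
# Hypothetical configuration sketch -- class names and import paths are
# assumptions; only `joint_mode` and `use_separate_critic` are taken from
# the feature list above.
from comlrl.trainers import MAGRPOTrainer, IPPOTrainer  # assumed paths

# Critic-free trainer: joint responses are formed either by aligning
# individual responses index-by-index ('align') or by the memory-efficient
# cross joint ('cross').
magrpo = MAGRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # any HF causal LM (assumption)
    num_agents=2,
    joint_mode='align',
)

# Critic-based trainer: either a canonical separate critic network, or a
# memory-efficient value head attached to the actor.
ippo = IPPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    num_agents=2,
    use_separate_critic=True,
)
```
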
