Fast and Expressive Multi-Agent Coordination

Multi-agent coordination is tricky because joint action distributions are inherently multi-modal. Even in the simplest cooperative settings, the space of "what agents could do together" quickly branches into many possible modes—tight formations, asymmetric roles, fallback behaviors, and everything in between. The same state can lead to wildly different coordinated outcomes depending on subtle cues.

And this complexity puts us in an awkward spot. Most existing approaches force you to choose between two bad options:

  • Capture the full multimodal structure, but suffer painfully slow inference
  • Do inference quickly, but collapse to brittle, poorly coordinated behavior
This trade-off has been baked into multi-agent RL for a few years. But it doesn’t actually have to be.

Figure: A snapshot of the classic trade-off, where faster methods tend to be less expressive while more expressive models come with higher inference cost.

Desiderata for practical multi-agent coordination

Offline MARL brings its own set of challenges. Since you’re learning entirely from fixed demonstrations or logs, the model really needs to capture the full range of behaviors present in the data, which are messy, diverse, and highly multi-modal. If it can’t represent those different coordination patterns, you simply won’t see meaningful teamwork emerge.

But when it comes time to deploy a policy, the requirements flip. Real systems need to act quickly, often within just a few milliseconds. A method that requires dozens of iterative denoising steps, like a diffusion-based policy, just isn’t practical for low-level decision-making in environments where timing matters.

So we end up with this tension: the training phase demands expressiveness, while the deployment phase demands speed. Which naturally leads to a simple but important question:

Do these two goals really have to be tied together in a single learning objective? Or can we learn rich joint behavior first and then convert it into something that runs fast at test time?

The challenges

Most prior systems blend "representing the joint distribution" and "choosing an action quickly" into a single learning objective. That’s the core bottleneck. More precisely, diffusion-based models give us rich, multimodal distributions. They are great! But sampling requires an iterative denoising process, which is terrible for real-time use. On the other hand, Gaussian policies are fast! But they oversimplify joint interactions, so agents often fail to coordinate in complex settings.
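
To make the contrast concrete, here is a minimal sketch of the per-decision cost of the two families. Everything in it (the `denoiser` and `gaussian_head` callables, the step count, the action dimension) is a hypothetical stand-in for illustration, not code from any particular implementation.

```python
import torch

ACT_DIM = 8  # illustrative action dimensionality

def sample_diffusion_action(denoiser, obs, num_steps=50):
    """Iterative denoising: one network call per refinement step."""
    act = torch.randn(obs.shape[0], ACT_DIM)
    for t in reversed(range(num_steps)):
        act = denoiser(obs, act, t)  # ~num_steps forward passes per decision
    return act

def sample_gaussian_action(gaussian_head, obs):
    """Unimodal Gaussian policy: a single forward pass, but only one mode."""
    mean, log_std = gaussian_head(obs)
    return mean + log_std.exp() * torch.randn_like(mean)
```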

The mistake is treating expressive representation and fast per-agent policy as the same object. They don’t have to be.

Learn joint behavior first, then distill it

Here’s the nugget, and my favorite part of this whole framing: learn the joint action distribution in its full, expressive glory, then distill that knowledge into decentralized one-step policies. In other words, front-load the complexity, and make execution cheap. This idea is motivated by seminal research on flow shortcuts.

Figure: MAC-Flow first extracts joint behaviors, which are multi-modal and complex, from offline data via flow matching, and then distills them into fast individual policies using BC (behavior-cloning) distillation.

Thanks to the simple and efficient idea behind flow matching, we can capture the full expressiveness of complex multi-agent behavior without paying a heavy computational price. Flow matching learns a continuous transformation between distributions, giving us a smooth ODE-based mapping from noise to actions. This means we can model multi-modal joint action patterns without the iterative multi-step denoising loop that makes diffusion models slow. Despite being lightweight, the method captures the diverse coordination modes present in offline multi-agent datasets, effectively modeling the joint structure we actually care about.
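
For readers who want to see the mechanics, the training step for such a joint flow can be sketched roughly as below. The `velocity_net` interface and the conditioning on the joint observation are assumptions made for this sketch, not the paper’s exact implementation.

```python
import torch

def flow_matching_loss(velocity_net, joint_obs, joint_act):
    """One conditional flow-matching step on joint actions from the offline data.

    velocity_net(joint_obs, noisy_act, t) -> predicted velocity (an assumed API).
    """
    batch = joint_act.shape[0]
    x0 = torch.randn_like(joint_act)          # noise endpoint of the flow
    t = torch.rand(batch, 1)                  # interpolation time in [0, 1]
    x_t = (1.0 - t) * x0 + t * joint_act      # straight-line interpolant
    target_v = joint_act - x0                 # velocity of that straight line
    pred_v = velocity_net(joint_obs, x_t, t)
    return ((pred_v - target_v) ** 2).mean()
```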

Once this expressive joint policy is learned, the next step is to break it apart in a meaningful way. Through flow distillation, we preserve the multi-modality and structure of the joint model while converting it into fast, one-step policies for each agent. And with the help of the Individual-Global-Max (IGM) principle, these per-agent policies remain aligned with the optimal joint behavior, even after decentralization.
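
A sketch of what this distillation step might look like: integrate the learned joint flow for a few ODE steps to get a "teacher" joint action, then train each agent’s one-step policy to reproduce its slice of that action from its own observation and a noise sample. The module interfaces here are hypothetical, and whatever mechanism the actual method uses to enforce IGM consistency is omitted for brevity.

```python
import torch

def distill_step(teacher_velocity, student_policies, joint_obs,
                 n_agents, act_dim, ode_steps=10):
    """Distill the multi-step joint flow into one-step per-agent policies.

    joint_obs: (batch, n_agents, obs_dim); module interfaces are assumptions.
    """
    batch = joint_obs.shape[0]

    # Teacher: integrate the learned ODE from noise (t=0) to a joint action (t=1).
    x0 = torch.randn(batch, n_agents, act_dim)
    x = x0
    dt = 1.0 / ode_steps
    with torch.no_grad():
        for k in range(ode_steps):
            t = torch.full((batch, 1), k * dt)
            x = x + dt * teacher_velocity(joint_obs, x, t)
    teacher_act = x

    # Students: each agent maps (its own observation, its noise slice)
    # to an action in a single forward pass.
    loss = 0.0
    for i, policy in enumerate(student_policies):
        pred = policy(joint_obs[:, i], x0[:, i])
        loss = loss + ((pred - teacher_act[:, i]) ** 2).mean()
    return loss / n_agents
```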

The end result feels almost counterintuitive: the system understands complex group behavior because it learned it in a rich joint space, yet each agent acts through a lightweight, single-step policy distilled from that knowledge. Expressive during training, efficient at inference.

Exactly the combination we’ve been looking for.

Why this is a good idea and why it matters

By separating these roles, everything becomes simpler.

During training, we allow the model to be as expressive as it needs to be. It can explore the full multimodal structure of the dataset, capture nuanced coordination patterns, and learn behaviors that are hard to represent with simple parametric forms. Nothing about training time demands that we be cheap.

But at deployment, the priorities flip entirely. Real systems care about latency, stability, and predictability. A single-step per-agent policy, which is distilled from a richer joint model, fits that requirement perfectly. You get fast inference, decentralized execution, and robustness in environments where timing matters.
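
Concretely, execution reduces to one forward pass per agent on that agent’s own observation, with no denoising loop. A minimal sketch, reusing the distilled policies from the previous sketch and an illustrative action dimension:

```python
import torch

ACT_DIM = 8  # illustrative

@torch.no_grad()
def act(student_policies, per_agent_obs):
    """Decentralized execution: each agent uses only its own observation."""
    actions = []
    for policy, obs in zip(student_policies, per_agent_obs):
        z = torch.randn(1, ACT_DIM)                   # one noise draw
        actions.append(policy(obs.unsqueeze(0), z))   # single forward pass
    return actions
```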

This matters because coordination isn’t about individual agents behaving intelligently in isolation. It's about the structure of the group. If that structure isn’t learned properly, no amount of local cleverness will fix the behavior. Our approach recognizes this: learn the structure where it’s easiest (the joint space), and execute where it’s fastest (the per-agent policies).

Closing thoughts

I like this perspective because it cleans up the story we tell about multi-agent systems:

  • First, learn the behavior of the group as a whole.
  • Then, convert that knowledge into something each agent can use quickly.
  • Avoid the false trade-off between smart and fast.
If you're interested in designing multi-agent systems that actually scale into real-world settings, I think this separation of representation and execution is one of the cleanest ideas we have.


This blog post is based on Multi-agent Coordination via Flow Matching, supervised by Amy Zhang.