Research Statement: Towards Generalizable MARL
Dongsu Lee
UT Austin
August 2025
TL;DR: Machine learning has achieved remarkable progress by aligning data, compute, and training paradigms. In contrast, control, especially control learned through reinforcement learning (RL), has not had its "GPT moment." This is because control is inherently a multi-agent problem, not a single-agent one. To build AI that functions naturally in complex societies, we need multi-agent reinforcement learning (MARL) that generalizes. This requires three capabilities: (i) capturing joint distributions, (ii) learning rich representations for multi-agent systems, and (iii) achieving rapid role compatibility and adaptation. Below, I outline why control lags, what the core bottlenecks are, and how a research roadmap can address them.
Why control lags where learning thrives
Machine learning has consistently advanced when three ingredients align: large datasets, scalable compute, and new training paradigms. The success of large language models (LLMs), such as GPT, Gemini, and Grok, illustrates this convergence. Similar progress is evident in multimodal models that integrate text, vision, and audio into coherent generative systems.
Control, however, tells a different story. Despite decades of research, learning‑based control has not experienced a transformative leap. As noted by Sergey Levine in The Promise of Generalist Robotic Policies, scaling control demands not only larger models but also diverse, cross‑embodiment data and foundation‑style training. While recent efforts such as RT‑X and \(\pi_0\) move in that direction, the field remains far from practical, general‑purpose deployment.
The limitation is structural. As Silver and Sutton argue in Welcome to the Era of Experience, true progress in RL will not come from larger datasets alone, but from agents that learn through interaction and adaptation. The Second Half of AI perspective similarly notes that benchmark-driven RL has failed to transfer its gains to real-world impact. These insights converge on a central point: the bottleneck is not scale but structure. Control problems are inherently multi-agent, partially observable, and dynamically coupled.
The core bottleneck: From single‑agent to multi‑agent generalization
In single‑agent RL, generalization has been achieved through self‑play and representation learning. Landmark systems such as AlphaGo, AlphaZero, and MuZero demonstrate that self‑play combined with complete observability can yield transferable policies.
Complementary advances in representation learning, e.g., successor features, forward-backward representations, bisimulation metrics, quasimetric and Hilbert-space representations, temporal distances, and contrastive learning, show that better state abstractions enable stronger generalization across related tasks.
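As one concrete instance of this family, successor features separate long-term dynamics from reward. Stated here in the standard general form (not tied to any single cited paper):

\[
\psi^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, \phi(s_t, a_t, s_{t+1}) \,\middle|\, s_0 = s,\; a_0 = a\right],
\qquad
Q^{\pi}_{w}(s,a) = \psi^{\pi}(s,a)^{\top} w,
\]

so any task whose reward is (approximately) linear in the features, \(r = \phi^{\top} w\), only requires re-estimating the weight vector \(w\) rather than relearning the policy's long-term behavior.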
However, extending these principles to multi-agent settings is non-trivial. In multi-agent RL (MARL), the environment is partially observable and non-stationary; interactions grow combinatorially with the number of agents; and agents must infer roles, relationships, and shared dynamics. Existing methods often model one of these aspects in isolation, but few integrate them into a unified representation that captures the relational and structural factors driving coordination. I argue that such an omni-representation, a scalable embedding of multi-agent structure, is the missing piece in modern learning-based control. Solving this challenge would enable systems that reason jointly about perception, communication, and cooperation, bridging robotics, simulation, and adaptive decision-making.
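To make the structural claim concrete, the setting can be written as a partially observable stochastic game (a Dec-POMDP in the fully cooperative case), in standard notation:

\[
\mathcal{G} = \big\langle \mathcal{N},\, \mathcal{S},\, \{\mathcal{A}^i\}_{i \in \mathcal{N}},\, P,\, \{R^i\}_{i \in \mathcal{N}},\, \{\Omega^i\}_{i \in \mathcal{N}},\, O,\, \gamma \big\rangle,
\]

where each agent \(i\) acts on a private observation \(o^i \sim O(\cdot \mid s, i)\) rather than the state \(s\). From agent \(i\)'s perspective, the other agents' evolving policies \(\pi^{-i}\) are folded into the effective transition dynamics, which is precisely why the environment appears non-stationary even when \(P\) itself is fixed.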
A research roadmap for practical MARL
- Off-policy algorithms for multi-agent systems: Current MARL methods are predominantly on-policy (e.g., MAPPO). Off-policy learning would permit broader dataset reuse and better sample efficiency. We must first understand why existing methods fail under off-policy training and identify principled modifications that stabilize learning (a minimal critic sketch follows this list).
- Balancing expressiveness and efficiency: Capturing complex inter-agent distributions often requires heavy computation. Techniques inspired by shortcut diffusion, flow Q-learning, or mean flow can approximate iterative sampling in a single step, yielding fast yet expressive inference (sketched after this list).
- Transferring representation learning to MARL: Hierarchical representation learning can provide a foundation: lower-level embeddings model each agent's MDP, while higher-level representations encode inter-agent relations. This structure can unify local reasoning and global coordination (see the encoder sketch after this list).
- Constructing omni-representations: A unified model should encode agent states, interactions, and shared world dynamics within a scalable latent space. Dual-encoder or dual-decoder pipelines, inspired by world models, can combine multiple information streams into coherent joint representations.
- Rapid role alignment and adaptation: We must identify mechanisms that enable agents to align roles and adapt policies quickly in unseen settings. Once an omni‑representation exists, rapid adaptation becomes the critical step toward real‑world compatibility.
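The sketches below are minimal illustrations of the roadmap items above, not implementations of any published method or existing codebase; all module, buffer, and variable names are placeholders.

First, the off-policy item. Assuming a replay buffer of joint transitions, a shared global state, and per-agent target policies, a centralized critic can be regressed onto a one-step TD target built from the joint next action:

```python
import torch
import torch.nn as nn

class JointCritic(nn.Module):
    """Centralized critic Q(s, a_1, ..., a_n) over a global state and the joint action."""
    def __init__(self, state_dim, act_dim, n_agents, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_agents * act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, joint_action):
        return self.net(torch.cat([state, joint_action], dim=-1)).squeeze(-1)


def td_target(batch, target_critic, target_policies, gamma=0.99):
    """One-step TD target: the next joint action comes from per-agent target policies."""
    with torch.no_grad():
        next_actions = [pi(batch["next_obs"][:, i]) for i, pi in enumerate(target_policies)]
        next_q = target_critic(batch["next_state"], torch.cat(next_actions, dim=-1))
        return batch["reward"] + gamma * (1.0 - batch["done"]) * next_q


def critic_loss(batch, critic, target_critic, target_policies):
    """Regress the critic onto the TD target using replayed joint transitions."""
    q = critic(batch["state"], batch["joint_action"])
    return nn.functional.mse_loss(q, td_target(batch, target_critic, target_policies))
```

The instability questions raised above live largely inside `td_target`: as the other agents' policies drift, targets computed from stale replayed transitions can become inconsistent with the current joint behavior.

Second, the expressiveness-versus-efficiency item. A step-size-conditioned velocity field, in the spirit of shortcut and mean-flow models, can be integrated with many small Euler steps or queried once with a large step; the conditioning scheme here is a hypothetical sketch, not a specific published architecture:

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """v(a_t, t, d | obs): velocity from noise toward an action, conditioned on the
    current time t and the step size d it is expected to be integrated with."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, a_t, t, d):
        return self.net(torch.cat([obs, a_t, t, d], dim=-1))


@torch.no_grad()
def sample_actions(v, obs, act_dim, n_steps=1):
    """Euler integration from Gaussian noise to an action; n_steps=1 is the one-step case."""
    batch = obs.shape[0]
    a = torch.randn(batch, act_dim, device=obs.device)
    d = torch.full((batch, 1), 1.0 / n_steps, device=obs.device)
    for k in range(n_steps):
        t = torch.full((batch, 1), k / n_steps, device=obs.device)
        a = a + d * v(obs, a, t, d)
    return a
```

Third, the hierarchical and omni-representation items. One plausible shape is a two-level encoder: a per-agent encoder embeds each agent's local view (the lower level), and attention across the agent axis produces relation-aware embeddings plus a pooled team summary (the higher level) that a world-model or dual-decoder head could consume:

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, obs_dim, embed_dim=128, n_heads=4):
        super().__init__()
        # Lower level: embed each agent's local observation independently.
        self.agent_encoder = nn.Sequential(
            nn.Linear(obs_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        # Higher level: self-attention across the agent axis captures inter-agent relations.
        self.relation = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

    def forward(self, obs):
        # obs: [batch, n_agents, obs_dim]
        local = self.agent_encoder(obs)                     # per-agent embeddings
        relational, attn = self.relation(local, local, local)
        joint = relational.mean(dim=1)                      # pooled team-level summary
        return local, relational, joint
```

A dual-encoder variant of the omni-representation bullet could pair this encoder with a second encoder over shared or global context and train both against a joint latent-dynamics objective.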
Outlook
The path to practical, general‑purpose AI agents will not emerge from scaling single‑agent systems. It will require multi‑agent learning that understands, represents, and coordinates within complex, dynamic environments. Achieving this is not just an algorithmic challenge. It is a conceptual one. The next breakthrough in AI control will come from understanding the structure of interaction itself.
Acknowledgement
I would like to thank Daehee Lee, Yaru Niu, and Amy Zhang for their helpful feedback on this post.