Unsupervised agent discovery is the ability to identify and model intentional agents from raw perceptual data without explicit supervision. While neurocognitive theories propose different neural mechanisms for agent perception, including mirror neurons and the superior temporal sulcus (STS), we still lack computational algorithms that can fully describe agent perception. Existing computational models of agent perception operate on simplified symbolic inputs rather than the raw perceptual data that biological systems process.
We introduce a variational objective ($\mathcal{L}_{\text{VAD}}$) that formulates vision-based agent discovery as structured inference over latent actions. Based on $\mathcal{L}_{\text{VAD}}$, we implement the Variational Agent Discovery (VAD) model, a deep conditional slot-based variational autoencoder. Our model learns internal agent representations directly from raw pixel-based observations, outperforming baselines on predictive tasks including agent action and goal inference in three video-game settings.
VAD's internal representations generalize robustly to novel agents and environmental configurations, demonstrating up to a 33% advantage in transfer scenarios. The VAD model exhibits predictive capabilities analogous to those observed in infant cognition studies, correctly predicting that agents will take efficient paths to goals when environmental constraints change. Analysis of learned representations reveals functional decomposition of visual scenes along agent-centric lines, with certain learned features exhibiting activation patterns analogous to human mirror neurons across different agents performing the same actions. When incorporated as an auxiliary loss in multi-agent reinforcement learning, our $\mathcal{L}_{\text{VAD}}$ objective improves sample efficiency by 21.8% and final performance by 7.6%.
Our remarkable capacity to distinguish agents from non-agents based on minimal visual information is a fundamental aspect of human cognition. When we observe a sequence of frames from Heider and Simmel's classic 1944 experiment, consisting of simple geometric shapes—two triangles, a circle, and a rectangular box—most of us intuitively identify the triangles and circle as intentional agents while perceiving the rectangular walls as inanimate objects. This phenomenon demonstrates that our perception of agency does not require biological features like faces or limbs.
When we observe moving entities, our brains analyze motion patterns, identifying signatures of agency. Self-propelled movement, sudden changes in direction, and contingent reactivity to other entities trigger neural processing that leads us to perceive intention beyond mere motion. Agency detection engages visual processing mechanisms triggered by specific motion cues that violate expected physical dynamics. This perception also allows us to attribute goals to entities based on their movement patterns.
Despite this foundational role in human cognition, artificial intelligence models struggle with perceiving agency from visual input. While modern deep learning approaches have successfully modeled physical object interactions, scene representations, and object recognition, these models lack an integrated understanding of agency. They may identify people or animals in scenes and label their actions based on visual features, but cannot fundamentally distinguish between an entity acting with intention and one merely following physical dynamics.
Building on the foundations of agent perception from cognitive science and object-centric representation learning from machine learning, we now formalize the problem of unsupervised agent discovery from visual observations.
We situate our problem in the setting of a Partially Observable Multi-Agent Markov Decision Process (POMDP), defined as $(S, \{A_i\}_{i=1}^N, T, \{R_i\}_{i=1}^N, \{\mathbf{X}_i\}_{i=1}^N, \gamma)$. Within this formalism, we operate under observational constraints: we access only a sequence of rendered visual observations $\mathbf{X} = \{x_0, x_1, ..., x_T\}$, where each $x_t \in \mathbb{R}^{H \times W \times C}$ represents an image frame containing multiple entities. The agents' actions, rewards, and individual observations remain unobserved; all information must be inferred from the pixel-level visual data.
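Concretely, the training data can be thought of as nothing more than frame sequences. A minimal, purely illustrative container might look like the following, with actions and rewards deliberately absent:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class VisualEpisode:
    """Observation-only training data: a sequence of rendered frames x_0..x_T.
    The underlying multi-agent POMDP also has per-agent actions and rewards,
    but these are never available to the model (illustrative container only)."""
    frames: np.ndarray  # shape (T + 1, H, W, C)
```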
Object-centric representation learning, particularly slot attention mechanisms, provides our computational foundation for decomposing these complex scenes. These approaches factor visual observations into a set of slot representations $\{\tilde{s}_t^1, \tilde{s}_t^2, ..., \tilde{s}_t^K\}$, where each $\tilde{s}_t^k$ is assumed to encode a distinct entity in the scene. When extended to video sequences as in SAVi, these models can be optimized with maximum likelihood to predict the next frame:
\[ \max_\theta \log p(x_{t+1} | \{\tilde{s}_t^1, \tilde{s}_t^2, ..., \tilde{s}_t^K\}) \approx -\|x_{t+1} - \text{Decoder}(f_{\text{dyn}}(\{\tilde{s}_t^1, \tilde{s}_t^2, ..., \tilde{s}_t^K\}))\|^2 \]
This approximation maps the probabilistic objective to a deterministic reconstruction loss, where the negative squared error term corresponds to the log-likelihood under a Gaussian observation model with fixed variance. While effective for general scene decomposition, SAVi treats all entities uniformly. We introduce a variational inference perspective that explicitly models actions as latent variables, creating an inductive bias toward identifying entities whose transitions reflect agency rather than merely physical/environmental dynamics.
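For reference, this baseline next-frame objective amounts to a simple reconstruction loss over predicted slots. The sketch below uses placeholder encoder, dynamics, and decoder modules rather than the exact SAVi architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NextFramePredictor(nn.Module):
    """SAVi-style next-frame prediction objective (illustrative sketch).
    The encoder, dynamics, and decoder are stand-in modules, not the exact
    architecture used in the paper."""

    def __init__(self, encoder: nn.Module, dynamics: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder    # frame x_t -> K slot vectors (B, K, D)
        self.dynamics = dynamics  # f_dyn: slots at t -> predicted slots at t+1
        self.decoder = decoder    # slots -> reconstructed frame (B, C, H, W)

    def loss(self, x_t: torch.Tensor, x_tp1: torch.Tensor) -> torch.Tensor:
        slots_t = self.encoder(x_t)
        slots_tp1_hat = self.dynamics(slots_t)
        x_tp1_hat = self.decoder(slots_tp1_hat)
        # Squared error corresponds to the negative Gaussian log-likelihood
        # with fixed variance, matching the approximation above.
        return F.mse_loss(x_tp1_hat, x_tp1)
```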
Our approach formalizes agent discovery as a structured probabilistic inference problem with latent variables. Unlike standard object-centric models that directly predict state transitions, we model these transitions as consequences of agent decisions—actions sampled from internal policies.
We make the following assumptions about our problem: First, agents follow coherent policies that can be modeled as conditional distributions $p(a_i|\tilde{s}_i)$ over actions given states. Second, slot attention mechanisms provide sufficiently structured representations for entity-centric modeling. Third, observed transitions between states are generated by unobservable actions that must be inferred from dynamics. Fourth, these dynamics arise from agents executing actions according to their policies, treating observed data as samples from a process where agent decisions drive environmental dynamics.
In our slot-based representation framework, the state of slot $i$ at time $t$ is denoted as $\tilde{s}_t^i$. We explicitly model the transition probability $p(\tilde{s}_{t+1}^i | \tilde{s}_t^1, \tilde{s}_t^2, ..., \tilde{s}_t^K)$ while accounting for a latent action variable $a_i$ that potentially drives this transition. The true transition probability involves marginalizing over all possible actions:
\[ p(\tilde{s}_{t+1}^i | \tilde{\mathbf{s}}_t) = \int p(\tilde{s}_{t+1}^i, a_i | \tilde{\mathbf{s}}_t) \, da_i \]
where $\tilde{\mathbf{s}}_t = \{\tilde{s}_t^1, \tilde{s}_t^2, ..., \tilde{s}_t^K\}$ represents the complete set of slots at time $t$. This marginalization is computationally intractable, especially since we do not have direct access to the joint distribution $p(\tilde{s}_{t+1}^i, a_i | \tilde{\mathbf{s}}_t)$. By applying the chain rule of probability, we can decompose this joint distribution into more familiar quantities, and then employ variational inference to approximate the marginalization efficiently.
To formalize our approach mathematically, we begin with the log marginal likelihood for a single slot transition and derive a tractable evidence lower bound (ELBO). We introduce a variational posterior $q(a_i | \tilde{s}_t^i, \tilde{s}_{t+1}^i)$ to approximate the true posterior distribution over actions—an inverse model that infers the actions most likely to have generated observed transitions.
Starting with the log marginal likelihood:
\[ \log p(\tilde{s}_{t+1}^i | \tilde{\mathbf{s}}_t) = \log \int p(\tilde{s}_{t+1}^i, a_i | \tilde{\mathbf{s}}_t) \, da_i \]
We apply the standard variational inference trick:
\[ \log p(\tilde{s}_{t+1}^i | \tilde{\mathbf{s}}_t) = \log \int q(a_i | \tilde{s}_t^i, \tilde{s}_{t+1}^i) \frac{p(\tilde{s}_{t+1}^i, a_i | \tilde{\mathbf{s}}_t)}{q(a_i | \tilde{s}_t^i, \tilde{s}_{t+1}^i)} \, da_i \]
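Since the logarithm is concave, Jensen's inequality moves the expectation outside the logarithm and yields a lower bound:

\[ \log p(\tilde{s}_{t+1}^i | \tilde{\mathbf{s}}_t) \geq \mathbb{E}_{q(a_i | \tilde{s}_t^i, \tilde{s}_{t+1}^i)} \left[ \log \frac{p(\tilde{s}_{t+1}^i, a_i | \tilde{\mathbf{s}}_t)}{q(a_i | \tilde{s}_t^i, \tilde{s}_{t+1}^i)} \right] \]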
Next, we expand the joint probability inside this bound using the chain rule, assuming that an agent's action depends primarily on its own state:
\[ p(\tilde{s}_{t+1}^i, a_i | \tilde{\mathbf{s}}_t) = p(\tilde{s}_{t+1}^i | a_i, \tilde{\mathbf{s}}_t) p(a_i | \tilde{s}_t^i) \]
Substituting this factorization into the bound gives the ELBO for each slot:
\[ \mathcal{L}_i(\tilde{\mathbf{s}}_t, \tilde{s}_{t+1}^i; \theta, \phi) = \mathbb{E}_{q_\phi(a_i | \tilde{s}_t^i, \tilde{s}_{t+1}^i)} \left[ \log p_\theta(\tilde{s}_{t+1}^i | a_i, \tilde{\mathbf{s}}_t) \right] - D_{KL}(q_\phi(a_i | \tilde{s}_t^i, \tilde{s}_{t+1}^i) || p_\theta(a_i | \tilde{s}_t^i)) \]
This ELBO decomposes into the following conceptual components:
| Component | Function |
| --- | --- |
| $p_\theta(\tilde{s}_{t+1}^i \mid a_i, \tilde{\mathbf{s}}_t)$ | Forward dynamics model: predicts the next state of slot $i$ given the current slots and the inferred action. It forms the core of our predictive model. |
| $q_\phi(a_i \mid \tilde{s}_t^i, \tilde{s}_{t+1}^i)$ | Inverse action model: infers the action that most likely caused the observed transition between $\tilde{s}_t^i$ and $\tilde{s}_{t+1}^i$. |
| $p_\theta(a_i \mid \tilde{s}_t^i)$ | Agent policy: a conditional distribution over latent action variables given the current entity slot $\tilde{s}_t^i$. |

The two terms of the ELBO act as the corresponding loss terms:

| Loss term | Function |
| --- | --- |
| $\mathbb{E}_{q_\phi(a_i \mid \tilde{s}_t^i, \tilde{s}_{t+1}^i)} \left[ \log p_\theta(\tilde{s}_{t+1}^i \mid a_i, \tilde{\mathbf{s}}_t) \right]$ | Reconstruction term: encourages accurate prediction of the next state given the inferred action. |
| $D_{KL}(q_\phi(a_i \mid \tilde{s}_t^i, \tilde{s}_{t+1}^i) \,\Vert\, p_\theta(a_i \mid \tilde{s}_t^i))$ | Policy regularization: keeps the inferred actions close to the agent's learned policy. |
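To make these components concrete, the following is a minimal per-slot ELBO sketch in PyTorch. It assumes Gaussian latent actions, a fixed-variance Gaussian forward model, and simple mean-pooling over slots for the dynamics context; all module and variable names are illustrative rather than the exact VAD implementation.

```python
import torch
import torch.nn as nn
import torch.distributions as D


class SlotELBO(nn.Module):
    """Per-slot ELBO with a forward dynamics model, inverse action model,
    and slot-conditioned policy prior (illustrative sketch)."""

    def __init__(self, slot_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        # q_phi(a_i | s_t^i, s_{t+1}^i): inverse model over latent actions
        self.inverse = nn.Sequential(
            nn.Linear(2 * slot_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * action_dim))
        # p_theta(a_i | s_t^i): agent policy prior
        self.policy = nn.Sequential(
            nn.Linear(slot_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * action_dim))
        # p_theta(s_{t+1}^i | a_i, s_t): forward dynamics on pooled slot context
        self.forward_dyn = nn.Sequential(
            nn.Linear(slot_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, slot_dim))

    @staticmethod
    def _gaussian(params: torch.Tensor) -> D.Normal:
        mean, log_std = params.chunk(2, dim=-1)
        return D.Normal(mean, log_std.clamp(-5, 2).exp())

    def forward(self, slots_t: torch.Tensor, slot_i_t: torch.Tensor,
                slot_i_tp1: torch.Tensor) -> torch.Tensor:
        """slots_t: (B, K, D) all slots; slot_i_t, slot_i_tp1: (B, D) slot i."""
        q_a = self._gaussian(self.inverse(torch.cat([slot_i_t, slot_i_tp1], -1)))
        p_a = self._gaussian(self.policy(slot_i_t))
        a = q_a.rsample()                                 # reparameterized action sample
        context = slots_t.mean(dim=1)                     # simple pooling over all slots
        pred_tp1 = self.forward_dyn(torch.cat([context, a], -1))
        recon = -((pred_tp1 - slot_i_tp1) ** 2).sum(-1)   # Gaussian log-lik, fixed variance
        kl = D.kl_divergence(q_a, p_a).sum(-1)            # KL(q_phi || p_theta)
        return (recon - kl).mean()                        # ELBO (to be maximized)
```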
To apply this objective to scenes with multiple entities, we sum the ELBO across all slots and combine it with a reconstruction loss to form our full Variational Agent Discovery objective $\mathcal{L}_{\text{VAD}}$:
\[ \mathcal{L}_{\text{VAD}} = \lambda_{\text{recon}} \mathcal{L}_{\text{recon}} + \lambda_{\text{ELBO}} \sum_{i \in \text{slots}} \mathcal{L}_i \]
where $\mathcal{L}_{\text{recon}}$ is a reconstruction loss that encourages accurate prediction of future frames, and $\lambda_{\text{recon}}$ and $\lambda_{\text{ELBO}}$ are hyperparameters that balance the importance of reconstruction versus structured latent action modeling.
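Continuing the sketches above, one possible way to assemble $\mathcal{L}_{\text{VAD}}$ for a batch is shown below. Since the per-slot ELBOs are maximized while the reconstruction loss is minimized, the ELBO sum enters the minimized loss with a negative sign; the $\lambda$ values here are placeholders rather than tuned hyperparameters.

```python
def vad_training_loss(model: NextFramePredictor, slot_elbo: SlotELBO,
                      x_t: torch.Tensor, x_tp1: torch.Tensor,
                      lambda_recon: float = 1.0,
                      lambda_elbo: float = 1.0) -> torch.Tensor:
    """Combined L_VAD for one batch (continuing the earlier sketches)."""
    recon = model.loss(x_t, x_tp1)        # L_recon: next-frame reconstruction
    slots_t = model.encoder(x_t)          # (B, K, D)
    slots_tp1 = model.encoder(x_tp1)      # (B, K, D)
    # Sum the per-slot ELBOs over all K slots; negate because the ELBO is
    # maximized while this loss is minimized.
    elbo = sum(slot_elbo(slots_t, slots_t[:, k], slots_tp1[:, k])
               for k in range(slots_t.shape[1]))
    return lambda_recon * recon - lambda_elbo * elbo
```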
This $\mathcal{L}_{\text{VAD}}$ objective serves as the foundation for our Variational Agent Discovery (VAD) model and can also be used as an auxiliary loss to improve sample efficiency in reinforcement learning.
Our VAD model significantly outperforms baseline object-centric approaches across multiple environments. The model not only achieves higher accuracy in predicting agent actions and goals but also shows remarkable generalization capabilities when presented with novel agents, goals, and environmental configurations.
A fascinating emergent property of our model is the development of mirror-like neural representations within the learned slot vectors. For different directional actions (Left, Right, Up, Down, Stay), we identified specific features that show strong correlations across different agent slots. This behavior parallels biological mirror neurons, which activate both when an animal performs an action and when it observes another agent performing the same action.
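As a rough illustration of how such mirror-like features could be identified (a hedged sketch, not the exact analysis from the thesis), one can collect slot vectors labeled by agent and ground-truth action, then look for feature dimensions whose per-action activation profiles correlate across agents:

```python
import numpy as np


def mirror_like_features(slots: np.ndarray, agent_ids: np.ndarray,
                         action_ids: np.ndarray, top_k: int = 5) -> np.ndarray:
    """slots: (N, D) slot vectors; agent_ids, action_ids: (N,) labels for which
    agent each slot tracks and which action it performed at that step.
    Returns indices of the features whose per-action activation profiles are
    most strongly correlated across two agents (a rough 'mirror-like' criterion).
    Assumes every (agent, action) pair occurs at least once."""
    agents = np.unique(agent_ids)[:2]      # compare the first two agents
    actions = np.unique(action_ids)
    # Mean activation of each feature per (agent, action): shape (2, A, D).
    profiles = np.stack([
        np.stack([slots[(agent_ids == ag) & (action_ids == ac)].mean(axis=0)
                  for ac in actions])
        for ag in agents])
    # Feature-wise correlation between the two agents' action profiles.
    corr = np.array([np.corrcoef(profiles[0, :, d], profiles[1, :, d])[0, 1]
                     for d in range(slots.shape[1])])
    return np.argsort(-corr)[:top_k]
```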
To evaluate our VAD model's capacity to capture agent intentionality, we designed an experiment inspired by Gergely & Csibra's work on teleological reasoning in infants. During training, the model only observed scenarios where an agent jumped over an obstacle to reach a goal. For the test condition, we removed the obstacle and observed the model's predictions without additional training.
Remarkably, our model correctly predicted that the agent would take a direct, efficient path to the goal when the obstacle was removed, despite never having observed this scenario during training. This behavior mirrors the expectations documented in 12-month-old infants, who show increased looking times when agents maintain unnecessary jumping trajectories after obstacles are removed.
When incorporated as an auxiliary objective in reinforcement learning, our $\mathcal{L}_{\text{VAD}}$ loss improves sample efficiency and final performance in multi-agent tasks. The VAD-augmented model achieves higher rewards earlier in training and maintains this advantage throughout, demonstrating the practical utility of our approach for multi-agent systems.
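A minimal sketch of how $\mathcal{L}_{\text{VAD}}$ could enter an RL update as an auxiliary term is shown below; the RL loss and the `aux_coef` coefficient are generic placeholders rather than the exact setup used in our experiments.

```python
import torch


def vad_augmented_update(rl_loss: torch.Tensor, vad_loss: torch.Tensor,
                         optimizer: torch.optim.Optimizer,
                         aux_coef: float = 0.1) -> None:
    """One gradient step for a VAD-augmented RL agent (illustrative).
    rl_loss is the usual policy/value loss from any standard algorithm;
    vad_loss is L_VAD computed on the same batch of observation frames.
    aux_coef is a placeholder coefficient, not a value from the paper."""
    total = rl_loss + aux_coef * vad_loss
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
```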
Our work introduces a variational framework for unsupervised agent discovery that operates directly on visual observations. The VAD model successfully learns agent-centric representations that enable accurate prediction of agent actions and goals, generalize to novel scenarios, and exhibit properties analogous to human cognitive mechanisms for agency perception.
The discovery of mirror-like neural patterns and teleological reasoning capabilities in our model suggests deeper connections between our computational approach and cognitive mechanisms in biological systems. These findings not only advance our understanding of agent perception but also provide practical benefits for multi-agent reinforcement learning systems.
Future directions include extending the model to more complex environments with hierarchical action structures, exploring the development of theory of mind capabilities within the same framework, and investigating how these agent-centric representations might integrate with other aspects of visual cognition such as scene understanding and physical reasoning.
For more details about this research, please refer to our paper and code repository. The complete thesis contains additional experiments, mathematical derivations, and in-depth analysis of the results presented here.
Muppidi, A. (2023). Learning to See Agents with Deep Variational Inference. Senior Thesis, Harvard University, Cambridge, MA.