How can agents learn in unknown worlds? As a computer science and neuroscience major, I'm fascinated by how humans adapt rapidly to unfamiliar settings. That fascination with human decision-making and learning is what drew me to deep reinforcement learning (RL).
My undergraduate research has focused on how agents, like humans, can rapidly learn in the unknown. This work includes three completed projects (one resulting in a first-author paper at NeurIPS 2024) and an ongoing effort. My interests lie in online optimization for RL, generative world models, and language-action models. Below are the core questions I've asked within these areas, spanning lifelong RL, multi-agent RL, and Q-learning:
Q1) Can learned policies adapt to unknown tasks in new environments? While modifying a problem set in my reinforcement learning class, I began questioning a standard practice: resetting network weights before training a policy on a new task. What would happen if I trained policies sequentially without resets? I was surprised to observe degraded learning capacity over time when adapting to new tasks. I learned this phenomenon was called loss of plasticity, which inspired my research at the Harvard Computational Robotics Lab with Professor Hank Yang.
From the RL literature, I learned that regularizing a policy toward its initialization can reduce plasticity loss, but it is unclear when, and how strongly, to regularize, especially in lifelong settings. My project's key insight: lifelong RL without task boundaries closely resembles online convex optimization, where loss functions shift arbitrarily. Drawing inspiration from parameter-free online learning, where iterative updates are adaptively regularized toward a "reference point," I reframed lifelong RL as an online optimization problem. In this framework, a "base" optimizer (e.g., Adam or SGD) updates the policy, the reference point is the policy's initialization, and a theoretical meta-tuner from the online learning literature adaptively scales the base optimizer's updates to regularize the policy toward this reference point.
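To make the framing concrete, here is a minimal sketch of the wrapper pattern I have in mind (my own simplification for illustration, not FastTRAC's actual tuner): a base optimizer proposes updates, and a meta-tuner adaptively scales the accumulated update relative to the initialization, which serves as the reference point.

```python
import torch

class RefScaledWrapper:
    """Illustrative sketch only (not FastTRAC's update rule): wrap a base
    optimizer and keep parameters at  ref + s_t * (accumulated base updates),
    where the scale s_t is tuned online from how well the accumulated update
    direction keeps correlating with the negative gradients."""

    def __init__(self, params, base_cls=torch.optim.Adam, lr=3e-4, eps=1e-8):
        self.params = list(params)
        self.ref = [p.detach().clone() for p in self.params]      # reference point = initialization
        self.delta = [torch.zeros_like(p) for p in self.params]   # accumulated base updates
        self.base = base_cls(self.params, lr=lr)
        self.sum_corr, self.sum_sq = 0.0, eps                     # online-learning statistics

    @torch.no_grad()
    def step(self):
        # Reward the accumulated direction if it still points downhill.
        corr = sum((-p.grad * d).sum().item()
                   for p, d in zip(self.params, self.delta) if p.grad is not None)
        self.sum_corr += corr
        self.sum_sq += corr ** 2
        s = max(0.0, self.sum_corr) / self.sum_sq ** 0.5          # adaptive, parameter-free-style scale

        # Let the base optimizer take its usual step, then record the displacement.
        before = [p.detach().clone() for p in self.params]
        self.base.step()
        for p, b, d in zip(self.params, before, self.delta):
            d += p.detach() - b

        # Re-anchor: policy = reference point + s * accumulated updates.
        for p, r, d in zip(self.params, self.ref, self.delta):
            p.copy_(r + s * d)

    def zero_grad(self):
        self.base.zero_grad()
```

When the scale stays small, the policy is effectively regularized toward its initialization; when the accumulated direction keeps paying off, the scale grows and the regularization loosens, without any hand-tuned schedule.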
This framing led to FastTRAC (NeurIPS 2024; Spotlight at RSS and RLC 2024 Workshops), a parameter-free optimizer that mitigates plasticity loss, accelerates forward transfer, and avoids policy collapse. At its core, FastTRAC excels at optimizing objectives that are closely related in task space. That naturally left me with an unanswered question: can parameter-free optimization extend to rapid Sim2Real transfer, or to on-the-fly adaptation in the real world?
I further addressed loss of plasticity with Professor Ila Fiete at MIT, hypothesizing that sequentially presented tasks degrade policy plasticity relative to mixed-task batch data. Co-leading the project, I developed a particle-filter optimizer inspired by Bayesian model averaging and showed, theoretically and empirically, that its performance is invariant to the order in which tasks are presented during training. This optimizer significantly improved policy plasticity and has been submitted to ICLR 2025.
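For intuition, here is a toy sketch of the kind of mechanism I mean (a generic particle-filter optimizer, not the submitted method): parameter particles take noisy gradient steps, are reweighted by how well they fit the current task's data, and are averaged, which makes the resulting estimate less sensitive to the order in which tasks arrive.

```python
import numpy as np

def particle_filter_optimize(loss_fn, grad_fn, theta0, steps=1000, n_particles=16,
                             lr=1e-2, noise=1e-2, temp=1.0):
    """Toy sketch: maintain a population of parameter particles, move each by a
    noisy gradient step, reweight by current-task fit, and return the
    Bayesian-model-average of the particles."""
    particles = theta0 + noise * np.random.randn(n_particles, theta0.size)
    log_w = np.zeros(n_particles)
    for _ in range(steps):
        for k in range(n_particles):
            # Noisy gradient step on the current task's loss.
            particles[k] -= lr * grad_fn(particles[k]) + noise * np.random.randn(theta0.size)
            log_w[k] = -temp * loss_fn(particles[k])   # likelihood-style weight from current fit
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        # Resample when the effective sample size collapses.
        if 1.0 / np.sum(w ** 2) < n_particles / 2:
            idx = np.random.choice(n_particles, size=n_particles, p=w)
            particles, log_w = particles[idx].copy(), np.zeros(n_particles)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return (w[:, None] * particles).sum(axis=0)        # model-averaged parameters
```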
Q2) Can we learn with unknown agents? In most multi-agent RL (MARL) algorithms, agent representations and policies are explicitly provided to all agents. But what happens when that assumption is removed? I wondered: Could agents discover one another through vision alone?
My first attempt at this question, as a final project in Professor Pulkit Agrawal's class at MIT, failed. But I couldn't let it go. With support from Professor Samuel Gershman, my senior thesis proposal was selected for funding by Harvard's Kempner Institute.
Over nine months, I developed a generative agent model that processes pixel observations from multi-agent RL games and predicts future frames. My key insight was to learn an object-centric inverse action dynamics model using a conditional variational autoencoder (CVAE). The CVAE's prior implicitly learned a policy for each object in the scene, enabling the model to decompose agents' goals, rewards, and policies directly from visual input. Through linear probing, I showed that the model generalized to unseen agents and game settings, capturing agent-centric properties without supervision. Most importantly, I showed this model accelerates IPPO's convergence to a collaborative policy in MARL games, where standard IPPO policies fail. This work earned a Harvard Kempner Institute Poster Award and is in preparation for ICML 2025.
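A schematic of the core component looks roughly like the sketch below, under stated assumptions (object-centric features extracted upstream, discrete actions, and layer sizes I chose for illustration): the encoder sees the full transition, the prior sees only the current frame and therefore doubles as an implicit per-object policy, and the decoder reconstructs the action.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectInverseDynamicsCVAE(nn.Module):
    """Sketch of the idea: a conditional VAE that, given object-centric features
    at times t and t+1, reconstructs the action that produced the transition.
    The prior p(z | o_t) is conditioned only on the current observation, so it
    serves as an implicit per-object policy."""

    def __init__(self, obj_dim=64, act_dim=5, z_dim=16, hidden=128):
        super().__init__()
        # Encoder q(z | o_t, o_{t+1}): sees the full transition.
        self.enc = nn.Sequential(nn.Linear(2 * obj_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * z_dim))
        # Prior p(z | o_t): sees only the current frame -> implicit policy.
        self.prior = nn.Sequential(nn.Linear(obj_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 2 * z_dim))
        # Decoder p(a | z, o_t): reconstructs the (discrete) action.
        self.dec = nn.Sequential(nn.Linear(z_dim + obj_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, act_dim))

    def forward(self, obj_t, obj_t1, action):
        mu_q, logvar_q = self.enc(torch.cat([obj_t, obj_t1], -1)).chunk(2, -1)
        mu_p, logvar_p = self.prior(obj_t).chunk(2, -1)
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()   # reparameterization
        logits = self.dec(torch.cat([z, obj_t], -1))
        recon = F.cross_entropy(logits, action)                      # action reconstruction loss
        kl = 0.5 * (logvar_p - logvar_q                              # KL(q || p), diagonal Gaussians
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                    - 1).sum(-1).mean()
        return recon + kl
```

At inference time, sampling z from the prior given only the current frame and decoding it yields the implicit policy used for the probing and IPPO experiments described above.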
The project also left me with deeper questions: Could we scale this generative model beyond 2D games to the real world, learning to predict agent dynamics directly from real-world observations?
Q3) Do we always have to start from scratch? While studying TD learning and tabular Q-learning in my classes, I found it frustrating that they start tabula rasa—Q-updates felt like a dog chasing its tail. I wondered: Can we initialize the target Q-values with something better than random? My idea was simple: use LLMs as zero-shot value approximators to warm-start the target Q-table or network, distilling LLM priors into Q-learning.
In collaboration with Professors Ila Fiete and Samuel Gershman, I prompt-engineered LLaMA-70b by providing text representations of states and the Bellman update equation, framing it as a Q-network agent tasked with predicting value estimates. The LLM-generated value table significantly accelerated tabular Q-learning convergence. Additionally, I demonstrated rapid Deep Q-learning convergence by warm-starting with LLM-estimated target Q-values for the TD update.
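In pseudocode terms, the recipe is simply a different initialization of the Q-table. The sketch below assumes a minimal tabular environment interface (integer states, env.reset() returning a state, env.step(a) returning (next_state, reward, done)); the LLM call is left as a labeled stub rather than a real client.

```python
import numpy as np

def llm_q_estimate(state, action):
    """Stub for the LLM call. In the project this was LLaMA-70b prompted with a
    text description of the state, the candidate action, and the Bellman update
    equation, asked to return a scalar value estimate."""
    return 0.0  # hypothetical placeholder; plug in an actual LLM client here

def warm_start_q_table(states, actions):
    """Initialize Q(s, a) from LLM value estimates instead of zeros."""
    Q = np.zeros((len(states), len(actions)))
    for s, state in enumerate(states):
        for a, action in enumerate(actions):
            Q[s, a] = llm_q_estimate(state, action)
    return Q

def q_learning(env, Q, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Standard epsilon-greedy tabular Q-learning; only the initialization of Q changes."""
    n_actions = Q.shape[1]
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = np.random.randint(n_actions) if np.random.rand() < eps else int(np.argmax(Q[s]))
            s2, r, done = env.step(a)
            target = r + (0.0 if done else gamma * np.max(Q[s2]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q
```

The Deep Q-learning variant follows the same pattern, with the LLM estimates supplying the initial target Q-values for the TD update instead of a warm-started table.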
Moving forward, I wonder: How can we leverage foundation model priors to make Q-learning more sample-efficient for continuous, real-world tasks like robotic manipulation?
This past year, I've focused on how agents learn in unknown worlds. But every project has left me asking: how do we scale these ideas so agents can learn in the real world—especially in robotics?
During my PhD, I'm eager to work on robot learning and generalizable real-world policies. I'm particularly interested in improving long-horizon planning for robotics by building on three unanswered questions from my previous work:
1) Can we pretrain vision-language-action (VLA) models with simulation rollouts from RL agents and use parameter-free optimization for online adaptation with minimal data?
2) Can we train generative world models from multimodal data—like internet videos, audio, and tactile streams—and enable multi-abstraction hierarchical planning in these interactive simulators?
3) Can we bootstrap real-world RL for robotics using VLAs to guide efficient exploration in hard tasks, like precise dexterous manipulation?
Moreover, I believe enabling real-world robot learning means drawing inspiration from human cognition. I want to explore ideas like chain-of-thought reasoning, process reward models, and test-time compute strategies to help VLAs decompose long-horizon problems into manageable subproblems. I'm also excited to study few-shot function learning, not just for robotics but for abstract domains like ARC-AGI.
By working on VLAs, generative world simulators, real-world RL, and few-shot function learning, my goal is clear: teach robots to adapt and learn in the real world as quickly as possible.