In my reinforcement learning projects—optimization for lifelong RL [1], policy particle filters, and generative agent models—I often found myself circling back to the fundamentals: temporal-difference (TD) learning [2], [3]. Elegant as it is in theory, Q-learning keeps surprising me in practice: it starts tabula rasa, and early Q-updates bootstrap off equally arbitrary estimates, like a dog chasing its own tail.
This realization has motivated me to ask: why not initialize the Q-values with something better than random? Specifically, can we use large language models (LLMs) as zero-shot value approximators to warm-start the Q-table or network?
Using LLMs for decision-making tasks is nothing new. Examples include VOYAGER [4], an LLM-driven lifelong learning agent in Minecraft. Similarly, LLaRP [5] finetunes LLMs with actor and critic heads for embodied visual tasks. In-context learning for policy iteration has also been explored [6]. In robotics, vision-language-action models, such as OpenVLA [7], finetune LLMs to directly output motor commands. LLMs and VLMs have also been used as reward approximators [8] and reward designers [9], [10], [11].
Distinct from using LLMs as policy or reward approximators, I want to investigate LLMs as value function approximators in Q-learning. The idea is to distill LLM priors into Q-learning: the LLM predicts initial value estimates for states based on its encoded world knowledge. Unlike policy-based applications, this approach focuses on bootstrapping TD methods with more informative priors for the Q-table or Q-network, tackling the "inefficiency" of tabula rasa learning.
I designed a 12x12 grid-world environment to quickly test how different Q-table initializations affect convergence. The agent's goal is to reach the terminal state at (11, 11), receiving a reward of -1 for each non-terminal step.
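To make the setup concrete, here is a minimal sketch of the grid world and the tabular learning loop, roughly in the spirit of what I ran; the start state, hyperparameters, and action ordering below are illustrative assumptions rather than the exact values I used.

```python
import numpy as np

SIZE, GOAL = 12, (11, 11)
GAMMA, ALPHA, EPS = 0.95, 0.1, 0.1            # illustrative hyperparameters
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
rng = np.random.default_rng(0)

def step(state, a):
    """Deterministic move; walls keep the agent in place. Every non-terminal step costs -1."""
    r, c = state
    dr, dc = ACTIONS[a]
    nxt = (min(max(r + dr, 0), SIZE - 1), min(max(c + dc, 0), SIZE - 1))
    return nxt, -1.0, nxt == GOAL

def q_learning(Q, episodes=500):
    """Epsilon-greedy tabular Q-learning starting from whatever Q-table is passed in."""
    for _ in range(episodes):
        s, done = (0, 0), False                 # assumed start state
        while not done:
            a = int(rng.integers(4)) if rng.random() < EPS else int(np.argmax(Q[s]))
            s2, r, done = step(s, a)
            target = r + (0.0 if done else GAMMA * np.max(Q[s2]))
            Q[s][a] += ALPHA * (target - Q[s][a])   # standard TD(0) update
            s = s2
    return Q

# Baseline: random initialization; the LLM-initialized table is built the same way (see below).
Q_random = q_learning({(i, j): rng.normal(size=4) for i in range(SIZE) for j in range(SIZE)})
```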
To initialize the Q-table with LLM estimates, I instructed GPT-4o to estimate Q-values for state-action pairs. The model was prompt-engineered with descriptions of the environment and the Bellman equation to output structured JSON representations. Below is a partial version of the system prompt:
You are a helpful assistant who reasons through Q-learning calculations step by step. Guide the user through the solution to estimate Q-values for each action (up, down, left, right) for State (STATE), and then provide the best action. We are working with a 12x12 grid world environment. The agent can move in four directions: up, down, left, or right. The bottom-right corner (state (11, 11)) is the terminal state, and all other states give a reward of -1 for each step. The discount factor (γ) is {gamma}.
For each action (up, down, left, right), compute the Q-value by considering the immediate reward and the discounted value of the next state, based on the formula: Q(s, a) = reward + γ · max_a' Q(s', a').
(This is a summary of the full prompt, which includes additional details about the task setup and response structure.)
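Here is a sketch of how the Q-table can be warm-started from these LLM estimates, using the OpenAI Python SDK; the response schema (per-action JSON keys) and the user-message format are my assumptions, since the full prompt's response structure is only summarized above.

```python
import json
import numpy as np
from openai import OpenAI  # official OpenAI Python SDK; assumes OPENAI_API_KEY is set

SIZE = 12
SYSTEM_PROMPT = "..."      # the system prompt summarized above, with its {gamma} placeholder
client = OpenAI()

def llm_q_estimate(state, gamma=0.95):
    """Ask GPT-4o for Q-value estimates of one state. The JSON keys below are an assumed
    response schema, not the exact structure specified in the full prompt."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT.format(gamma=gamma)},
            {"role": "user", "content": f"State: {state}. Respond with JSON keys up, down, left, right."},
        ],
    )
    q = json.loads(resp.choices[0].message.content)
    return np.array([q["up"], q["down"], q["left"], q["right"]], dtype=float)

# Build the LLM-initialized Q-table (144 queries for the 12x12 grid), then train it with the
# same q_learning() loop as the random baseline.
Q_llm = {(i, j): llm_q_estimate((i, j)) for i in range(SIZE) for j in range(SIZE)}
```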
Using the LLM-estimated Q-values for initialization significantly accelerated convergence in tabular Q-learning (LLaMA-70b yields similar results).
Comparison of value initialization methods in a 12x12 grid world: GPT-4o generates suboptimal Q-values (middle) compared to true value iteration (left). Tabular Q-learning with GPT-4o-initialized Q-table significantly accelerates convergence over random initialization across 30 runs (right).
While these results are promising, they raise questions about scalability. How can this approach be extended to high-dimensional RL tasks/games like Crafter and be integrated with Deep Q-learning methods?
To explore the scalability of large model value approximation, we propose extending this framework to high-dimensional tasks such as Crafter [12], a procedurally-generated survival environment where an agent interacts with the world through discrete text actions (e.g., "chop wood" or "craft tools"). Crafter is challenging due to its vast state space, sparse rewards, and the necessity for long-horizon planning.
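As a rough sketch of how the value queries might carry over to Crafter, one option is to render each observation into a short text description and ask the LLM for a scalar state-value estimate; the describe_observation helper, prompt wording, and JSON schema below are hypothetical, not part of Crafter's API or a finalized design.

```python
import json
from openai import OpenAI

client = OpenAI()

def describe_observation(obs, info):
    """Hypothetical helper: summarize a Crafter observation as text
    (e.g., inventory, health, nearby resources). Left as a stub here."""
    raise NotImplementedError

def llm_state_value(obs, info, gamma=0.99):
    """Ask GPT-4o for a scalar value estimate of a Crafter state described in text."""
    system = (
        "You estimate long-term discounted returns for the Crafter survival game "
        f"(discount factor {gamma}). Respond with JSON of the form " + '{"value": <number>}.'
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": describe_observation(obs, info)},
        ],
    )
    return float(json.loads(resp.choices[0].message.content)["value"])
```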
Querying an LLM for every encountered state is infeasible at this scale, so we instead cache LLM value estimates for a representative set of states and retrieve estimates for novel states by similarity. To compute these similarities, we adapt techniques like Generalized Similarity Functions (GSFs) [16] or Object-Centric Generalized Value Functions (GVFs) [17], which ground similarity in expected future trajectories or object-centric features.
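As a simplified stand-in for these similarity functions, the sketch below caches LLM value estimates keyed by state embeddings and retrieves a value for a new state as a similarity-weighted average of its nearest cached neighbors; a learned GSF or object-centric GVF would replace the plain cosine similarity, and the class interface is mine, for illustration.

```python
import numpy as np

class ValueCache:
    """Cache of LLM value estimates, keyed by state embeddings."""

    def __init__(self):
        self.keys = []    # state embeddings (e.g., from a text or visual encoder)
        self.values = []  # corresponding LLM-estimated state values

    def add(self, embedding, llm_value):
        self.keys.append(np.asarray(embedding, dtype=float))
        self.values.append(float(llm_value))

    def lookup(self, embedding, k=5):
        """Return a similarity-weighted average of the k most similar cached values."""
        K = np.stack(self.keys)
        q = np.asarray(embedding, dtype=float)
        sims = K @ q / (np.linalg.norm(K, axis=1) * np.linalg.norm(q) + 1e-8)
        top = np.argsort(sims)[-k:]
        w = np.maximum(sims[top], 0) + 1e-8
        return float(np.dot(w, np.array(self.values)[top]) / w.sum())
```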
While directly initializing a Q-network with value estimates from LLMs or cached states may not be feasible, we can incorporate these estimates into the training process through TD updates. Specifically, during the initial stages of training, we can compute the TD error in a DQN by replacing the target Q-network's estimates with the cached value estimates for the first N steps. This guides the Q-network toward a more informed initialization. After this warm-up period, we switch back to using the target Q-network for TD error computation, allowing the model to refine its values based on learned dynamics.
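Here is what that warm-up switch could look like in the TD-target computation of a DQN, assuming PyTorch and an llm_values callable (for example, backed by the similarity cache above) that maps a batch of next states to value estimates; the function names and batch layout are illustrative, not a specific library's API.

```python
import torch
import torch.nn.functional as F

def td_targets(target_net, llm_values, batch, step, warmup_steps, gamma=0.99):
    """Bootstrap targets for DQN. For the first `warmup_steps` updates the next-state value
    comes from cached LLM estimates; afterwards we fall back to the target network."""
    _, _, r, s2, done = batch                          # rewards, next states, terminal flags
    with torch.no_grad():
        if step < warmup_steps:
            next_v = llm_values(s2)                    # LLM / cache-based V(s') estimates
        else:
            next_v = target_net(s2).max(dim=1).values  # standard DQN bootstrap
        return r + gamma * (1.0 - done.float()) * next_v

def dqn_loss(q_net, target_net, llm_values, batch, step, warmup_steps, gamma=0.99):
    """Huber loss between Q(s, a) and the (possibly LLM-warmed) TD targets."""
    s, a, *_ = batch
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    targets = td_targets(target_net, llm_values, batch, step, warmup_steps, gamma)
    return F.smooth_l1_loss(q_sa, targets)
```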
Impact on Policy Convergence: We will evaluate whether LLM Q-value initialization and similarity-based caching improve convergence speed in deep Q-learning methods, including DQN and Double DQN, on Crafter and other high-dimensional RL games.
Value Predictions: We will analyze LLM-based value predictions: are predictions more accurate near goal states or initial states? Are overall predictions optimistic or pessimistic? How consistent are value predictions across structurally similar states?
[1] Muppidi, A., Zhang, Z., & Yang, H. (2024). Fast TRAC: A Parameter-Free Optimizer for Lifelong Reinforcement Learning.
[2] Vidyasagar, M. (2023). A Tutorial Introduction to Reinforcement Learning.
[3] Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. The MIT Press.
[4] Wang, G., et al. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models.
[5] Szot, A., et al. (2024). Large Language Models as Generalizable Policies for Embodied Tasks.
[6] Brooks, E., Walls, L., Lewis, R. L., & Singh, S. (2023). Large Language Models can Implement Policy Iteration.
[7] Kim, M. J., et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model.
[8] Rocamonde, J., Montesinos, V., Nava, E., Perez, E., & Lindner, D. (2024). Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning.
[9] Kwon, M., Xie, S. M., Bullard, K., & Sadigh, D. (2023). Reward Design with Language Models.
[10] Yu, W., et al. (2023). Language to Rewards for Robotic Skill Synthesis.
[11] Ma, Y. J., et al. (2024). Eureka: Human-Level Reward Design via Coding Large Language Models.
[12] Hafner, D. (2022). Benchmarking the Spectrum of Agent Capabilities.
[13] Zhang, Y., et al. (2024). How Far Are We from Intelligent Visual Deductive Reasoning?
[14] Du, Y., et al. (2023). Guiding Pretraining in Reinforcement Learning with Large Language Models.
[15] Pathak, D., Agrawal, P., Efros, A. A., & Darrell, T. (2017). Curiosity-driven Exploration by Self-supervised Prediction.
[16] Mazoure, B., Kostrikov, I., Nachum, O., & Tompson, J. (2021). Improving Zero-shot Generalization in Offline Reinforcement Learning using Generalized Similarity Functions.
[17] Nath, S., Subbaraj, G. R., Khetarpal, K., & Kahou, S. E. (2023). Discovering Object-Centric Generalized Value Functions from Pixels.
[18] Tsai, C. F., Zhou, X., Liu, S. S., Li, J., Yu, M., & Mei, H. (2023). Can Large Language Models Play Text Games Well? Current State-of-the-Art and Open Questions.
[19] Mokady, R., Hertz, A., & Bermano, A. H. (2021). ClipCap: CLIP Prefix for Image Captioning.
[20] Wang, J., et al. (2022). GIT: A Generative Image-to-text Transformer for Vision and Language.
[21] Ma, Y. J., et al. (2024). Vision Language Models are In-Context Value Learners.