In my reinforcement learning projects—optimization for lifelong RL [1], policy particle filters, and generative agent models—I often found myself circling back to the fundamentals: temporal-difference (TD) learning [2], [3]. Elegant as it is in theory, Q-learning keeps surprising me in practice: it starts tabula rasa, and early Q-updates bootstrap off equally arbitrary estimates, like a dog chasing its own tail.
This realization has motivated me to ask: why not initialize the Q-values with something better than random? Specifically, can we use large language models (LLMs) as zero-shot value approximators to warm-start the Q-table or network?
Using LLMs for decision-making tasks is nothing new. Examples include VOYAGER [4], an LLM-driven lifelong learning agent in Minecraft. Similarly, LLaRP [5] finetunes LLMs with actor and critic heads for embodied visual tasks. In-context learning for policy iteration has also been explored [6]. In robotics, vision-language-action models, such as OpenVLA [7], finetune LLMs to directly output motor commands. LLMs and VLMs have also been used as reward approximators [8] and reward designers [9], [10], [11].
Distinct from using LLMs as policy or reward approximators, I want to investigate LLMs as value function approximators in Q-learning. The idea is to distill LLM priors into Q-learning: the LLM predicts initial value estimates for states based on its encoded world knowledge. Unlike policy-based applications, this approach focuses on bootstrapping TD methods with more informative priors for the Q-table or Q-network, tackling the "inefficiency" of tabula rasa learning.
I designed a 12x12 grid-world environment to quickly test how different Q-table initializations affect convergence. The agent's goal is to reach the terminal state at (11, 11), receiving a reward of -1 for each non-terminal step.
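To make the setup concrete, here is a minimal sketch of the grid world and the tabular learning loop, roughly in the spirit of what I ran; the start state, hyperparameters, and action ordering below are illustrative assumptions rather than the exact values I used.

```python
import numpy as np

SIZE, GOAL = 12, (11, 11)
GAMMA, ALPHA, EPS = 0.95, 0.1, 0.1            # illustrative hyperparameters
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
rng = np.random.default_rng(0)

def step(state, a):
    """Deterministic move; walls keep the agent in place. Every non-terminal step costs -1."""
    r, c = state
    dr, dc = ACTIONS[a]
    nxt = (min(max(r + dr, 0), SIZE - 1), min(max(c + dc, 0), SIZE - 1))
    return nxt, -1.0, nxt == GOAL

def q_learning(Q, episodes=500):
    """Epsilon-greedy tabular Q-learning starting from whatever Q-table is passed in."""
    for _ in range(episodes):
        s, done = (0, 0), False                 # assumed start state
        while not done:
            a = int(rng.integers(4)) if rng.random() < EPS else int(np.argmax(Q[s]))
            s2, r, done = step(s, a)
            target = r + (0.0 if done else GAMMA * np.max(Q[s2]))
            Q[s][a] += ALPHA * (target - Q[s][a])   # standard TD(0) update
            s = s2
    return Q

# Baseline: random initialization; the LLM-initialized table is built the same way (see below).
Q_random = q_learning({(i, j): rng.normal(size=4) for i in range(SIZE) for j in range(SIZE)})
```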
To initialize the Q-table with LLM estimates, I instructed GPT-4o to estimate Q-values for state-action pairs. The model was prompt-engineered with descriptions of the environment and the Bellman equation to output structured JSON representations. Below is a partial version of the system prompt:
You are a helpful assistant who reasons through Q-learning calculations step by step. Guide the user through the solution to estimate Q-values for each action (up, down, left, right) for State (STATE), and then provide the best action. We are working with a 12x12 grid world environment. The agent can move in four directions: up, down, left, or right. The bottom-right corner (state (11, 11)) is the terminal state, and all other states give a reward of -1 for each step. The discount factor (γ) is {gamma}.
For each action (up, down, left, right), compute the Q-value by considering the immediate reward and the discounted value of the next state, based on the formula: Q(s, a) = reward + γ · max_a' Q(s', a').
(This is a summary of the full prompt, which includes additional details about the task setup and response structure.)
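Here is a sketch of how the Q-table can be warm-started from these LLM estimates, using the OpenAI Python SDK; the response schema (per-action JSON keys) and the user-message format are my assumptions, since the full prompt's response structure is only summarized above.

```python
import json
import numpy as np
from openai import OpenAI  # official OpenAI Python SDK; assumes OPENAI_API_KEY is set

SIZE = 12
SYSTEM_PROMPT = "..."      # the system prompt summarized above, with its {gamma} placeholder
client = OpenAI()

def llm_q_estimate(state, gamma=0.95):
    """Ask GPT-4o for Q-value estimates of one state. The JSON keys below are an assumed
    response schema, not the exact structure specified in the full prompt."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT.format(gamma=gamma)},
            {"role": "user", "content": f"State: {state}. Respond with JSON keys up, down, left, right."},
        ],
    )
    q = json.loads(resp.choices[0].message.content)
    return np.array([q["up"], q["down"], q["left"], q["right"]], dtype=float)

# Build the LLM-initialized Q-table (144 queries for the 12x12 grid), then train it with the
# same q_learning() loop as the random baseline.
Q_llm = {(i, j): llm_q_estimate((i, j)) for i in range(SIZE) for j in range(SIZE)}
```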
Using the LLM-estimated Q-values for initialization significantly accelerated convergence in tabular Q-learning (LLaMA-70b yields similar results).
Comparison of value initialization methods in a 12x12 grid world: GPT-4o generates suboptimal Q-values (middle) compared to true value iteration (left). Tabular Q-learning with GPT-4o-initialized Q-table significantly accelerates convergence over random initialization across 30 runs (right).
While these results are promising, they raise questions about scalability. How can this approach be extended to high-dimensional RL tasks/games like Crafter and be integrated with Deep Q-learning methods?
To explore the scalability of large model value approximation, we propose extending this framework to high-dimensional tasks such as Crafter [12], a procedurally-generated survival environment where an agent interacts with the world through discrete text actions (e.g., "chop wood" or "craft tools"). Crafter is challenging due to its vast state space, sparse rewards, and the necessity for long-horizon planning.
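As a rough sketch of how the value queries might carry over to Crafter, one option is to render each observation into a short text description and ask the LLM for a scalar state-value estimate; the describe_observation helper, prompt wording, and JSON schema below are hypothetical, not part of Crafter's API or a finalized design.

```python
import json
from openai import OpenAI

client = OpenAI()

def describe_observation(obs, info):
    """Hypothetical helper: summarize a Crafter observation as text
    (e.g., inventory, health, nearby resources). Left as a stub here."""
    raise NotImplementedError

def llm_state_value(obs, info, gamma=0.99):
    """Ask GPT-4o for a scalar value estimate of a Crafter state described in text."""
    system = (
        "You estimate long-term discounted returns for the Crafter survival game "
        f"(discount factor {gamma}). Respond with JSON of the form " + '{"value": <number>}.'
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": describe_observation(obs, info)},
        ],
    )
    return float(json.loads(resp.choices[0].message.content)["value"])
```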
Querying an LLM for every encountered state is infeasible at this scale, so we instead cache LLM value estimates for a representative set of states and retrieve estimates for novel states by similarity. To compute these similarities, we adapt techniques like Generalized Similarity Functions (GSFs) [16] or Object-Centric Generalized Value Functions (GVFs) [17], which ground similarity in expected future trajectories or object-centric features.
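As a simplified stand-in for these similarity functions, the sketch below caches LLM value estimates keyed by state embeddings and retrieves a value for a new state as a similarity-weighted average of its nearest cached neighbors; a learned GSF or object-centric GVF would replace the plain cosine similarity, and the class interface is mine, for illustration.

```python
import numpy as np

class ValueCache:
    """Cache of LLM value estimates, keyed by state embeddings."""

    def __init__(self):
        self.keys = []    # state embeddings (e.g., from a text or visual encoder)
        self.values = []  # corresponding LLM-estimated state values

    def add(self, embedding, llm_value):
        self.keys.append(np.asarray(embedding, dtype=float))
        self.values.append(float(llm_value))

    def lookup(self, embedding, k=5):
        """Return a similarity-weighted average of the k most similar cached values."""
        K = np.stack(self.keys)
        q = np.asarray(embedding, dtype=float)
        sims = K @ q / (np.linalg.norm(K, axis=1) * np.linalg.norm(q) + 1e-8)
        top = np.argsort(sims)[-k:]
        w = np.maximum(sims[top], 0) + 1e-8
        return float(np.dot(w, np.array(self.values)[top]) / w.sum())
```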
While directly initializing a Q-network with value estimates from LLMs or cached states may not be feasible, we can incorporate these estimates into the training process through TD updates. Specifically, during the initial stages of training, we can compute the TD error in a DQN by replacing the target Q-network's estimates with the cached value estimates for the first N steps. This guides the Q-network toward a more informed initialization. After this warm-up period, we switch back to using the target Q-network for TD error computation, allowing the model to refine its values based on learned dynamics.
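Here is what that warm-up switch could look like in the TD-target computation of a DQN, assuming PyTorch and an llm_values callable (for example, backed by the similarity cache above) that maps a batch of next states to value estimates; the function names and batch layout are illustrative, not a specific library's API.

```python
import torch
import torch.nn.functional as F

def td_targets(target_net, llm_values, batch, step, warmup_steps, gamma=0.99):
    """Bootstrap targets for DQN. For the first `warmup_steps` updates the next-state value
    comes from cached LLM estimates; afterwards we fall back to the target network."""
    _, _, r, s2, done = batch                          # rewards, next states, terminal flags
    with torch.no_grad():
        if step < warmup_steps:
            next_v = llm_values(s2)                    # LLM / cache-based V(s') estimates
        else:
            next_v = target_net(s2).max(dim=1).values  # standard DQN bootstrap
        return r + gamma * (1.0 - done.float()) * next_v

def dqn_loss(q_net, target_net, llm_values, batch, step, warmup_steps, gamma=0.99):
    """Huber loss between Q(s, a) and the (possibly LLM-warmed) TD targets."""
    s, a, *_ = batch
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    targets = td_targets(target_net, llm_values, batch, step, warmup_steps, gamma)
    return F.smooth_l1_loss(q_sa, targets)
```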
Impact on Policy Convergence: We will evaluate whether LLM Q-value initialization and similarity-based caching improve convergence speed in deep Q-learning methods, including DQN and Double DQN, on Crafter and other high-dimensional RL games.
Value Predictions: We will analyze LLM-based value predictions: are predictions more accurate near goal states or initial states? Are overall predictions optimistic or pessimistic? How consistent are value predictions across structurally similar states?
[1] Muppidi, A., Zhang, Z., & Yang, H. (2024). Fast TRAC: A Parameter-Free Optimizer for Lifelong Reinforcement Learning.
[2] Vidyasagar, M. (2023). A Tutorial Introduction to Reinforcement Learning.
[3] Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. The MIT Press.
[4] Wang, G., et al. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models.
[5] Szot, A., et al. (2024). Large Language Models as Generalizable Policies for Embodied Tasks.
[6] Brooks, E., Walls, L., Lewis, R. L., & Singh, S. (2023). Large Language Models can Implement Policy Iteration.
[7] Kim, M. J., et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model.
[8] Rocamonde, J., Montesinos, V., Nava, E., Perez, E., & Lindner, D. (2024). Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning.
[9] Kwon, M., Xie, S. M., Bullard, K., & Sadigh, D. (2023). Reward Design with Language Models.
[10] Yu, W., et al. (2023). Language to Rewards for Robotic Skill Synthesis.
[11] Ma, Y. J., et al. (2024). Eureka: Human-Level Reward Design via Coding Large Language Models.
[12] Hafner, D. (2022). Benchmarking the Spectrum of Agent Capabilities.
[13] Zhang, Y., et al. (2024). How Far Are We from Intelligent Visual Deductive Reasoning?
[14] Du, Y., et al. (2023). Guiding Pretraining in Reinforcement Learning with Large Language Models.
[15] Pathak, D., Agrawal, P., Efros, A. A., & Darrell, T. (2017). Curiosity-driven Exploration by Self-supervised Prediction.
[16] Mazoure, B., Kostrikov, I., Nachum, O., & Tompson, J. (2021). Improving Zero-shot Generalization in Offline Reinforcement Learning using Generalized Similarity Functions.
[17] Nath, S., Subbaraj, G. R., Khetarpal, K., & Kahou, S. E. (2023). Discovering Object-Centric Generalized Value Functions from Pixels.
[18] Tsai, C. F., Zhou, X., Liu, S. S., Li, J., Yu, M., & Mei, H. (2023). Can Large Language Models Play Text Games Well? Current State-of-the-Art and Open Questions.
[19] Mokady, R., Hertz, A., & Bermano, A. H. (2021). ClipCap: CLIP Prefix for Image Captioning.
[20] Wang, J., et al. (2022). GIT: A Generative Image-to-text Transformer for Vision and Language.
[21] Ma, Y. J., et al. (2024). Vision Language Models are In-Context Value Learners.