Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL

Best AI papers explained

May 29, 2025•25 min

Episode description

This paper introduces Planning with a Natural Language Critic (PNLC), an approach for improving the planning capabilities of large language models (LLMs) in complex interactive tasks without computationally expensive reinforcement learning (RL) fine-tuning or extensive inference-time search. PNLC trains a lightweight, goal-conditioned value function offline; given a thought or strategy proposed by the LLM agent, this function predicts the likelihood of various future outcomes. At inference time, the value function acts as a natural language critic: it tells the LLM which positive and negative outcomes its thought is likely to lead to, so the LLM can refine its reasoning and actions before committing to them. Experiments on interactive tasks such as web shopping, social deduction, and persuasion show that PNLC outperforms existing RL and prompting methods in both performance and efficiency, and that it scales to larger LLMs.
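
The critic loop described above can be pictured with a rough sketch. This is an illustration under stated assumptions, not the paper's implementation: `critique`, `refine`, `toy_value_fn`, and `toy_propose` are hypothetical names, the value function is a hand-written stub standing in for the offline-trained model, and the proposer is a stub standing in for the LLM.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures (assumptions, not from the paper):
# value_fn(state, thought, outcome) -> estimated probability of that outcome
# propose(state, feedback) -> a thought, revised in light of the feedback
ValueFn = Callable[[str, str, str], float]
Proposer = Callable[[str, str], str]
Outcome = Tuple[str, bool]  # (description, is_positive)


def critique(value_fn: ValueFn, state: str, thought: str,
             outcomes: List[Outcome]) -> str:
    """Render value estimates for a thought as natural-language feedback."""
    lines = []
    for description, is_positive in outcomes:
        p = value_fn(state, thought, description)
        label = "positive" if is_positive else "negative"
        lines.append(f"- [{label}] {description}: estimated likelihood {p:.2f}")
    return "\n".join(lines)


def refine(propose: Proposer, value_fn: ValueFn, state: str,
           outcomes: List[Outcome], rounds: int = 2) -> str:
    """Propose a thought, then revise it using the critic's feedback."""
    thought = propose(state, "")  # initial thought, no feedback yet
    for _ in range(rounds):
        feedback = critique(value_fn, state, thought, outcomes)
        thought = propose(state, feedback)
    return thought


# Toy stand-ins so the sketch runs without a model.
def toy_value_fn(state: str, thought: str, outcome: str) -> float:
    careful = "compare" in thought
    if outcome == "buys an item matching the request":
        return 0.8 if careful else 0.3
    return 0.2 if careful else 0.7


def toy_propose(state: str, feedback: str) -> str:
    if not feedback:
        return "buy the first result"
    return "compare options before buying"


OUTCOMES = [
    ("buys an item matching the request", True),
    ("buys the wrong item", False),
]

if __name__ == "__main__":
    print(critique(toy_value_fn, "shopping for shoes",
                   "buy the first result", OUTCOMES))
    print(refine(toy_propose, toy_value_fn, "shopping for shoes", OUTCOMES))
```

The point of the sketch is the division of labor: the value function only scores a thought against candidate outcomes, and the feedback reaches the proposer as plain text, so no search over action sequences is involved.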
