ShiQ: Bringing back Bellman to LLMs

Best AI papers explained

May 22, 2025 • 18 min

Episode description

This paper introduces ShiQ, an offline reinforcement learning algorithm that adapts traditional Q-learning to the fine-tuning of large language models (LLMs). Applying Q-learning to LLMs raises challenges such as computational cost and initialization issues; the authors address these by deriving theoretically grounded loss functions from Bellman equations. ShiQ supports off-policy, token-wise learning. It is evaluated on several benchmarks, including multi-turn settings, where it is shown to be effective compared with existing methods such as DPO and CoPG. The paper details the theoretical basis of ShiQ and reports empirical results on both synthetic and real-world datasets.
