L1: Length Controlled Reasoning with Reinforcement Learning

Best AI papers explained

Apr 08, 2025 • 17 min

Episode description

This research paper introduces Length Controlled Policy Optimization (LCPO), a reinforcement learning method that trains reasoning language models to keep the length of their generated reasoning within a user-specified constraint. A model trained with LCPO, called L1, controls its reasoning length precisely, which lets users trade computational cost against accuracy across a range of tasks. L1 outperforms prior length control methods and generalizes well to new tasks. The study also finds that models trained for longer reasoning can perform well when asked for short reasoning, even surpassing significantly larger models at comparable token budgets, which suggests a new approach to efficient and scalable reasoning.
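To make the idea of trading accuracy against a length target concrete, here is a minimal sketch of a reward that scores a response on correctness and penalizes deviation from the requested token count. It is an illustration only: the function name, the parameters, and the value of `alpha` are assumptions chosen for this example, and the episode description does not state the exact reward the paper uses.

```python
def length_controlled_reward(
    is_correct: bool,
    target_tokens: int,
    used_tokens: int,
    alpha: float = 0.001,  # hypothetical penalty weight, not taken from the paper
) -> float:
    """Illustrative reward: correctness minus a penalty for missing the length target."""
    correctness = 1.0 if is_correct else 0.0
    # Penalize the absolute gap between requested and actual reasoning length.
    length_penalty = alpha * abs(target_tokens - used_tokens)
    return correctness - length_penalty


# A correct answer that hits the target exactly gets the full reward;
# overshooting the target lowers it.
on_target = length_controlled_reward(True, target_tokens=512, used_tokens=512)
overshoot = length_controlled_reward(True, target_tokens=512, used_tokens=1024)
```

Under a reward of this shape, a larger `alpha` pushes the policy to match the requested length more strictly, while a smaller one lets correctness dominate.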
