The Coverage Principle: How Pre-Training Enables Post-Training - podcast episode cover

The Coverage Principle: How Pre-Training Enables Post-Training

Oct 24, 202516 min
--:--
--:--
Download Metacast podcast app
Listen to this episode in Metacast mobile app
Don't just listen to podcasts. Learn from them with transcripts, summaries, and chapters for every episode. Skim, search, and bookmark insights. Learn more

Episode description

This paper provides a theoretical analysis of next-token prediction in language models, introducing the concept of the coverage profile ($\text{Cov}_N$) as a superior metric to cross-entropy for predicting downstream performance with Best-of-N (BoN) sampling. The authors establish a "coverage principle," demonstrating that maximum likelihood, or next-token prediction, implicitly optimizes the coverage profile, leading to faster generalization that avoids the spurious dependence on sequence length seen in cross-entropy/KL divergence. The research shows that achieving a good coverage profile is necessary and sufficient for BoN success and derives scaling laws relating cross-entropy to coverage, while also exploring various optimization methods like stochastic gradient descent (SGD) and gradient normalization to provably improve coverage bounds. Finally, the text proposes tournament-style estimators for selecting models with optimal coverage, particularly in scenarios where the true data distribution is unknown.

For the best experience, listen in Metacast app for iOS or Android