Stagewise Reinforcement Learning and the Geometry of the Regret Landscape
Episode description
This research paper establishes a formal connection between singular learning theory (SLT) and deep reinforcement learning (RL) to explain how agents change over the course of training. The authors introduce a generalized Bayesian framework and use a complexity measure, the local learning coefficient (LLC), to analyze the geometry of the regret landscape around an agent's policy. Their findings indicate that RL training is characterized by stagewise development, in which models undergo sudden Bayesian phase transitions between different behavioral strategies. In experiments in a "cheese-in-the-corner" environment, agents often plateau in simpler, suboptimal phases before jumping to more complex, higher-performing ones. A key theoretical insight is a simplicity bias: at smaller dataset sizes, a Bayesian learner may prefer a simpler but lower-performing policy over a more complex, higher-performing one. This framework provides a new lens for AI alignment, offering mathematical explanations for phenomena such as goal misgeneralization and reward hacking in terms of the trade-off between reward and model complexity.
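The simplicity bias described above can be illustrated with a small numerical sketch. It assumes the standard leading-order free-energy expansion from singular learning theory, F_n ≈ n · (regret) + λ · log n, with unit inverse temperature and lower-order terms dropped; whether the paper's generalized Bayesian version takes exactly this form is an assumption here. The two phases, their regret values, and their LLC values are hypothetical numbers chosen only for illustration, and the function names are not from the paper.

```python
import math

def free_energy(n, regret, llc):
    """Leading-order asymptotic free energy: n * regret + llc * log(n)."""
    return n * regret + llc * math.log(n)

def preferred_phase(n, phases):
    """Return the name of the phase with the lowest free energy at sample size n."""
    return min(phases, key=lambda name: free_energy(n, *phases[name]))

# Hypothetical (regret, LLC) pairs, not taken from the paper.
phases = {
    "simple": (0.10, 2.0),    # higher regret, lower complexity
    "complex": (0.02, 10.0),  # lower regret, higher complexity
}

for n in (10, 100, 1000, 10000):
    print(n, preferred_phase(n, phases))
```

Under these made-up numbers the simpler phase has the lower free energy at small n, and the more complex phase takes over once the regret term, which grows linearly in n, outweighs the complexity penalty, which grows only like log n.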
