Revisiting the Superficial Alignment Hypothesis
Mar 14, 2025•4 min
Episode description
- The paper revisits the Superficial Alignment Hypothesis.
- It studies how post-training performance scales with the number of finetuning examples.
- Performance improves as a power law in the number of finetuning examples.
- Model performance correlates with reasoning ability, not just style.
- Language models can integrate new knowledge after pre-training.
- Results suggest the hypothesis is an oversimplification.
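To make the power-law claim concrete, here is a minimal sketch of how such a trend could be fitted. It assumes the measured quantity is an error rate of the form error = a * n^(-b), where n is the number of finetuning examples; that functional form, the helper name `fit_power_law`, and the data points are illustrative assumptions, not figures or code from the paper.

```python
import math

def fit_power_law(n_examples, errors):
    """Fit error ~ a * n**(-b) by ordinary least squares in log-log space.

    Hypothetical helper for illustration; not taken from the paper.
    """
    xs = [math.log(n) for n in n_examples]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # A power law is a straight line in log-log space: log(error) = log(a) - b * log(n).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic data generated from error = 0.9 * n**(-0.3); NOT results from the paper.
n_examples = [10, 100, 1000, 10000]
errors = [0.9 * n ** -0.3 for n in n_examples]

a, b = fit_power_law(n_examples, errors)
print(f"a = {a:.3f}, b = {b:.3f}")
```

On real measurements the points would not fall exactly on a line, and the fitted exponent b would summarize how quickly additional finetuning examples pay off.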
