Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
CMSA EVENTS: CMSA MEMBER SEMINAR
Reinforcement Learning has become a crucial step in training state-of-the-art language models, such as DeepSeek-R1, for solving mathematical problems. In this talk, I will first review the mechanisms of Reinforcement Learning fine-tuning. I will then present a systematic end-to-end study of RL fine-tuning for mathematical reasoning, training models entirely from scratch on different mixtures of fully open datasets and fine-tuning them with RL. This setup allows us to investigate how the pretraining data mixture shapes the behavior of RL, and how it interacts with model size and the choice of algorithm hyperparameters. Our study reveals that RL algorithms consistently converge towards a dominant output distribution, amplifying patterns present in the pretraining data. We also find that models of different scales trained on the same data mixture converge to distinct output distributions, suggesting scale-dependent biases in model generalization.
The second part of the talk is based on joint work with Rosie Zhao, Alex Meterez, Cengiz Pehlevan, Sham Kakade, and Eran Malach: https://arxiv.org/abs/2504.07912
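To give a feel for the amplification effect described above, the following is a minimal, hypothetical sketch (not the authors' code, and not the algorithm used in the paper): a softmax policy over a few discrete "solution styles" is updated with a plain REINFORCE rule. When several styles are equally rewarded, the style that the initial (stand-in "pretrained") policy already favors tends to absorb most of the probability mass, illustrating how RL can echo and amplify patterns inherited from pretraining.

```python
# Hypothetical toy illustration: REINFORCE on a softmax policy over four
# discrete "solution styles". Styles 0 and 1 both earn reward, but style 0
# starts with a slightly higher logit (a stand-in for pretraining bias).
import numpy as np

rng = np.random.default_rng(0)

# Initial logits play the role of a pretrained model: style 0 is mildly preferred.
logits = np.array([0.5, 0.2, 0.1, 0.0])
# Assume styles 0 and 1 both solve the task (reward 1); styles 2 and 3 do not.
reward = np.array([1.0, 1.0, 0.0, 0.0])
lr = 0.5

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(200):
    probs = softmax(logits)
    a = rng.choice(len(logits), p=probs)            # sample one output style
    # REINFORCE gradient of the expected reward w.r.t. the logits:
    # R(a) * (one_hot(a) - probs), i.e. reward-weighted grad of log pi(a).
    grad = reward[a] * (np.eye(len(logits))[a] - probs)
    logits += lr * grad

print(np.round(softmax(logits), 3))
# The final distribution typically concentrates almost all mass on a single
# rewarded style, usually the one the initial policy already favored, even
# though two styles are rewarded equally: a rich-get-richer dynamic.
```

This is only a caricature of RL fine-tuning, but it captures the qualitative finding of the talk: the update rule sharpens the policy toward a dominant output mode seeded by the initial (pretraining) distribution rather than exploring the full set of rewarded behaviors.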