CMSA New Technologies in Mathematics Seminar: What Algorithms can Transformers Learn? A Study in Length Generalization


February 14, 2024 2:00 pm - 2:00 pm

Preetum Nakkiran - Apple

Large language models exhibit many surprising "out-of-distribution" generalization abilities, yet also struggle to solve certain simple tasks like decimal addition. To clarify the scope of Transformers' out-of-distribution generalization, we isolate this behavior in a specific controlled setting: length generalization on algorithmic tasks. E.g., can a model trained on 10-digit addition generalize to 50-digit addition? For which tasks do we expect this to work?

Our key tool is the recently introduced RASP language (Weiss et al., 2021), a programming language tailor-made for the Transformer's computational model. We conjecture, informally, that Transformers tend to length-generalize on a task if there exists a short RASP program that solves the task for all input lengths. This simple conjecture remarkably captures most known instances of length generalization on algorithmic tasks, and can also inform the design of effective scratchpads. Finally, on the theoretical side, we give a simple separating example between our conjecture and the "min-degree-interpolator" model of learning from Abbe et al. (2023).
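To give a flavor of the computational model: a RASP program composes a small set of attention-like primitives, chiefly `select` (build an attention pattern from keys, queries, and a predicate) and `aggregate` (read values through that pattern). The following toy Python simulation is an illustrative sketch, not the reference implementation from Weiss et al.; it uses hard one-hot attention to express sequence reversal, a program that works at every input length:

```python
# Toy simulation of two core RASP primitives (select and aggregate),
# simplified for illustration; details are assumptions, not the paper's API.

def select(keys, queries, predicate):
    """Boolean attention matrix: entry [q][k] is True when
    predicate(keys[k], queries[q]) holds."""
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(matrix, values):
    """Hard-attention aggregate: each query position reads the value at
    its (single) selected key position, or None if nothing is selected."""
    out = []
    for row in matrix:
        picked = [v for sel, v in zip(row, values) if sel]
        out.append(picked[0] if picked else None)
    return out

def reverse(tokens):
    """A length-generalizing RASP-style program: position i attends to
    position n-1-i, so the same short program reverses inputs of any length."""
    n = len(tokens)
    idx = list(range(n))
    sel = select(idx, idx, lambda k, q: k == n - 1 - q)
    return aggregate(sel, tokens)

print(reverse(list("abcd")))  # ['d', 'c', 'b', 'a']
```

Because the program above is short and makes no reference to a particular input length, the conjecture predicts that a Transformer trained on short sequences should length-generalize on reversal.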

Joint work with Hattie Zhou, Arwen Bradley, Etai Littwin, Noam Razin, Omid Saremi, Josh Susskind, and Samy Bengio. To appear in ICLR 2024.
Password: cmsa