CMSA New Technologies in Mathematics Seminar: The TinyStories Dataset: How Small Can Language Models Be And Still Speak Coherent English?

September 20, 2023 2:00 pm - 3:00 pm

CMSA, 20 Garden St, G10

Address: 20 Garden Street, Cambridge, MA 02138

Speaker:

Ronan Eldan - Microsoft Research

While generative language models exhibit powerful capabilities at large scale, when either the model
or the number of training steps is too small, they struggle to produce coherent and fluent text:
Existing models whose size is below a few billion parameters often do not generate coherent text
beyond a few sentences. We hypothesize that one of the main reasons for the strong reliance on size
is the vast breadth and abundance of patterns in the datasets used to train those models. This
motivates the following question: Can we design a dataset that preserves the essential elements of
natural language, such as grammar, vocabulary, facts, and reasoning, but that is much smaller and
more refined in terms of its breadth and diversity?
In this talk, we introduce TinyStories, a synthetic dataset of short stories, generated by GPT-3.5
and GPT-4, that contain only words that 3- to 4-year-olds typically understand. We show that TinyStories can
be used to train and analyze language models that are much smaller than the state-of-the-art models
(below 10 million parameters), or have much simpler architectures (with only one transformer
block), yet still produce fluent and consistent stories with several paragraphs that are diverse and
have almost perfect grammar, and demonstrate certain reasoning capabilities. We also show that
the trained models are substantially more interpretable than larger ones, as we can visualize and
analyze the attention and activation patterns of the models, and show how they relate to the
generation process and the story content. We hope that TinyStories can facilitate the development,
analysis and research of language models, especially for low-resource or specialized domains, and
shed light on the emergence of language capabilities in LMs.

This seminar will be held in person and on Zoom: https://harvard.zoom.us/j/95706757940?pwd=dHhMeXBtd1BhN0RuTWNQR0xEVzJkdz09