New directions in synthetic data

CMSA NEW TECHNOLOGIES IN MATHEMATICS

When: May 6, 2026
2:00 pm - 3:00 pm
Where: Virtually
Speaker: Tatsunori Hashimoto (Stanford)

Synthetic data has been an effective, if boring, set of techniques: prompt some language model to restructure your corpus to match some downstream task, occasionally with some distillation. In this talk, we will take a more expansive view of synthetic data as a general algorithmic tool for generative modeling, arguing that the design space and possibilities of synthetic data are much bigger than it might seem. Through a few recent works, we will show that synthetic data has major benefits beyond transforming the data: improving in-domain perplexities and enabling unique algorithmic primitives, such as neighborhood smoothing and concatenated ‘mega’ documents. With this broader view, we will point towards a nascent but interesting possibility of treating data itself as an algorithmic object to be engineered and optimized end-to-end.

Zoom: https://harvard.zoom.us/j/91864143060?pwd=liDbUVYXs47QsYhxdzXYowl8vpQGy1.1