New directions in synthetic data
CMSA NEW TECHNOLOGIES IN MATHEMATICS
Synthetic data has been an effective, if boring, set of techniques: prompt a language model to restructure your corpus to match a downstream task, occasionally with some distillation. In this talk, we will take a more expansive view of synthetic data as a general algorithmic tool for generative modeling, arguing that its design space and possibilities are much bigger than they might seem. Through a few recent works, we will show that synthetic data has major benefits beyond transforming the data: it improves in-domain perplexities and enables unique algorithmic primitives, such as neighborhood smoothing and concatenated ‘mega’ documents. With this broader view, we will point towards a nascent but interesting possibility of treating data itself as an algorithmic object to be engineered and optimized end-to-end.
Zoom: https://harvard.zoom.us/j/91864143060?pwd=liDbUVYXs47QsYhxdzXYowl8vpQGy1.1
