First Proof’s second batch of math problems test AI
Amid claims of rapid advancement for AI’s mathematical prowess, the project known as “First Proof” gained widespread attention in February as a novel experiment to independently test AI’s ability to solve tough problems.
The organizers, including Harvard professor Lauren Williams, collected 10 problems known as “lemmas” — a theorem solved on the way to a larger proof — from leading mathematicians who had solved them but not yet published their answers, meaning no large-language model (LLM) could scrape the internet for easy solutions. They posted those problems online and gave AI companies and the public a week to tackle them.
Collectively, the LLMs solved at least six of 10, with results as mixed as they were intriguing. The organizers had intended First Proof to be a nimble experiment — with no funding and no formal grading process — save for a real-time message board where mathematicians could discuss the solutions. But they soon discovered that some of those solutions were nearly impossible to vet for correctness. At the same time, it could be hard to tell in some cases how much work had been done autonomously by AI as opposed to the mathematicians guiding or coaxing it.
They knew they wanted to try again, with a formalized structure — a newly registered nonprofit, First Proof Foundation Inc. — more transparency, and more controls for the next round, which they dubbed “First Proof, Second Batch.” That included running the tests themselves and checking all of the answers carefully with human referees.
“We wanted to structure the second batch as a more formal benchmark, where transparency was really a key thing,” said Williams, Dwight Parker Robinson Professor of Mathematics and a 2025 MacArthur Foundation “genius grant” recipient.
Read the full story at The FAS Current.