At the headquarters of Google DeepMind, an artificial intelligence laboratory in London, researchers have a longstanding ritual for announcing important results: They bang a large ceremonial gong.
In 2016, the gong sounded for AlphaGo, an AI system that excelled at the game of Go. In 2017, it sounded again when AlphaZero conquered chess. On each occasion, the algorithm had beaten human world champions at the game.
Last week DeepMind researchers got out the gong again to celebrate what Alex Davies, a lead of Google DeepMind’s mathematics initiative, described as a “huge breakthrough” in mathematical reasoning by an AI system. A pair of Google DeepMind models tried their hand at the problem set from the 2024 International Mathematical Olympiad, or IMO, held from July 11 to July 22 about 100 miles west of London at the University of Bath. The event has been billed as the premier math competition for the “world’s brightest mathematicians,” according to a promotional post on social media.
The human problem solvers, 609 high school students from 108 countries, won 58 gold medals, 123 silver medals and 145 bronze medals. The AI performed at the level of a silver medalist, solving four of the six problems for a total of 28 points. It was the first time AI had achieved a medal-worthy performance on Olympiad problems.
“It’s not perfect, we haven’t solved everything,” Pushmeet Kohli, Google DeepMind’s vice president of research, said in an interview. “We want to be perfect.”
However, Kohli described the result as a “step shift,” a transformative change, “in the use of AI in mathematics and in the ability of AI systems to do mathematics.”
The lab asked two independent experts to judge the AI’s performance: Timothy Gowers, a mathematician at the University of Cambridge in England and a Fields Medal winner, who has taken an interest in the interplay of math and AI for 25 years; and Joseph Myers, a software developer in Cambridge. Both won IMO gold medals in their day. Myers chaired this year’s problem selection committee and served at previous Olympiads as a coordinator, judging human solutions. “I tried to rate the AI attempts consistently with how human attempts were rated this year,” he said.
“I was definitely impressed,” Gowers said in an email. The lab had discussed its Olympiad ambitions with him a few weeks in advance, so “my expectations were pretty high,” he said. “But the program met them, and in one or two cases it really beat them.” The program found the “magic keys” that unlocked the problems, he said.
Hitting the gong
After months of intense training, the students sat for two exams, three problems per day, covering algebra, combinatorics, geometry and number theory.
The AI contenders competed remotely, at roughly the same time, from the lab in London. (The students didn’t know Google DeepMind was competing, in part because the researchers didn’t want to steal the spotlight.) The researchers moved the gong into the room where they had gathered to watch the system work. “Every time the system solved a problem, we hit the gong to celebrate,” said David Silver, a research scientist.
Haojia Shi, a student from China, ranked No. 1 and was the only competitor to earn a perfect score: 42 points on six problems; each problem is worth seven points for a complete solution. The U.S. team won first place with 192 points; China came in second with 190.
Google’s system earned its 28 points by completely solving four problems: two in algebra, one in geometry and one in number theory. (It flunked the two combinatorics problems.) The system was allowed unlimited time; for some problems it took up to three days. The students were allotted only 4.5 hours per exam.
For the Google DeepMind team, speed is secondary to overall success, because “it’s really just how much computing power you’re willing to put into these things,” Silver said.
“It’s a turning point in the history of mathematics, because we’ve reached this threshold where these problems can even be tackled at all,” he said. “And I hope that this is not only a step change in terms of the IMO, but also represents the point at which we went from computers being able to prove only very simple things to computers being able to prove things that humans can’t.”
Algorithmic components
Applying AI to mathematics has been part of DeepMind’s mission for several years, often in collaboration with top research mathematicians.
“Mathematics requires this interesting combination of abstract, precise and creative reasoning,” Davies said. In part, he noted, it’s this repertoire of abilities that makes math a good litmus test for the ultimate goal: artificial general intelligence, or AGI, a system with abilities ranging from emergent to competent to virtuoso to superhuman. Companies such as OpenAI, Meta AI and xAI are pursuing similar goals.
Olympiad math problems have come to be regarded as a benchmark for that progress.
In January, a Google DeepMind system named AlphaGeometry solved a sampling of Olympiad geometry problems at nearly the level of a human gold medalist. “AlphaGeometry 2 has now surpassed the gold medalists in solving IMO problems,” Thang Luong, the principal investigator, said in an email.
Riding that momentum, Google DeepMind scaled up its multidisciplinary Olympiad effort, with two teams: one led by Thomas Hubert, a research engineer in London, and another led by Luong and Quoc Le in Mountain View, each with about 20 researchers. For his “superhuman reasoning team,” Luong said, he recruited a dozen IMO medalists, “by far the highest number of IMO medals at Google!”
The lab’s entry at this year’s Olympiad deployed the improved version of AlphaGeometry. Not surprisingly, the model did well on the geometry problem, polishing it off in 19 seconds.
Hubert’s team developed a new model that is comparable but more general. Named AlphaProof, it is designed to tackle a broad range of mathematical subjects. All told, AlphaGeometry and AlphaProof made use of several different AI technologies.
One approach was an informal reasoning system, expressed in natural language. This system leveraged Gemini, Google’s large language model. It used an English-language corpus of published problems, proofs and the like as training data.
The informal system excels at identifying patterns and suggesting what comes next; it is creative and talks about ideas in an understandable way. Of course, large language models are inclined to make things up, which may (or may not) fly for poetry but decidedly not for mathematics. But in this context, the LLM seems to have shown restraint; it wasn’t immune to hallucination, but the frequency was reduced.
Another approach was a formal reasoning system, based on logic and expressed in code. It used theorem-prover and proof-assistant software called Lean, which guarantees that if the system says a proof is correct, then it is indeed correct. “We can directly check whether the proof is correct or not,” Hubert said. “Every step is guaranteed to be logically sound.”
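To give a flavor of what “expressed in code” means here, below is a toy theorem in Lean 4. It is an illustration, not an excerpt of AlphaProof’s output; the point is that Lean accepts a file only if the proof is complete and every inference step checks.

    -- A toy theorem, stated and proved in Lean 4 (illustrative only,
    -- not AlphaProof's output). Lean's checker accepts this file only
    -- if every step is logically valid, which is the guarantee Hubert
    -- describes: a verified proof cannot be wrong.
    theorem add_comm_example (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b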
Another critical component was a reinforcement learning algorithm in the AlphaGo and AlphaZero lineage. This type of AI learns by itself and can scale indefinitely, said Silver, who is vice president of reinforcement learning at Google DeepMind. Since the algorithm doesn’t require a human teacher, it can “learn and retain the learning and keep learning until it can solve the hardest problems that humans can,” he said. “And then maybe even one day go beyond those.”
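To make that loop concrete, here is a minimal, self-contained sketch, with hypothetical names and no relation to DeepMind’s actual code, of reinforcement learning driven by a verifier instead of a human teacher. A random “prover” proposes answers to a tiny problem, and a ground-truth checker, playing the role Lean plays for AlphaProof, supplies the reward.

    import random

    # Toy sketch (illustrative only, not DeepMind's code). The
    # verifier's accept/reject signal is the only feedback; no
    # human-labeled data is involved.

    TARGET = 7 + 5                      # the "theorem" to settle: 7 + 5 = ?
    candidates = list(range(20))        # the space of possible answers
    weights = [1.0] * len(candidates)   # start with a uniform policy

    def verifier(answer: int) -> bool:
        """Ground-truth check, analogous to Lean accepting a proof."""
        return answer == TARGET

    for _ in range(1000):
        answer = random.choices(candidates, weights=weights)[0]
        if verifier(answer):
            weights[answer] += 1.0      # reinforce what the verifier accepts

    best = max(candidates, key=lambda c: weights[c])
    print("Learned answer:", best)      # converges to 12

The sketch compresses the idea to a few lines: nothing labels the data in advance, and the policy improves only because verified attempts are reinforced, which is what lets such a system keep learning without a teacher.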
Hubert added, “The system can rediscover information for itself.” That’s what happened with AlphaZero: It started with no knowledge, Hubert said, “and just by playing games, and seeing who wins and who loses, it could rediscover all the knowledge of chess. It took us less than a day to rediscover all the knowledge of chess, and about a week to rediscover all the knowledge of Go. So we thought, let’s apply this to mathematics.”
Gowers isn’t worried, too much, about the long-term consequences. “One can imagine a state of affairs where mathematicians are essentially left with nothing to do,” he said. “That would be the case if computers got better, and much faster, at everything mathematicians do now.”
“It seems like there’s still a long way to go before computers can do research-level math,” he said. “It’s a fairly safe bet that if Google DeepMind can solve at least some hard IMO problems, then a useful research tool can’t be all that far away.”
A really powerful tool could make mathematics accessible to more people, speed up the research process and nudge mathematicians outside the box. Eventually it might even offer novel ideas that resonate.
c.2024 The New York Times Company