A growing chorus of mathematicians says GPT-5 math capability has crossed a new threshold: solving “minor” open problems—the kind a strong PhD student might dispatch in a day or a few days. The claim, long debated, now rests on converging datapoints from summer 2025 benchmarks and fresh preprints documenting both contest-like prowess and early-stage research contributions. With gold-level performance on Olympiad-style problems, near-perfect coding contest runs, and credible attempts at easy conjectures, the conversation has shifted from “if” to “how” and “how reliably” such systems can contribute to mathematical discovery.
Key Takeaways
– Shows gold-level results with an 83.3% score (35/42) on IMO-style problems, sparking claims GPT-5 can tackle minor open questions in days.
– Reveals 91.7% first-try accuracy (11 of 12) on ICPC tasks and 100% solved with retries, indicating a step-change in problem-solving competence.
– Demonstrates 60% coverage (3 of 5) of easy conjectures with nearly correct solutions, aligning with work that typically takes top PhD students several days.
– Indicates a documented August 20, 2025 solution of an open convex optimization problem, while warning about reproducibility and verification requirements.
– Suggests formalization and proof-checking will be decisive, as automated proofs can raise confidence even when human oversight remains essential.
What the new GPT-5 math claims actually cover
When mathematicians say GPT-5 can handle “minor open problems,” they mean short-horizon research tasks that are unresolved but tractable with standard techniques and a few novel twists. Think lemmas that glue together known tools, optimizations of constants or rates, or small conjectures in well-mapped subfields. The point isn’t general genius; it’s time-to-solution on problems that previously absorbed a day or a few days of focused human effort. That’s the bar—modest, but meaningful—now claimed in light of recent test results and expert commentary tied to gold-level Olympiad performance by leading AI models in July 2025 [1].
Benchmarks behind the buzz: Olympiad, ICPC, and conjecture tests
The Financial Times reported in September 2025 that GPT-5 solved all 12 ICPC-style problems under test conditions, with 11 correct on the first attempt (91.7% first-try accuracy). Experts called it a “step-change,” with the caveat that contest success does not equal genuine research creativity; several mathematicians nonetheless told the FT that GPT-5 now appears able to crack routine open problems in the one-to-few-day difficulty band [2].
A complementary snapshot comes from the Gödel Test, a September 22, 2025 arXiv preprint evaluating easy conjectures in combinatorial optimization. There, GPT-5 produced nearly correct solutions on three out of five cases, a 60% hit rate, and a Microsoft researcher noted that top doctoral students “usually spend several days” on such problems, suggesting performance parity in that limited, short-duration regime [5].
A documented case: GPT-5 in a Malliavin–Stein experiment
Beyond contests, a September 3, 2025 arXiv paper details a Malliavin–Stein experiment in which GPT-5 reportedly solved an open convex optimization problem on August 20, 2025, producing what the authors describe as a novel, potentially publishable result. The authors nonetheless emphasize reproducibility concerns and the ongoing need for human verification of the proof and its quantitative rates [4].
Why the Olympiad number matters—and what it doesn’t prove
The Reuters-reported 35/42 (83.3%) on International Mathematical Olympiad-style problems matters because Olympiad gold-level performance historically signaled elite human capability. AI models matching that bar show that structured problem-solving, multi-step reasoning, and tactical proof assembly are no longer exclusive to top students. Yet Olympiads are still curated puzzles with clean statements and bounded scope. They are not messy conjectures with ambiguous framings, hidden dependencies, or literature gaps. The gold-level signal is strong—but it’s still a signal, not a wholesale guarantee of creative research generalization [1].
How GPT-5 math compares to human effort
Taken together, the 83.3% Olympiad score, the 91.7% first-try ICPC accuracy, and the 60% easy-conjecture coverage outline a profile: a system that can quickly navigate well-specified problems, generate tight solution plans, and iterate when needed. A good PhD student might spend 8–20 hours across two or three days to chase down a modest conjecture, check edge cases, and hone a proof. GPT-5, with structured prompting and verification tools, can compress that cycle to minutes or hours—when the problem lies within its competence envelope. The flipside is brittleness: errors can be subtle and require expert review.
What “nearly correct” looks like in practice
“Nearly correct” in conjecture work often means the model identifies the right technique, sketches a proof path, and nails 80–95% of the necessary steps, but leaves a gap—an unproven inequality, an unjustified limit exchange, or a rate bound that’s off by a factor. In current workflows, a human co-author patches the gap, or a formal proof tool catches the misstep. This is still valuable labor-saving: finding the approach and scaffolding the argument commonly consumes most of the human time budget. Closing the last 5–20% is effortful but tractable for experts.
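As a concrete illustration, the Lean 4 sketch below shows what such a gap can look like in a formalized draft. The lemma is a hypothetical toy example, not drawn from any of the cited papers: the statement and most of the chain are in place, but one step is left as `sorry` for a human to close.

```lean
-- Hypothetical toy lemma for illustration only. The overall bound and the
-- supporting steps are written out, but one inequality is left as `sorry`,
-- the kind of gap a human co-author (or a further model iteration) must patch.
theorem toy_bound (f : Nat → Nat) (h : ∀ n, f n ≤ n + 1) :
    ∀ n, f n ≤ 2 * n + 1 := by
  intro n
  have step1 : f n ≤ n + 1 := h n
  have step2 : n + 1 ≤ 2 * n + 1 := by
    sorry  -- true, but unjustified here: the "last 5–20%" left to an expert
  exact Nat.le_trans step1 step2
```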
Why formalization and proof-checking are pivotal for GPT-5 math
Years of progress in translating informal math into formal statements and machine-checkable proofs are converging with GPT-5-style models. Automated formalization boosts both proving systems and human trust, because a proof that passes a proof assistant’s kernel removes many classes of error—typos, missing cases, illegal inferences. As one scholar put it, formal proofs “increase confidence,” not by removing human oversight, but by making it targeted and efficient. For claims that a model solved an open problem, mechanized verification will increasingly serve as the field’s default standard of evidence [3].
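A minimal fully checked example gives the flavor; the Lean 4 snippets below are illustrative only and not taken from the cited work. Once the kernel accepts a proof, whole classes of slip-ups are ruled out, and a concrete arithmetic claim, such as the 35/42 score rounding to 83%, can be verified by evaluation rather than by hand.

```lean
-- Illustrative Lean 4 snippets, not drawn from the cited papers.
-- A fully checked proof: the kernel accepts the whole argument or rejects it,
-- so missing cases and illegal inferences cannot slip through unnoticed.
theorem le_trans_example (a b c : Nat) (hab : a ≤ b) (hbc : b ≤ c) : a ≤ c :=
  Nat.le_trans hab hbc

-- A concrete arithmetic claim checked by evaluation: 35/42 as a percentage
-- truncates to 83, matching the reported 83.3% figure.
example : 35 * 100 / 42 = 83 := by decide
```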
Limits, failure modes, and the reproducibility test
Despite the eye-catching numbers, three constraints remain. First, problem framing matters: small changes in notation, conventions, or domain assumptions can derail an otherwise competent model. Second, reasoning chains can “look right” yet smuggle in a silent error two pages earlier; without formal checks, human experts must audit line by line. Third, reproducibility is not guaranteed: rerunning the same prompt can yield different arguments, and small nudges can flip success to failure. These failure modes are familiar in human math, but models add stochasticity and opaque latent knowledge.
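A reproducibility check need not be elaborate. The Python sketch below assumes a hypothetical `ask_model` helper, a stand-in for whatever client actually produces a candidate argument, and simply reruns the identical prompt several times and tallies graded verdicts, because a single successful run is weak evidence on its own.

```python
"""Minimal reproducibility harness (sketch, not from the cited papers).
`ask_model` is a hypothetical stand-in for whatever API call produces a
candidate proof; the point is to rerun the identical prompt several times
and tally how often independent runs reach the same verdict."""

from collections import Counter
from typing import Callable


def reproducibility_check(
    ask_model: Callable[[str, int], str],   # (prompt, seed) -> raw answer
    verdict: Callable[[str], str],          # raw answer -> "correct" / "gap" / "wrong"
    prompt: str,
    n_runs: int = 5,
) -> Counter:
    """Run the same prompt n_runs times and tally the graded outcomes."""
    tally: Counter = Counter()
    for seed in range(n_runs):
        answer = ask_model(prompt, seed)
        tally[verdict(answer)] += 1
    return tally


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; replace with a real model
    # client and a real grader (a human reviewer or a proof checker).
    fake_model = lambda prompt, seed: "proof sketch" if seed % 2 == 0 else "partial"
    fake_verdict = lambda answer: "correct" if answer == "proof sketch" else "gap"
    print(reproducibility_check(fake_model, fake_verdict, "prove the toy bound"))
```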
The emerging workflow: human–AI collaboration
In labs and seminars, a pragmatic pipeline is forming. Researchers propose a problem, have GPT-5 sketch multiple solution strategies, and then pick the most promising route for deeper development. The model drafts a proof with explicit lemmas, bounds, and assumptions, and a human co-author tightens constants, checks edge cases, and aligns notation with the literature. For formal confidence, they port the argument into a proof assistant or at least unit-test critical inequalities and rate claims. Iteration cycles compress from days to hours; human time concentrates on validation and novelty.
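Unit-testing an inequality before formalizing it can be as simple as a random counterexample search. The snippet below is a sketch with an arbitrary stand-in inequality (AM–GM), not a bound from the cited work; the same pattern applies to whatever rate or constant a drafted proof claims.

```python
"""Numerical spot-check of a claimed inequality before investing in a formal
proof (sketch; the AM-GM inequality below is an arbitrary stand-in for
whatever bound a drafted proof asserts). Random counterexample search is
cheap and quickly catches constants or exponents that are simply wrong."""

import math
import random


def claimed_bound_holds(a: float, b: float) -> bool:
    # Stand-in claim: sqrt(a*b) <= (a + b) / 2 for nonnegative a, b.
    lhs = math.sqrt(a * b)
    rhs = (a + b) / 2
    return lhs <= rhs + 1e-9 * max(1.0, rhs)  # small slack for float rounding


def spot_check(trials: int = 100_000, seed: int = 0) -> None:
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.uniform(0.0, 1e6), rng.uniform(0.0, 1e6)
        assert claimed_bound_holds(a, b), f"counterexample: a={a}, b={b}"
    print(f"no counterexample found in {trials} random trials")


if __name__ == "__main__":
    spot_check()
```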
What the GPT-5 math gains do—and do not—change
The gains change the economics of exploration: weak leads can be cheaply evaluated; broader neighborhoods of techniques can be tested; and “obvious but tedious” calculations become near-free. They do not change the necessity for taste—choosing what to prove, which conjectures matter, and how a result fits the literature. Nor do they eliminate authorship responsibility: theorems must stand under peer review, and claims of novelty must be checked. The net effect is leverage, not replacement: more ideas tried per week, more dead-ends pruned early, and faster convergence on viable arguments.
GPT-5 math in context: coding, combinatorics, and convexity
The ICPC-style results indicate robust algorithmic thinking under constraints, useful for combinatorics and discrete optimization. The Gödel Test points toward competence on easy conjectures that require blending known methods in a fresh configuration—bread-and-butter work for early-stage researchers. The convex optimization case study hints at occasional genuine novelty when the search space is well-mapped and the target is a crisp property or rate. Across these domains, the underlying pattern is similar: speed at structured reasoning, with final-mile proof polishing still typically human-led.
Signals to watch over the next 6–12 months
Three metrics will show whether today’s claims harden into standard practice. First, the share of model-assisted papers with fully formalized proofs; expect a steady rise from a low base. Second, the proportion of “minor open problems” closed with repeatable prompts and public seeds; that measure will address reproducibility head-on. Third, cross-domain generalization: does a system that handles easy combinatorial conjectures transfer to, say, functional inequalities or probability tail bounds without heavy hand-holding? Movement on these fronts will determine how widely GPT-5 math workflows spread.
The bottom line on the “day-or-few-days” claim
As of late September 2025, the evidence supports a narrow but meaningful conclusion: GPT-5 can often match a strong PhD student’s one-to-few-day effort on modest, well-scoped open problems, particularly when assisted by structured prompts, retrieval, and verification tools. The measured gains—83.3% on Olympiad-style math, 91.7% first-try on ICPC tasks, and 60% coverage on easy conjectures—align with that claim’s scope. The strongest single-case evidence reports a novel solution dated August 20, 2025, with clear caveats: human verification and reproducibility remain non-negotiable.
Source notes and attributions
On July 21, 2025, Reuters reported that OpenAI’s experimental model achieved gold-level IMO-style results with a 35/42 score, and relayed mathematicians’ views on GPT-5’s ability to tackle minor open problems in the one-to-few-day range [1]. The Financial Times followed in September 2025 with ICPC test results (12/12 solved, 11 correct on first try), balanced by expert caution about contest-to-research generalization [2]. The Gödel Test preprint on September 22, 2025 documented 3/5 easy conjectures nearly solved and contextualized the human time budget for such problems [5]. A September 3, 2025 arXiv preprint described a model-produced solution of an open convex optimization problem on August 20, 2025, highlighting reproducibility and verification issues [4]. Finally, a July 25, 2024 MIT Technology Review feature traced how automated formalization and proof checking raise trust and accelerate collaborative math with systems like GPT-5 [3].
Sources:
[1] Reuters – Google and OpenAI’s AI models win milestone gold at global math competition: https://www.reuters.com/world/asia-pacific/google-openais-ai-models-win-milestone-gold-global-math-competition-2025-07-21/
[2] Financial Times – DeepMind and OpenAI achieve gold at ‘coding Olympics’ in AI milestone: https://www.ft.com/content/c2f7e7ef-df7d-4b74-a899-1cb12d663ce6
[3] MIT Technology Review – Google DeepMind’s new AI systems can now solve complex math problems: https://www.technologyreview.com/2024/07/25/1095315/google-deepminds-ai-systems-can-now-solve-complex-math-problems/
[4] arXiv (preprint) – Mathematical research with GPT-5: a Malliavin-Stein experiment: https://arxiv.org/abs/2509.03065
[5] arXiv (preprint) – Gödel Test: Can Large Language Models Solve Easy Conjectures?: https://arxiv.org/abs/2509.18383