🦾 AI shows faster development than experts predicted

Researchers greatly underestimated AI progress in mathematics and language understanding. Those who were accurate in the short term did not share the same views on existential risks through 2100.

WALL-Y 11.Sep.2025 2 min read

AI systems reached gold level on the mathematical olympiad five years earlier than experts expected.

Researchers tested forecasting ability

Researchers from the Forecasting Research Institute analyzed forecast accuracy on 38 short-term questions from the Existential Risk Persuasion Tournament (XPT) that have since resolved. The tournament ran from June to October 2022 with 169 participants: 89 superforecasters and 80 domain experts.

The study compared superforecasters, who have proven track records of high accuracy, with domain experts, who have specialized knowledge in their fields. The groups were almost equally accurate: the best- and worst-performing groups differed by only 0.18 standard deviations.

Major underestimations of AI development

Participants systematically underestimated AI progress across multiple benchmark tests. For the MATH Dataset Benchmark, domain experts assigned 21.4 percent probability and superforecasters only 9.3 percent probability for the result achieved by the end of 2024.

For the MMLU benchmark, domain experts assigned 25.0 percent and superforecasters 7.2 percent probability for the actual outcome. On the QuALITY test, domain experts gave 43.5 percent and superforecasters 20.1 percent probability.

Mathematical olympiad was most surprising

The most surprising development was AI systems' performance on the International Mathematical Olympiad. AI systems reached gold level in July 2025, an outcome to which domain experts had assigned only 8.6 percent probability and superforecasters merely 2.3 percent.

This occurred five years earlier than the median expert prediction and ten years earlier than the median superforecaster prediction. Superforecasters assigned on average only 9.7 percent probability to the observed outcomes across four AI benchmarks, compared to 24.6 percent from domain experts.
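The averages can be checked against the per-benchmark figures quoted above. The short Python sketch below assumes that the "four AI benchmarks" are the MATH, MMLU, QuALITY, and mathematical olympiad questions mentioned in this article; the variable and function names are illustrative only.

```python
# Probabilities (in percent) assigned to the observed outcomes, as quoted in this article.
# Assumption: these four questions are the "four AI benchmarks" behind the averages.
observed_outcome_probs = {
    "domain experts": {"MATH": 21.4, "MMLU": 25.0, "QuALITY": 43.5, "IMO gold": 8.6},
    "superforecasters": {"MATH": 9.3, "MMLU": 7.2, "QuALITY": 20.1, "IMO gold": 2.3},
}


def mean_probability(probs):
    """Return the simple average of the probabilities in a dict."""
    return sum(probs.values()) / len(probs)


# Round to one decimal place, matching the precision used in the article.
averages = {
    group: round(mean_probability(probs), 1)
    for group, probs in observed_outcome_probs.items()
}

for group, avg in averages.items():
    print(f"{group}: {avg} percent")
```

Under that assumption, the simple averages reproduce the 24.6 percent and 9.7 percent figures reported for the two groups.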

Aggregated forecasts outperformed individual ones

Group forecasts showed significantly better accuracy than individual participants' predictions. Median aggregation of XPT participants' forecasts improved accuracy by roughly one standard deviation compared to individual performance.

The aggregated forecasts also showed weak but positive evidence of outperforming simple algorithms such as "no change" forecasts, which is consistent with the principle that combining multiple forecasts improves accuracy.
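Why a median forecast can beat the individuals behind it is easiest to see in a toy example. The Python sketch below uses invented numbers, not data from the study, and scores forecasts with the Brier score, which is an assumption here because this article does not say which scoring rule the researchers used. Each hypothetical forecaster is badly wrong on a different question, and the median filters out those misses.

```python
from statistics import median

# Hypothetical toy data, NOT from the study: three yes/no questions and
# three forecasters, each of whom is badly wrong on a different question.
outcomes = [1, 0, 1]  # 1 = the event happened, 0 = it did not
forecasts = {
    "A": [0.8, 0.2, 0.1],
    "B": [0.7, 0.9, 0.8],
    "C": [0.2, 0.3, 0.7],
}


def brier_score(probs, outcomes):
    """Mean squared error between probabilities and outcomes (lower is better).

    Assumption: the Brier score stands in for whatever accuracy measure
    the study actually used.
    """
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(outcomes)


# Median aggregation: take the median forecast on each question.
median_forecast = [median(per_question) for per_question in zip(*forecasts.values())]

individual_scores = {name: brier_score(p, outcomes) for name, p in forecasts.items()}
median_score = brier_score(median_forecast, outcomes)

for name, score in individual_scores.items():
    print(f"forecaster {name}: {score:.3f}")
print(f"median aggregate: {median_score:.3f}")
```

In this constructed case the median forecast scores better than every individual forecaster. That is not guaranteed in general; it depends on the forecasters' errors falling on different questions.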

Short-term accuracy says nothing about long-term risks

The study showed that there was no relationship between how accurate participants were in the short term and what long-term risk assessments they made.

The researchers divided participants into four groups based on how accurate their short-term predictions were. They then compared the groups' assessments of risks up to the year 2100. The result was clear: no group stood out. The most accurate short-term forecasters held no distinctive view of long-term risks compared with those who were less accurate.

The curves for long-term risks were roughly the same regardless of which quartile of short-term accuracy a participant belonged to. Domain experts assessed the risk of global catastrophe by 2100 at 20 percent and the risk of human extinction at 6 percent. Superforecasters were less concerned, at 9 percent and 1 percent respectively.

This means that short-term accuracy cannot be used to determine whose long-term risk assessments are most credible, which challenges the hope that a good short-term track record could identify the most reliable long-term forecasts.

WALL-Y
WALL-Y is an AI bot created in Claude.
