Can chatbots predict the unknown?
Forecasting tests whether a model can reason under unresolved uncertainty, not just answer questions with known solutions.
On this page
- Why forecasting differs from ordinary question answering
- What crowd forecasts reveal about model calibration
- How uncertainty should change chatbot use
One of the clearest ways to understand the limits of artificial intelligence is to ask it not about the past, but about the future. A chatbot can often answer questions about history, science, law, or software by drawing on information it has already learned. Forecasting is different. The correct answer does not yet exist. The task is not simply to retrieve knowledge but to reason under uncertainty, weigh competing possibilities, and express confidence appropriately.
This distinction matters because many AI systems appear highly capable when evaluated on questions with known answers. Forecasting strips away that advantage. When a chatbot must estimate whether an election outcome, scientific breakthrough, economic event, or geopolitical development will occur, it cannot rely on memorised information. It must confront uncertainty directly. Research increasingly treats forecasting as a valuable test of whether AI systems can reason, update beliefs, and calibrate confidence in situations where nobody yet knows the truth. [OpenReview]
Why forecasting differs from ordinary question answering
Most popular AI benchmarks measure performance on tasks that already have established answers. Mathematics problems, coding challenges, exam questions, and factual quizzes all reward arriving at a known solution. Even difficult reasoning tasks ultimately have a target answer against which performance can be measured.
Forecasting introduces a different challenge. The model must estimate probabilities for events that have not yet happened. Success depends not only on reasoning but also on judgement. A forecaster must gather relevant information, identify uncertainties, avoid cognitive biases, and decide how confident to be. Researchers behind ForecastBench argue that accurate forecasting requires synthesising information, guarding against overconfidence, combining evidence, and quantifying beliefs rather than merely producing fluent responses. [OpenReview]
This exposes a weakness that ordinary chatbot interactions often hide. A chatbot can sound equally confident when discussing a settled fact and when speculating about an uncertain future event. Human users may interpret fluency as confidence and confidence as accuracy. Forecasting benchmarks reveal whether that confidence is justified.
Another reason forecasting is revealing is that it largely avoids the problem of benchmark contamination. Traditional AI tests can sometimes be influenced by training data that contains the answers. Forecasting questions are unresolved when the prediction is made, making memorisation impossible. ForecastBench was designed specifically around future events to eliminate this concern. [arXiv]
What crowd forecasts reveal about model calibration
A central concept in forecasting is calibration. A well-calibrated forecaster assigns probabilities that match reality over time. If a system says an event has a 70% chance of occurring, then roughly seven out of ten such events should happen.
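The 70% example can be checked mechanically. The sketch below is only an illustration, not a method taken from any cited study: it groups forecasts into probability bins and compares the average stated probability in each bin with how often the event actually happened. The function name `calibration_table` and the sample data are hypothetical.

```python
from collections import defaultdict

def calibration_table(forecasts, outcomes, n_bins=10):
    """Bin forecasts by stated probability and compare each bin's mean
    forecast with the observed frequency of the event (1 = happened)."""
    bins = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        # Clamp the index so a forecast of exactly 1.0 lands in the top bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    table = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        freq = sum(y for _, y in pairs) / len(pairs)
        table.append((round(mean_p, 3), round(freq, 3), len(pairs)))
    return table

# Hypothetical data: ten forecasts of 70%, seven of which resolved "yes".
forecasts = [0.7] * 10
outcomes = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]
print(calibration_table(forecasts, outcomes))  # [(0.7, 0.7, 10)]
```

Each row reads as (mean forecast, observed frequency, number of forecasts). For a well-calibrated forecaster the first two values stay close in every bin, provided the bin holds enough resolved questions for the comparison to mean anything.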
Calibration matters because decision-makers often need probabilities rather than yes-or-no answers. A government planning for a disease outbreak, a business evaluating market risks, or a researcher assessing technological progress must understand uncertainty, not merely receive a prediction.
Forecasting competitions provide a useful comparison. Decades of research have shown that aggregated crowd forecasts often outperform individual forecasters because different perspectives cancel out some errors. Recent studies comparing large language models with human forecasting communities have found mixed results. In some settings, individual language models lag behind expert human forecasters and well-functioning forecasting crowds. ForecastBench reported that expert human forecasters outperformed the strongest tested language models on its evaluation set. [arXiv]
At the same time, researchers have found that combining multiple model forecasts can improve performance substantially. Some studies suggest that ensembles of language-model forecasts can approach the accuracy of human forecasting crowds, highlighting that uncertainty estimation improves when diverse predictions are aggregated rather than relying on a single answer. [arXiv, PMC]
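A minimal sketch of what aggregation can look like in practice, assuming each model returns a single probability for the same question. The choice between median and mean here is an illustrative assumption, not a description of the procedure used in the cited studies, and the forecast values are invented.

```python
from statistics import mean, median

def aggregate(probabilities, method="median"):
    """Combine several probability forecasts for one question into one."""
    if not probabilities:
        raise ValueError("need at least one forecast")
    if any(not 0.0 <= p <= 1.0 for p in probabilities):
        raise ValueError("probabilities must lie in [0, 1]")
    if method == "median":
        # The median is less sensitive to a single extreme forecast.
        return median(probabilities)
    if method == "mean":
        return mean(probabilities)
    raise ValueError(f"unknown method: {method}")

# Hypothetical forecasts from five models for the same question.
model_forecasts = [0.55, 0.62, 0.40, 0.95, 0.58]
print(aggregate(model_forecasts))          # 0.58
print(aggregate(model_forecasts, "mean"))  # roughly 0.62
```

The one outlying forecast of 0.95 pulls the mean upward but leaves the median unchanged, which is one reason a median is a common default when individual forecasters are occasionally overconfident.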
These results are important because they reveal a difference between intelligence and calibration. A chatbot may generate sophisticated explanations while still being poorly calibrated about uncertain outcomes. Forecasting tests whether the system knows not only what it thinks, but also how strongly it should believe it.
A concrete example: when future questions expose hidden weaknesses
Forecasting researchers have repeatedly found that language models can perform impressively on many reasoning tasks yet struggle when predictions require consistency and probabilistic judgement.
One study that entered GPT-4 into a real-world forecasting tournament found that its predictions were significantly less accurate than crowd forecasts and in some cases approached the performance of a simple strategy that assigned middling probabilities to everything. The authors argued that forecasting tournaments are particularly useful because the answers are genuinely unknown at prediction time, making them a cleaner test of general reasoning than benchmarks where solutions may already exist in training data. [arXiv]
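To see why a "middling probabilities" strategy is a meaningful baseline, it helps to score forecasts numerically. The sketch below assumes the Brier score (mean squared error between forecast and outcome, lower is better) as the accuracy measure, since this article does not state which metric the study used. All numbers are made up for illustration.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and
    binary outcomes (1 = event happened, 0 = it did not)."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equally sized, non-empty inputs")
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical resolved questions and two forecasting strategies.
outcomes = [1, 0, 0, 1, 0]
confident_model = [0.9, 0.6, 0.2, 0.3, 0.7]  # sometimes confidently wrong
baseline = [0.5] * len(outcomes)             # says 50% for everything

print(round(brier_score(confident_model, outcomes), 3))  # 0.278
print(brier_score(baseline, outcomes))                   # 0.25
```

Always answering 50% scores exactly 0.25 regardless of what happens, so a forecaster only adds value by scoring below that. In this invented example the more confident forecaster does slightly worse than the baseline because two of its stronger predictions were wrong.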
Subsequent work showed that performance can improve dramatically when models are given external information, structured reasoning steps, retrieval systems, and forecast aggregation methods. However, this finding itself is revealing. Simply asking a chatbot for a prediction often produces weak results. Building a competitive forecasting system typically requires additional scaffolding, consistency checks, and specialised processes beyond ordinary conversation. [arXiv]
Researchers and forecasting practitioners have also noted that language models sometimes violate basic logical constraints when estimating probabilities across related events. For example, they may assign a lower probability to an event occurring by a later date than by an earlier date, even though anything that has happened by the earlier date has necessarily happened by the later one. Such inconsistencies expose limitations in uncertainty reasoning that are less visible during standard question answering. [Vox]
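The date constraint described above is simple enough to test automatically. The following sketch assumes a set of forecasts for the same event with different deadlines, and flags any place where the probability drops as the deadline moves later. The function name and the example values are hypothetical.

```python
from datetime import date

def find_date_inconsistencies(forecasts):
    """forecasts: list of (deadline, probability) pairs for
    "the event occurs by <deadline>".

    Returns (earlier, p_earlier, later, p_later) for each pair of
    neighbouring deadlines where the later one has a lower probability.
    Checking neighbours is enough to detect whether any violation exists.
    """
    ordered = sorted(forecasts)  # sort by deadline
    violations = []
    for (d1, p1), (d2, p2) in zip(ordered, ordered[1:]):
        if p2 < p1:
            violations.append((d1, p1, d2, p2))
    return violations

# Hypothetical model output for one event with three deadlines.
example = [
    (date(2026, 6, 30), 0.30),
    (date(2026, 12, 31), 0.25),  # lower than the earlier deadline
    (date(2027, 12, 31), 0.40),
]
for earlier, p_early, later, p_late in find_date_inconsistencies(example):
    print(f"P(by {later}) = {p_late} is below P(by {earlier}) = {p_early}")
```

A check like this is one form the "consistency checks" mentioned above could take: it needs no knowledge of the eventual outcome, only of the logical relationship between the questions.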
How uncertainty should change chatbot use
The forecasting gap does not mean chatbots are useless. On the contrary, they can be highly valuable for gathering information, identifying relevant factors, summarising competing arguments, and generating possible scenarios.
The lesson is that users should distinguish between knowledge assistance and predictive judgement.
When a chatbot explains an existing concept, much of the challenge involves retrieving and organising information. When it predicts a future outcome, the challenge becomes managing uncertainty. The same system may perform strongly in the first task and much less reliably in the second.
Forecasting research therefore encourages a more nuanced view of AI capability:
- Strong language generation does not automatically imply accurate prediction.
- Reasoning quality and calibration are related but distinct abilities.
- Probabilistic estimates are often more informative than categorical answers.
- Aggregated forecasts frequently outperform single forecasts, whether the forecasters are humans or AI systems.
- Confidence should be treated as a measurable property rather than inferred from persuasive language. [arXiv, OpenReview]
For anyone trying to understand artificial intelligence, forecasting provides a useful stress test. It reveals where a chatbot’s apparent certainty rests on genuine predictive skill and where it reflects the limitations of systems that can generate convincing answers without fully understanding how uncertain the future remains. [OpenReview, metaculus.com]
Further Reading
Books and field guides related to "Can chatbots predict the unknown?" Use these as the next step if you want deeper reading beyond the article.
Superforecasting
The foundational popular book on prediction, uncertainty, and calibration.
How to Measure Anything
Focuses on quantifying uncertainty and making better predictions.
Endnotes
-
Source: openreview.net
Title: To produce an accurate forecast, a person or AI system must synthesize
Link: https://openreview.net/forum?id=lfPkGWXLLf
Source snippet: ForecastBench: A Dynamic Benchmark of AI Forecasting... by E Karger · Cited by 57 - Forecasting is a useful testbed of LLM reas...
-
Source: arxiv.org
Title: ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities
Link: https://arxiv.org/abs/2409.19839
-
Source: forecastbench.org
Link: https://www.forecastbench.org/about/
Source snippet: 8 Oct 2025 - We evaluate LLMs by regularly asking them to make probabilistic forecasts about future events, thereby creating a contaminat...
-
Source: arxiv.org
Link: https://arxiv.org/html/2409.19839v4
Source snippet: A Dynamic Benchmark of AI Forecasting Capabilities. While LLMs have achieved super-human performance on many benchmarks, they perform less...
-
Source: arxiv.org
Link: https://arxiv.org/html/2402.19379v4
Source snippet: LLM Ensemble Prediction Capabilities Rival Human Crowd... Our results suggest that LLMs can achieve forecasting accuracy rivaling th...
-
Source: pmc.ncbi.nlm.nih.gov
Title: Wisdom of the silicon crowd: LLM ensemble prediction
Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC11800985/
Source snippet: by P Schoenegger · 2024 · Cited by 97 - Our findings suggest that LLM predictions can rival the human crowd's forecasting accuracy thr...
-
Source: arxiv.org
Title: Approaching Human-Level Forecasting with Language Models
Link: https://arxiv.org/abs/2402.18563
-
Source: arxiv.org
Link: https://arxiv.org/abs/2310.13014
Source snippet: Large Language Model Prediction Capabilities: Evidence from a Real-World Forecasting Tournament · October 17, 2023...
Published: October 17, 2023
-
Source: vox.com
Link: https://www.vox.com/future-perfect/411742/ai-forecasting-prediction-metaculus-llm
Source snippet: Competitions like those on Metaculus show that human experts consistently beat AI forecasters, although the performance gap is narrowing...
-
Source: metaculus.com
Title: Introducing FutureEval, our new home for AI forecasting
Link: https://www.metaculus.com/notebooks/42136/introducing-futureeval-our-new-home-for-ai-forecasting/
Source snippet: 12 Feb 2026 - FutureEval is Metaculus's new benchmark that measures how well AI ag...
-
Source: forecastbench.org
Link: https://www.forecastbench.org/
Source snippet: Explore how LLM forecasting accuracy evolves on ForecastBench. A linear trend projects the date when LLMs reach superforecas...
-
Source: metaculus.com
Title: Exploring Metaculus’s AI Track Record
Link: https://www.metaculus.com/notebooks/16708/exploring-metaculuss-ai-track-record/
Source snippet: March 28, 2023 - In this post, we report the results of a recent analysis we conducted exploring the performance of all AI-related foreca...
Published: March 28, 2023
-
Source: metaculus.com
Title: AI Driven AI Forecasting Literature Review
Link: https://www.metaculus.com/notebooks/43430/ai-forecasting-literature-review/
Source snippet: May 16, 2026 - This is an AI-driven, human-reviewed piece, designed as a research aid to a larg...
Published: May 16, 2026
-
Source: metaculus.com
Title: Another Year of AI Benchmarking: The Plan
Link: https://www.metaculus.com/notebooks/38909/another-year-of-ai-benchmarking-the-plan/
Source snippet: 22 Jul 2025 - Over the last year, Metaculus has run a $120k tournament split over 4 quarters whe...
-
Source: metaculus.com
Title: AI Forecasting Benchmark Tournament - 2025 Q2
Link: https://www.metaculus.com/tournament/aibq2/
Source snippet: This is the 4th tournament in our $120,000 series designed to benchmark AI forecasting capab...
-
Source: metaculus.com
Title: fall aib 2025
Link: https://www.metaculus.com/tournament/fall-aib-2025/
Source snippet: This is a bot-only competition where bot-makers attempt to push AI to its limits in predicting future events.
-
Source: metaculus.com
Link: https://www.metaculus.com/questions/40290/when-will-llms-beat-superforecasters-at-forecastbench/
Source snippet: When will LLMs beat superforecasters at ForecastBench? Metaculus is an online forecasting platform and aggregation engine working to impro...
-
Source: openreview.net
Link: https://openreview.net/forum?id=R3VBfYVK1x
Source snippet: I evaluate state-of-the-art LLMs on 464 forecasting...
-
Source: openreview.net
Link: https://openreview.net/forum?id=QqtvS8ZMhb
Source snippet: that forecasting small-model failure can reduce inference cost while...
-
Source: arxiv.org
Link: https://arxiv.org/html/2601.22444v2
Source snippet: Automating Forecasting Question Generation and... 9 Mar 2026 - Abstract. Forecasting future events is highly valuable in decision-making...
-
Source: emergentmind.com
Link: https://www.emergentmind.com/topics/forecastbench
Source snippet: Dynamic AI Forecast Benchmark · 20 Feb 2026 - ForecastBench is a dynamic benchmark evaluating AI forecasting with contamination-free, contin...