Why crowds and ensembles forecast better
Combining many forecasts can reduce individual errors and make AI predictions more reliable than a single model answer.
Introduction
Forecasting is useful for understanding artificial intelligence because it forces systems to deal with uncertainty rather than retrieve known answers. One of the most important lessons from forecasting research is that a single prediction is often less reliable than a carefully combined set of predictions. This principle, known as forecast aggregation, has long improved human forecasting and is increasingly improving AI forecasting as well.
Rather than asking one person or one model for the answer, aggregation combines multiple probability estimates into a single forecast. The result is often better calibrated, less vulnerable to individual mistakes, and more accurate over time. Research on both human forecasting tournaments and large language models shows that combining diverse forecasts can substantially narrow the gap between individual predictions and consistently strong performance. [arXiv]
Why averaging can improve calibration
The central idea behind aggregation is simple: different forecasters make different errors.
One forecaster may be too optimistic. Another may be too pessimistic. A third may focus on the wrong evidence. When these judgments are combined, errors that point in opposite directions partly cancel out. The final forecast is often closer to reality than most individual contributions, even if a single forecaster occasionally does better on a given question.
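The cancellation effect can be illustrated with a small sketch. The three forecasters, their probabilities, and the outcome below are hypothetical numbers chosen for illustration, and the Brier score (squared error between a probability and a 0/1 outcome) is used here as one common accuracy measure, not as the measure any particular study used.

```python
def brier(prob: float, outcome: int) -> float:
    """Squared error between a stated probability and a 0/1 outcome (lower is better)."""
    return (prob - outcome) ** 2

# Hypothetical forecasts for a single event that did occur (outcome = 1).
forecasts = {"optimist": 0.95, "pessimist": 0.40, "wrong_evidence": 0.60}
outcome = 1

# Score each forecaster separately.
individual_scores = {name: brier(p, outcome) for name, p in forecasts.items()}

# Aggregate by simple averaging, then score the aggregate.
mean_forecast = sum(forecasts.values()) / len(forecasts)
aggregate_score = brier(mean_forecast, outcome)

# Average of the individual scores, for comparison with the aggregate.
mean_individual_score = sum(individual_scores.values()) / len(individual_scores)

print(f"aggregate forecast: {mean_forecast:.2f}, Brier: {aggregate_score:.4f}")
print(f"average individual Brier: {mean_individual_score:.4f}")
```

In this toy case the averaged forecast scores better than two of the three forecasters and better than the average individual score, but not better than the single best forecaster. For squared-error scores, the aggregate never scores worse than the average individual; beating the best individual is not guaranteed.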
Forecasting researchers describe this as a form of the “wisdom of crowds”. Studies associated with the Good Judgment Project found that aggregation, alongside training and teamwork, was one of the major drivers of forecasting improvement. Teams and aggregated forecasts consistently outperformed isolated individuals because they reduced the impact of idiosyncratic mistakes. [Good Judgment, AI Impacts]
For AI systems, the same mechanism applies. A language model can produce slightly different forecasts depending on the prompt, model version, retrieved information, or reasoning path. Each forecast contains signal but also noise. Averaging across multiple forecasts often preserves the signal while reducing the noise. This tends to improve calibration, the match between stated probabilities and real-world outcomes. [arXiv]
The improvement is especially important because forecasting is fundamentally probabilistic. Decision-makers usually need estimates such as “30% likely” or “70% likely”, not merely yes-or-no answers. Aggregation helps those probabilities become more reliable over large numbers of predictions. [ForecastBench]
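Calibration can be checked by grouping forecasts by stated probability and comparing each group with how often the events actually happened. The sketch below is a minimal version of that check; the function name `calibration_table`, the bin count, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict

def calibration_table(probs, outcomes, n_bins=5):
    """Group forecasts into equal-width probability bins and compare the
    mean stated probability with the observed event frequency per bin.

    Returns a list of (bin_index, count, mean_probability, observed_frequency).
    """
    bins = defaultdict(list)
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((p, y))

    table = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        freq = sum(y for _, y in pairs) / len(pairs)
        table.append((idx, len(pairs), mean_p, freq))
    return table

# Hypothetical forecasts and resolved outcomes (1 = event happened).
probs = [0.1, 0.1, 0.3, 0.3, 0.3, 0.7, 0.7, 0.7, 0.9, 0.9]
outcomes = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]

for idx, count, mean_p, freq in calibration_table(probs, outcomes):
    print(f"bin {idx}: n={count}, stated={mean_p:.2f}, observed={freq:.2f}")
```

A well-calibrated forecaster shows stated and observed values that are close in every bin. With only a handful of forecasts per bin, as here, the comparison is noisy, which is why calibration is usually judged over large numbers of predictions.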
Human crowds versus model ensembles
Forecast aggregation originated in human forecasting communities, where large groups of participants independently estimate future events and their predictions are combined.
In forecasting tournaments, the crowd forecast often beats most individual participants. The aggregate benefits from diverse perspectives, different information sources, and varied reasoning styles. This is one reason why forecasting platforms rely heavily on crowd averages rather than highlighting a single forecaster’s opinion. [Esoteric Library, Wikipedia]
Researchers have increasingly applied the same idea to AI systems through model ensembles. An ensemble may combine:
- Different language models. [arxiv.org]
- Multiple prompts given to the same model.
- Multiple reasoning chains.
- Separate forecasts generated at different times.
- Human and machine forecasts together.
A notable 2024 study tested an ensemble of twelve large language models against a forecasting crowd of 925 humans. Individual models lagged behind the human aggregate, but the combined model ensemble achieved forecasting accuracy statistically comparable to the human crowd. The researchers concluded that the familiar wisdom-of-crowds effect can also emerge among AI systems. [arXiv]
The same study found another important result: when models were shown the median human forecast, their predictions improved. Yet simply averaging human and AI forecasts often performed even better. This suggests that human and machine forecasts contain partly independent information, creating additional gains when combined. [arXiv]
How aggregation is implemented in AI forecasting systems
Modern AI forecasting systems rarely rely on a single model output.
Instead, they often generate multiple candidate forecasts and then combine them. The combination process may be as simple as taking an arithmetic average or as sophisticated as weighting forecasts according to past accuracy.
Common aggregation approaches include:
- Simple averaging: Every forecast receives equal weight.
- Weighted averaging: More accurate models receive greater influence.
- Median aggregation: Reduces the effect of extreme forecasts.
- Hybrid aggregation: Combines human judgments and machine forecasts.
- Ensemble probability distributions: Merges complete probability estimates rather than single-point predictions.
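The first three approaches in the list can be sketched in a few lines. The function names, the example forecasts, and the inverse-Brier weighting rule below are illustrative assumptions; real systems may weight forecasts in other ways.

```python
from statistics import mean, median

def simple_average(probs):
    """Equal weight for every forecast."""
    return mean(probs)

def weighted_average(probs, weights):
    """Forecasts with larger weights have more influence on the result."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(p * w for p, w in zip(probs, weights)) / total

def median_aggregate(probs):
    """Middle value; a single extreme forecast barely moves it."""
    return median(probs)

def inverse_brier_weights(past_brier_scores, eps=1e-6):
    """One hypothetical weighting rule: a lower past Brier score gives a higher weight."""
    return [1.0 / (score + eps) for score in past_brier_scores]

# Hypothetical forecasts for one question; the last one is an outlier
# from a source with a poor track record.
forecasts = [0.20, 0.25, 0.30, 0.90]
past_brier = [0.10, 0.10, 0.10, 0.40]

print(simple_average(forecasts))    # pulled upward by the outlier
print(median_aggregate(forecasts))  # largely ignores the outlier
print(weighted_average(forecasts, inverse_brier_weights(past_brier)))
```

Here the simple average lands at about 0.41, while the median (about 0.28) and the accuracy-weighted average (about 0.30) stay closer to the three forecasts that agree. Which behaviour is preferable depends on whether the outlier reflects noise or genuinely different information.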
Research on deep-learning forecast ensembles has repeatedly found that combining multiple forecast distributions improves predictive performance compared with relying on a single model run. Aggregation can also correct systematic weaknesses that appear consistently across individual forecasts. [arXiv]
Forecasting systems designed to approach human-level performance increasingly treat aggregation as a core component rather than a final adjustment. Retrieval-augmented forecasting systems, for example, may generate multiple forecasts after gathering information and then aggregate the outputs into a single probability estimate. [arXiv]
A concrete example: from chatbots to weather forecasting
The value of aggregation becomes especially visible in weather prediction.
Traditional weather centres have long relied on ensemble forecasting, running many simulations with slightly different starting conditions. Instead of producing one answer, they generate a range of plausible futures and estimate probabilities across them.
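One simple way to turn an ensemble into a probability is to count the fraction of members in which the event occurs. The sketch below assumes ten hypothetical ensemble members and a hypothetical 1 mm rainfall threshold; operational centres typically apply further statistical post-processing that is not shown here.

```python
def exceedance_probability(member_values, threshold):
    """Fraction of ensemble members whose value is at or above the threshold."""
    if not member_values:
        raise ValueError("ensemble must contain at least one member")
    hits = sum(1 for value in member_values if value >= threshold)
    return hits / len(member_values)

# Hypothetical 24-hour rainfall totals (mm), one per ensemble member,
# each from a run with slightly different starting conditions.
rain_mm = [0.0, 0.2, 1.5, 3.0, 0.0, 6.1, 2.4, 0.8, 4.9, 0.1]

p_rain = exceedance_probability(rain_mm, threshold=1.0)
print(f"Probability of at least 1 mm of rain: {p_rain:.0%}")
```

Five of the ten members reach the threshold, so the ensemble assigns a probability of 50% rather than committing to a single "rain" or "no rain" answer.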
New AI weather systems are adopting the same principle. Google DeepMind’s GenCast model generates large ensembles of forecasts rather than a single best guess. By producing dozens of alternative weather trajectories and combining them into probabilistic forecasts, the system can better quantify uncertainty and improve prediction quality. Researchers and forecasters view this ensemble approach as a major reason for the model’s strong performance. [Axios]
This illustrates a broader lesson for artificial intelligence: uncertainty is often represented more accurately when multiple forecasts are considered together rather than forcing the system to commit to one prediction.
When aggregation still fails
Aggregation is powerful, but it is not magic.
The biggest limitation is correlated error. If all forecasters make the same mistake, averaging cannot remove it. A crowd that shares the same false assumption may confidently converge on the wrong answer. The same problem affects model ensembles built from highly similar AI systems. [Scattered Thoughts]
Diversity therefore matters as much as quantity. Ten nearly identical forecasts may add little value compared with a smaller group that uses different information and reasoning strategies.
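The trade-off between diversity and quantity can be made concrete with a standard statistical identity. If n forecasts have unbiased errors with the same variance sigma2 and the same pairwise correlation rho, the error variance of their average is sigma2 * (1 + (n - 1) * rho) / n. The sketch below applies that identity; the equal-variance, equal-correlation setup and the numbers are simplifying assumptions, and the function name is hypothetical.

```python
def variance_of_mean(n: int, sigma2: float, rho: float) -> float:
    """Error variance of the average of n forecasts whose errors share
    variance sigma2 and pairwise correlation rho (simplified model)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n > 1 and not (-1.0 / (n - 1) <= rho <= 1.0):
        raise ValueError("rho is outside the valid range for this n")
    return sigma2 * (1.0 + (n - 1) * rho) / n

sigma2 = 0.04  # hypothetical error variance of one forecast

independent_ten = variance_of_mean(10, sigma2, rho=0.0)  # fully diverse
similar_ten = variance_of_mean(10, sigma2, rho=0.9)      # nearly identical
diverse_three = variance_of_mean(3, sigma2, rho=0.1)     # small but diverse

print(f"10 independent forecasts:       {independent_ten:.4f}")
print(f"10 highly correlated forecasts: {similar_ten:.4f}")
print(f"3 weakly correlated forecasts:  {diverse_three:.4f}")
```

Under these assumptions, ten highly correlated forecasts leave an error variance of about 0.036, barely below the 0.04 of a single forecast, while three weakly correlated forecasts reduce it to 0.016. As n grows, the variance approaches rho * sigma2 rather than zero, which is why adding more similar forecasters stops helping. A bias shared by every forecaster is not reduced by averaging at all.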
Another challenge is systematic bias. Research on language-model forecasting has identified tendencies such as overestimating positive outcomes or producing overly similar forecasts across prompts. Aggregation can reduce some of these effects but may not eliminate them if the bias is shared broadly across the ensemble. [arXiv]
Finally, aggregation cannot compensate for a complete lack of information. If neither humans nor models possess meaningful evidence about an event, combining many guesses simply produces a more stable guess. Forecast quality still depends on the quality of the underlying information. [Scattered Thoughts]
What aggregation reveals about AI uncertainty
Forecast aggregation highlights an important distinction between generating answers and estimating uncertainty. A chatbot can produce a confident prediction, but confidence alone says little about whether the prediction is trustworthy.
By comparing and combining multiple forecasts, researchers gain a clearer picture of what the system actually knows and where uncertainty remains. ForecastBench and related forecasting studies show that while individual language models often trail expert human forecasters, aggregated model forecasts perform substantially better and can sometimes approach crowd-level forecasting accuracy. [arXiv, OpenReview]
For understanding artificial intelligence, this is a significant lesson. The future of AI forecasting may depend less on finding a single perfect predictor and more on building systems that combine many imperfect predictions into a better-calibrated view of an uncertain world. [arXiv]
Endnotes
- Source: arxiv.org
  Link: https://arxiv.org/html/2402.19379v4
  Source snippet: LLM Ensemble Prediction Capabilities Rival Human Crowd... Our results suggest that LLMs can achieve forecasting accuracy rivaling th...
- Source: arxiv.org
  Link: https://arxiv.org/abs/2402.19379
- Source: arxiv.org
  Title: Aggregating distribution forecasts from deep ensembles
  Link: https://arxiv.org/abs/2204.02291
  Published: April 5, 2022
- Source: arxiv.org
  Title: Approaching Human-Level Forecasting with Language Models
  Link: https://arxiv.org/abs/2402.18563
- Source: forecastbench.org
  Link: https://www.forecastbench.org/
  Source snippet: ForecastBench: A dynamic, contamination-free benchmark of LLM forecasting accuracy with human comparison groups, serving as a...
- Source: Wikipedia
  Title: The Good Judgment Project
  Link: https://en.wikipedia.org/wiki/The_Good_Judgment_Project
  Source snippet: The Good Judgment Project (GJP) is an organization dedicated to "harnessing the wisdom of the crowd to foreca...
- Source: axios.com
  Title: New Google AI weather model beats most reliable forecast system
  Link: https://www.axios.com/2024/12/04/google-ai-weather-model-more-reliable
  Source snippet: This innovation marks a major breakthrough in ensemble-based forecasting, where forecasts are generated through multiple simulations usin...
- Source: scattered-thoughts.net
  Link: https://www.scattered-thoughts.net/blog/2016/01/28/notes-on-superforecasting-the-art-and-science-of-prediction
  Source snippet: Notes on 'Superforecasting: The Art and Science of... 28 Jan 2016 - Wisdom of the crowds - average group predictions ar...
- Source: arxiv.org
  Title: ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities
  Link: https://arxiv.org/abs/2409.19839
  Published: September 30, 2024
- Source: openreview.net
  Link: https://openreview.net/forum?id=lfPkGWXLLf
  Source snippet: ForecastBench: A Dynamic Benchmark of AI Forecasting... by E Karger · Cited by 57 - The paper introduces ForecastBench, a dynam...
- Source: openreview.net
  Link: https://openreview.net/pdf?id=MKqb0aB1e6
  Source snippet: ASSESSING LARGE LANGUAGE MODELS IN UPDATING... by M Yuan · Cited by 2 - A growing body of research has investigated the forecasting capab...
- Source: forecastbench.org
  Link: https://www.forecastbench.org/explore/
  Source snippet: This interactive visualization charts the evolution of AI forecasting accuracy on ForecastBench.
- Source: forecastbench.org
  Link: https://www.forecastbench.org/docs/
  Source snippet: The repository contains the full pipeline for generating forecasting questions from time-series data, evaluating LLM and human foreca...
- Source: goodjudgment.com
  Link: https://goodjudgment.com/about/the-science-of-superforecasting/
  Source snippet: The Science Of Superforecasting: Good Judgment research discovered four keys to accurate forecasting: talent-spotting, trainin...
- Source: aiimpacts.org
  Link: https://aiimpacts.org/evidence-on-good-forecasting-practices-from-the-good-judgment-project-an-accompanying-blog-post/
  Source snippet: Evidence on good forecasting practices from the... Jul 2, 2019 - “Teams of ordinary forecasters beat the wisdom of the crowd by...
- Source: esotericlibrary.weebly.com
  Link: https://esotericlibrary.weebly.com/uploads/5/0/7/7/5077636/philip_e.tetlock-_superforecasting_the_art_and_science_of_prediction.pdf
  Source snippet: Superforecasting: Surowiecki's bestseller The Wisdom of Crowds. Aggregating the judgment of many consistently beats the acc...
- Source: emergentmind.com
  Link: https://www.emergentmind.com/topics/forecastbench
  Source snippet: Dynamic AI Forecast Benchmark, 20 Feb 2026 - ForecastBench is a dynamic benchmark evaluating AI forecasting with contamination-free, contin...
- Source: aiimpacts.org
  Link: https://aiimpacts.org/evidence-on-good-forecasting-practices-from-the-good-judgment-project/
  Source snippet: 1.2. Correlates of successful forecasting.
- Source: liner.com
  Title: ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities
  Link: https://liner.com/review/forecastbench-a-dynamic-benchmark-of-ai-forecasting-capabilities
  Source snippet: ForecastBench: A Dynamic Benchmark of AI Forecasting... Sep 30, 2024 - ForecastBench evaluates LLMs against human forecasters, showing th...
- Source: forum.effectivealtruism.org
  Title: announcing forecastbench a new benchmark for ai and human
  Link: https://forum.effectivealtruism.org/posts/zwzgR8iuFEcJms3Hu/announcing-forecastbench-a-new-benchmark-for-ai-and-human
  Source snippet: ForecastBench, a new benchmark for AI and... 1 Oct 2024 - ForecastBench is a new dynamic benchmark for evaluating AI and human forecastin...
- Source: iclr.cc
  Link: https://iclr.cc/media/iclr-2025/Slides/28507.pdf
Additional References
- Source: researchgate.net
  Link: https://www.researchgate.net/publication/403873460_Crowdsourced_versus_large_language_models_forecasting_evidence_for_the_accuracy-correlation_effect
  Source snippet: Crowdsourced versus large language models forecasting, 19 Apr 2026 - Using 76 model × prompt forecast sets from 16 LLMs on 580 resolved For...
- Source: medium.com
  Link: https://medium.com/%40mahdi-ghafarian/aggregating-forecasts-the-wisdom-and-limits-of-the-crowd-bd0d51f6c502
  Source snippet: Aggregating Forecasts: The Wisdom - and Limits. Aggregating their forecasts - whether through simple averages or extremizing techniques - a...
- Source: linkedin.com
  Link: https://www.linkedin.com/pulse/superforecasting-philip-e-tetlock-dan-gardner-juan-carlos-zambrano
- Source: theguardian.com
  Link: https://www.theguardian.com/science/2024/dec/04/google-deepmind-predicts-weather-more-accurately-than-leading-system
  Source snippet: GenCast is proficient in predicting day-to-day weather and extreme events up to 15 days ahead and surpasses ENS in forecasting hurricane...
- Source: researchgate.net
  Title: Distilling the Wisdom of Crowds: Prediction Markets vs Prediction Polls
  Link: https://www.researchgate.net/publication/281765164_Distilling_the_Wisdom_of_Crowds_Prediction_Markets_vs_Prediction_Polls
  Source snippet: (PDF) Distilling the Wisdom of Crowds: Prediction Markets... 9 Feb 2026 - We report the results of the first large-scale, long-term, expe...
- Source: researchgate.net
  Title: ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities
  Link: https://www.researchgate.net/publication/384502750_ForecastBench_A_Dynamic_Benchmark_of_AI_Forecasting_Capabilities
  Source snippet: A Dynamic Benchmark of AI Forecasting Capabilities, 30 Sept 2024 - To address this gap, we introduce ForecastBench: a dynamic benchmark tha...
- Source: pmc.ncbi.nlm.nih.gov
  Title: Wisdom of the silicon crowd: LLM ensemble prediction
  Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC11800985/
  Source snippet: by P Schoenegger · 2024 · Cited by 97 - Our findings suggest that LLM predictions can rival the human crowd's forecasting accuracy thr...
- Source: alignmentforum.org
  Title: Approaching Human-Level Forecasting with Language Models
  Link: https://www.alignmentforum.org/posts/K2F9g2aQubd7kwEr3/approaching-human-level-forecasting-with-language-models-2
  Source snippet: Approaching Human-Level Forecasting with Language... 29 Feb 2024 - We develop a retrieval-augmented LM system designed to automatically s...
- Source: lesswrong.com
  Title: Approaching Human-Level Forecasting with Language Models
  Link: https://www.lesswrong.com/posts/K2F9g2aQubd7kwEr3/approaching-human-level-forecasting-with-language-models-2
  Source snippet: Approaching Human-Level Forecasting with Language... 29 Feb 2024 - We develop a retrieval-augmented LM system designed to automatically s...
- Source: researchgate.net
  Title: (PDF) Superforecasting: The Art and Science of Prediction
  Link: https://www.researchgate.net/publication/304924623_Superforecasting_The_Art_and_Science_of_Prediction_By_Philip_Tetlock_and_Dan_Gardner
  Source snippet: 5 Jul 2016 - This is an excellent book to read. It is not only informative, as it should be for a book on forecasting, but it is highly e...