Why crowds and ensembles forecast better

Combining many forecasts can reduce individual errors and make AI predictions more reliable than a single model answer.

Introduction

Forecasting is useful for understanding artificial intelligence because it forces systems to deal with uncertainty rather than retrieve known answers. One of the most important lessons from forecasting research is that a single prediction is often less reliable than a carefully combined set of predictions. This principle, known as forecast aggregation, has long improved human forecasting and is increasingly improving AI forecasting as well.

Rather than asking one person or one model for the answer, aggregation combines multiple probability estimates into a single forecast. The result is often better calibrated, less vulnerable to individual mistakes, and more accurate over time. Research on both human forecasting tournaments and large language models shows that combining diverse forecasts can substantially narrow the gap between individual predictions and consistently strong performance. [arXiv]

Why averaging can improve calibration

The central idea behind aggregation is simple: different forecasters make different errors.

One forecaster may be too optimistic. Another may be too pessimistic. A third may focus on the wrong evidence. When these judgments are combined, some of the errors cancel out. The combined forecast is often closer to reality than most of the individual forecasts that went into it.
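The cancellation effect can be illustrated with a small simulation. This is a toy sketch rather than anything drawn from the cited studies: it assumes each forecaster reports the true probability plus independent, unbiased Gaussian noise, and the function name `simulate_crowd` and every parameter value are hypothetical.

```python
import random
import statistics


def simulate_crowd(n_forecasters=25, n_questions=2000, noise_sd=0.10, seed=0):
    """Compare the typical individual error with the error of the crowd average.

    Assumption: every forecaster sees the true probability plus independent,
    unbiased Gaussian noise. Real forecasters are rarely this well behaved.
    """
    rng = random.Random(seed)
    individual_errors = []
    crowd_errors = []
    for _ in range(n_questions):
        truth = rng.uniform(0.3, 0.7)  # hypothetical "true" probability
        forecasts = [
            min(1.0, max(0.0, truth + rng.gauss(0.0, noise_sd)))  # clip to [0, 1]
            for _ in range(n_forecasters)
        ]
        crowd = statistics.fmean(forecasts)
        individual_errors.append(
            statistics.fmean([abs(f - truth) for f in forecasts])
        )
        crowd_errors.append(abs(crowd - truth))
    return statistics.fmean(individual_errors), statistics.fmean(crowd_errors)


individual_mae, crowd_mae = simulate_crowd()
print(f"mean individual error: {individual_mae:.3f}")
print(f"mean crowd error:      {crowd_mae:.3f}")
```

Under these idealised assumptions the crowd error shrinks roughly with the square root of the number of forecasters. The gain is smaller when forecasters share the same blind spots.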

Forecasting researchers describe this as a form of the “wisdom of crowds”. Studies associated with the Good Judgment Project found that aggregation, alongside training and teamwork, was one of the major drivers of forecasting improvement. Teams and aggregated forecasts consistently outperformed isolated individuals because they reduced the impact of idiosyncratic mistakes. [Good Judgment; AI Impacts]

For AI systems, the same mechanism applies. A language model can produce slightly different forecasts depending on the prompt, model version, retrieved information, or reasoning path. Each forecast contains signal but also noise. Averaging across multiple forecasts often preserves the signal while reducing the noise. This tends to improve calibration, the match between stated probabilities and real-world outcomes. [arXiv]

The improvement is especially important because forecasting is fundamentally probabilistic. Decision-makers usually need estimates such as “30% likely” or “70% likely”, not merely yes-or-no answers. Aggregation helps those probabilities become more reliable over large numbers of predictions. [ForecastBench]
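Reliability over many predictions can be checked numerically. The sketch below uses the Brier score, a standard scoring rule for probability forecasts that the article itself does not name, together with a simple binned calibration table. The function names and the sample data are made up for illustration.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes.

    Lower is better; always answering 0.5 scores 0.25.
    """
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(probabilities)


def calibration_table(probabilities, outcomes, n_bins=5):
    """Group forecasts into equal-width bins and compare stated vs observed rates.

    Returns one (mean stated probability, observed frequency, count) tuple
    per non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probabilities, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((p, o))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            observed = sum(o for _, o in members) / len(members)
            rows.append((mean_p, observed, len(members)))
    return rows


# Made-up data: ten "30% likely" forecasts of which three came true,
# and ten "70% likely" forecasts of which seven came true.
probabilities = [0.3] * 10 + [0.7] * 10
outcomes = [1] * 3 + [0] * 7 + [1] * 7 + [0] * 3

print(f"Brier score: {brier_score(probabilities, outcomes):.3f}")
for mean_p, observed, count in calibration_table(probabilities, outcomes):
    print(f"stated {mean_p:.2f}  observed {observed:.2f}  n={count}")
```

A forecaster is well calibrated when the stated and observed columns roughly agree in every bin, as they do in this constructed example.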

Human crowds versus model ensembles

Forecast aggregation originated in human forecasting communities, where large groups of participants independently estimate future events and their predictions are combined.

In forecasting tournaments, the crowd forecast often beats most individual participants. The aggregate benefits from diverse perspectives, different information sources, and varied reasoning styles. This is one reason why forecasting platforms rely heavily on crowd averages rather than highlighting a single forecaster’s opinion. [Esoteric Library; Wikipedia]

Researchers have increasingly applied the same idea to AI systems through model ensembles. An ensemble may combine:

  • Different language models. [arXiv]
  • Multiple prompts given to the same model.
  • Multiple reasoning chains.
  • Separate forecasts generated at different times.
  • Human and machine forecasts together.

A notable 2024 study tested an ensemble of twelve large language models against a forecasting crowd of 925 humans. Individual models lagged behind the human aggregate, but the combined model ensemble achieved forecasting accuracy statistically comparable to the human crowd. The researchers concluded that the familiar wisdom-of-crowds effect can also emerge among AI systems. [arXiv]

The same study found another important result: when models were shown the median human forecast, their predictions improved. Yet simply averaging human and AI forecasts often performed even better. This suggests that human and machine forecasts contain partly independent information, creating additional gains when combined. [arXiv]

How aggregation is implemented in AI forecasting systems

Modern AI forecasting systems rarely rely on a single model output.

Instead, they often generate multiple candidate forecasts and then combine them. The combination process may be as simple as taking an arithmetic average or as sophisticated as weighting forecasts according to past accuracy.

Common aggregation approaches include:

  • Simple averaging: Every forecast receives equal weight.
  • Weighted averaging: More accurate models receive greater influence.
  • Median aggregation: Reduces the effect of extreme forecasts.
  • Hybrid aggregation: Combines human judgments and machine forecasts.
  • Ensemble probability distributions: Merges complete probability estimates rather than single-point predictions.
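The first three approaches in this list can be sketched in a few lines. The forecasts, the past scores, and the inverse-score weighting rule below are illustrative assumptions, not the method of any system cited here. Weighting by inverse past Brier score is only one of many possible schemes.

```python
import statistics


def simple_average(forecasts):
    """Every forecast receives equal weight."""
    return statistics.fmean(forecasts)


def weighted_average(forecasts, weights):
    """Forecasts with larger weights pull the result towards themselves."""
    return sum(f * w for f, w in zip(forecasts, weights)) / sum(weights)


def median_aggregate(forecasts):
    """The middle forecast; a single extreme value barely moves it."""
    return statistics.median(forecasts)


def inverse_score_weights(past_brier_scores):
    """Hypothetical weighting rule: a lower (better) past Brier score earns more weight."""
    return [1.0 / max(score, 1e-6) for score in past_brier_scores]


# Made-up probabilities for one question; the last forecaster is an outlier.
forecasts = [0.62, 0.58, 0.65, 0.05]
# Made-up track records (lower is better); the outlier also has the worst record.
past_brier_scores = [0.10, 0.20, 0.20, 0.40]

mean_forecast = simple_average(forecasts)
median_forecast = median_aggregate(forecasts)
weighted_forecast = weighted_average(forecasts, inverse_score_weights(past_brier_scores))

print(f"simple average:   {mean_forecast:.3f}")
print(f"median:           {median_forecast:.3f}")
print(f"weighted average: {weighted_forecast:.3f}")
```

In this example the outlier drags the simple average well below the other three forecasts, while the median and the weighted average stay closer to them. Which behaviour is preferable depends on whether the outlier is noise or a genuinely informed dissent.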

Research on deep-learning forecast ensembles has repeatedly found that combining multiple forecast distributions improves predictive performance compared with relying on a single model run. Aggregation can also correct systematic weaknesses that appear consistently across individual forecasts. [arXiv]
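One common way to combine whole distributions is a linear pool, which averages the probability each ensemble member assigns to every outcome. The sketch below is a minimal illustration under that assumption. It is not claimed to be the method used in the cited research, and the outcome labels and numbers are invented.

```python
def linear_pool(distributions, weights=None):
    """Average several discrete probability distributions outcome by outcome.

    Each distribution is a dict mapping the same outcome labels to probabilities.
    With non-negative weights, the pooled result is again a valid distribution.
    """
    if weights is None:
        weights = [1.0] * len(distributions)
    total_weight = sum(weights)
    outcomes = distributions[0].keys()
    return {
        outcome: sum(w * d[outcome] for w, d in zip(weights, distributions)) / total_weight
        for outcome in outcomes
    }


# Three hypothetical ensemble members forecasting next year's inflation band.
members = [
    {"below 2%": 0.2, "2% to 4%": 0.6, "above 4%": 0.2},
    {"below 2%": 0.1, "2% to 4%": 0.5, "above 4%": 0.4},
    {"below 2%": 0.3, "2% to 4%": 0.4, "above 4%": 0.3},
]

pooled = linear_pool(members)
for outcome, probability in pooled.items():
    print(f"{outcome}: {probability:.2f}")
```

Because the pool keeps every member's probability mass, it tends to be wider than any single member's distribution, which can be either a feature or a drawback depending on how overconfident the members are.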

Forecasting systems designed to approach human-level performance increasingly treat aggregation as a core component rather than a final adjustment. Retrieval-augmented forecasting systems, for example, may generate multiple forecasts after gathering information and then aggregate the outputs into a single probability estimate. [arXiv]
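The generate-then-aggregate pattern might look like the following sketch. Here `sample_forecast` is a hypothetical callable standing in for the whole "retrieve information, prompt the model, parse a probability" step, and the trimmed mean is an illustrative choice of aggregator rather than a confirmed detail of any cited system.

```python
import statistics


def trimmed_mean(values, trim_fraction=0.2):
    """Drop the lowest and highest trim_fraction of values, then average the rest."""
    if not 0.0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number dropped from each end
    kept = ordered[k:len(ordered) - k]
    return statistics.fmean(kept)


def forecast_with_aggregation(question, sample_forecast, n_samples=5, trim_fraction=0.2):
    """Draw several forecasts for one question and aggregate them into one estimate."""
    samples = [sample_forecast(question) for _ in range(n_samples)]
    return trimmed_mean(samples, trim_fraction)


# Stub in place of a real model call: returns canned probabilities in order.
canned = iter([0.55, 0.60, 0.95, 0.58, 0.62])
estimate = forecast_with_aggregation(
    "Will the event happen by year end?",  # placeholder question text
    sample_forecast=lambda question: next(canned),
)
print(f"aggregated estimate: {estimate:.2f}")
```

With five samples and a trim fraction of 0.2, the single lowest and single highest values are discarded, so the stray 0.95 does not affect the result.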

A concrete example: from chatbots to weather forecasting

The value of aggregation becomes especially visible in weather prediction.

Traditional weather centres have long relied on ensemble forecasting, running many simulations with slightly different starting conditions. Instead of producing one answer, they generate a range of plausible futures and estimate probabilities across them.
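The idea of perturbing starting conditions can be shown with a toy chaotic system. The sketch below uses the logistic map as a stand-in for an atmosphere model, which is a drastic simplification. All names and parameter values are assumptions chosen for illustration.

```python
import random


def run_member(x0, steps=50, r=3.9):
    """Iterate the logistic map, a standard toy example of chaotic dynamics."""
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
    return x


def ensemble_probability(x0=0.5, n_members=50, perturbation_sd=1e-4,
                         threshold=0.5, seed=0):
    """Estimate P(final state > threshold) from slightly perturbed starting points."""
    rng = random.Random(seed)
    members = [
        run_member(x0 + rng.gauss(0.0, perturbation_sd))
        for _ in range(n_members)
    ]
    probability = sum(m > threshold for m in members) / n_members
    return probability, members


probability, members = ensemble_probability()
print(f"ensemble spread: {min(members):.2f} to {max(members):.2f}")
print(f"estimated P(state > 0.5): {probability:.2f}")
```

Although the starting points differ only in the fourth decimal place, the members end up widely scattered, so a single run would give a misleadingly precise answer. The fraction of members above the threshold serves as the probabilistic forecast.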

New AI weather systems are adopting the same principle. Google DeepMind’s GenCast model generates large ensembles of forecasts rather than a single best guess. By producing dozens of alternative weather trajectories and combining them into probabilistic forecasts, the system can better quantify uncertainty and improve prediction quality. Researchers and forecasters view this ensemble approach as a major reason for the model’s strong performance. [Axios]

This illustrates a broader lesson for artificial intelligence: uncertainty is often represented more accurately when multiple forecasts are considered together rather than forcing the system to commit to one prediction.

When aggregation still fails

Aggregation is powerful, but it is not magic.

The biggest limitation is correlated error. If all forecasters make the same mistake, averaging cannot remove it. A crowd that shares the same false assumption may confidently converge on the wrong answer. The same problem affects model ensembles built from highly similar AI systems. [Scattered Thoughts]

Diversity therefore matters as much as quantity. Ten nearly identical forecasts may add little value compared with a smaller group that uses different information and reasoning strategies.
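A standard statistical identity, not taken from the cited sources, makes this concrete. If n forecast errors each have standard deviation sigma and every pair has the same correlation rho, the variance of their average is sigma² × (rho + (1 − rho) / n). The sketch below evaluates that formula for two hypothetical ensembles. The specific values of sigma and rho are assumptions.

```python
def variance_of_mean(n, sigma, rho):
    """Variance of the average of n errors with equal variance and equal pairwise correlation.

    As n grows this approaches rho * sigma**2, not zero: the shared part of
    the error never averages out.
    """
    return sigma ** 2 * (rho + (1.0 - rho) / n)


sigma = 0.2  # assumed error standard deviation of each individual forecast

ten_similar = variance_of_mean(n=10, sigma=sigma, rho=0.9)   # nearly identical forecasts
four_diverse = variance_of_mean(n=4, sigma=sigma, rho=0.1)   # smaller but diverse group

print(f"10 forecasts, rho=0.9: variance {ten_similar:.4f}")
print(f" 4 forecasts, rho=0.1: variance {four_diverse:.4f}")
```

Under these assumed numbers the four diverse forecasts give a noticeably lower-variance average than the ten highly correlated ones, and adding more correlated forecasts cannot push the variance below rho × sigma².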

Another challenge is systematic bias. Research on language-model forecasting has identified tendencies such as overestimating positive outcomes or producing overly similar forecasts across prompts. Aggregation can reduce some of these effects but may not eliminate them if the bias is shared broadly across the ensemble. [arXiv]

Finally, aggregation cannot compensate for a complete lack of information. If neither humans nor models possess meaningful evidence about an event, combining many guesses simply produces a more stable guess. Forecast quality still depends on the quality of the underlying information. [Scattered Thoughts]

What aggregation reveals about AI uncertainty

Forecast aggregation highlights an important distinction between generating answers and estimating uncertainty. A chatbot can produce a confident prediction, but confidence alone says little about whether the prediction is trustworthy.

By comparing and combining multiple forecasts, researchers gain a clearer picture of what the system actually knows and where uncertainty remains. ForecastBench and related forecasting studies show that while individual language models often trail expert human forecasters, aggregated model forecasts perform substantially better and can sometimes approach crowd-level forecasting accuracy. [arXiv; OpenReview]

For understanding artificial intelligence, this is a significant lesson. The future of AI forecasting may depend less on finding a single perfect predictor and more on building systems that combine many imperfect predictions into a better-calibrated view of an uncertain world. [arXiv]

Endnotes

  1. Source: arxiv.org
    Link: https://arxiv.org/html/2402.19379v4
    Source snippet

    LLM Ensemble Prediction Capabilities Rival Human Crowd...Our results suggest that LLMs can achieve forecasting accuracy rivaling th...

  2. Source: arxiv.org
    Link: https://arxiv.org/abs/2402.19379

  3. Source: arxiv.org
    Title: arXiv Aggregating distribution forecasts from deep ensembles
    Link: https://arxiv.org/abs/2204.02291
    Source snippet

    Aggregating distribution forecasts from deep ensemblesApril 5, 2022...

    Published: April 5, 2022

  4. Source: arxiv.org
    Title: arXiv Approaching Human-Level Forecasting with Language Models
    Link: https://arxiv.org/abs/2402.18563

  5. Source: forecastbench.org
    Link: https://www.forecastbench.org/
    Source snippet

    ForecastBench: A dynamic, contamination-free benchmark of LLM forecasting accuracy with human comparison groups, serving as a...

  6. Source: Wikipedia
    Title: The Good Judgment Project
    Link: https://en.wikipedia.org/wiki/The_Good_Judgment_Project
    Source snippet

    The Good Judgment ProjectThe Good Judgment Project (GJP) is an organization dedicated to "harnessing the wisdom of the crowd to foreca...

  7. Source: axios.com
    Title: New Google AI weather model beats most reliable forecast system
    Link: https://www.axios.com/2024/12/04/google-ai-weather-model-more-reliable
    Source snippet

    This innovation marks a major breakthrough in ensemble-based forecasting, where forecasts are generated through multiple simulations usin...

  8. Source: scattered-thoughts.net
    Link: https://www.scattered-thoughts.net/blog/2016/01/28/notes-on-superforecasting-the-art-and-science-of-prediction
    Source snippet

    Scattered ThoughtsNotes on 'Superforecasting: The Art and Science of...28 Jan 2016 — Wisdom of the crowds - average group predictions ar...

  9. Source: arxiv.org
    Title: arXiv Forecast Bench: A Dynamic Benchmark of AI Forecasting Capabilities
    Link: https://arxiv.org/abs/2409.19839
    Source snippet

    ForecastBench: A Dynamic Benchmark of AI Forecasting CapabilitiesSeptember 30, 2024...

    Published: September 30, 2024

  10. Source: openreview.net
    Link: https://openreview.net/forum?id=lfPkGWXLLf
    Source snippet

    ForecastBench: A Dynamic Benchmark of AI Forecasting...by E Karger · Cited by 57 — The paper introduces ForecastBench, a dynam...

  11. Source: openreview.net
    Link: https://openreview.net/pdf?id=MKqb0aB1e6
    Source snippet

    ASSESSING LARGE LANGUAGE MODELS IN UPDATING...by M Yuan · Cited by 2 — A growing body of research has investigated the forecasting capab...

  12. Source: forecastbench.org
    Link: https://www.forecastbench.org/explore/
    Source snippet

    This interactive visualization charts the evolution of AI forecasting accuracy on ForecastBench.Read more...

  13. Source: forecastbench.org
    Link: https://www.forecastbench.org/docs/
    Source snippet

    DocsThe repository contains the full pipeline for generating forecasting questions from time-series data, evaluating LLM and human foreca...

  14. Source: goodjudgment.com
    Link: https://goodjudgment.com/about/the-science-of-superforecasting/
    Source snippet

    Good JudgmentThe Science Of SuperforecastingGood Judgment research discovered four keys to accurate forecasting: talent-spotting, trainin...

  15. Source: aiimpacts.org
    Link: https://aiimpacts.org/evidence-on-good-forecasting-practices-from-the-good-judgment-project-an-accompanying-blog-post/
    Source snippet

    AI ImpactsEvidence on good forecasting practices from the...Jul 2, 2019 — “Teams of ordinary forecasters beat the wisdom of the crowd by...

  16. Source: esotericlibrary.weebly.com
    Link: https://esotericlibrary.weebly.com/uploads/5/0/7/7/5077636/philip_e.tetlock-_superforecasting_the_art_and_science_of_prediction.pdf
    Source snippet

    Esoteric LibrarySuperforecastingSurowiecki's bestseller The Wisdom of Crowds. Aggregating the judgment of many consistently beats the acc...

  17. Source: emergentmind.com
    Link: https://www.emergentmind.com/topics/forecastbench
    Source snippet

    Dynamic AI Forecast Benchmark20 Feb 2026 — ForecastBench is a dynamic benchmark evaluating AI forecasting with contamination-free, contin...

  18. Source: aiimpacts.org
    Link: https://aiimpacts.org/evidence-on-good-forecasting-practices-from-the-good-judgment-project/
    Source snippet

    1.2. Correlates of successful forecasting.Read more...

  19. Source: liner.com
    Title: forecastbench a dynamic benchmark of ai forecasting capabilities
    Link: https://liner.com/review/forecastbench-a-dynamic-benchmark-of-ai-forecasting-capabilities
    Source snippet

    ForecastBench: A Dynamic Benchmark of AI Forecasting...Sep 30, 2024 — ForecastBench evaluates LLMs against human forecasters, showing th...

  20. Source: forum.effectivealtruism.org
    Title: announcing forecastbench a new benchmark for ai and human
    Link: https://forum.effectivealtruism.org/posts/zwzgR8iuFEcJms3Hu/announcing-forecastbench-a-new-benchmark-for-ai-and-human
    Source snippet

    ForecastBench, a new benchmark for AI and...1 Oct 2024 — ForecastBench is a new dynamic benchmark for evaluating AI and human forecastin...

  21. Source: iclr.cc
    Link: https://iclr.cc/media/iclr-2025/Slides/28507.pdf

Additional References

  1. Source: researchgate.net
    Link: https://www.researchgate.net/publication/403873460_Crowdsourced_versus_large_language_models_forecasting_evidence_for_the_accuracy-correlation_effect
    Source snippet

    Crowdsourced versus large language models forecasting19 Apr 2026 — Using 76 model × prompt forecast sets from 16 LLMs on 580 resolved For...

  2. Source: medium.com
    Link: https://medium.com/%40mahdi-ghafarian/aggregating-forecasts-the-wisdom-and-limits-of-the-crowd-bd0d51f6c502
    Source snippet

    Aggregating Forecasts: The Wisdom — and LimitsAggregating their forecasts — whether through simple averages or extremizing techniques — a...

  3. Source: linkedin.com
    Link: https://www.linkedin.com/pulse/superforecasting-philip-e-tetlock-dan-gardner-juan-carlos-zambrano

  4. Source: theguardian.com
    Link: https://www.theguardian.com/science/2024/dec/04/google-deepmind-predicts-weather-more-accurately-than-leading-system
    Source snippet

    GenCast is proficient in predicting day-to-day weather and extreme events up to 15 days ahead and surpasses ENS in forecasting hurricane...

  5. Source: researchgate.net
    Title: 281765164 Distilling the Wisdom of Crowds Prediction Markets vs Prediction Polls
    Link: https://www.researchgate.net/publication/281765164_Distilling_the_Wisdom_of_Crowds_Prediction_Markets_vs_Prediction_Polls
    Source snippet

    (PDF) Distilling the Wisdom of Crowds: Prediction Markets...9 Feb 2026 — We report the results of the first large-scale, long-term, expe...

  6. Source: researchgate.net
    Title: 384502750 ForecastBench A Dynamic Benchmark of AI Forecasting Capabilities
    Link: https://www.researchgate.net/publication/384502750_ForecastBench_A_Dynamic_Benchmark_of_AI_Forecasting_Capabilities
    Source snippet

    A Dynamic Benchmark of AI Forecasting Capabilities30 Sept 2024 — To address this gap, we introduce ForecastBench: a dynamic benchmark tha...

  7. Source: pmc.ncbi.nlm.nih.gov
    Title: PMCWisdom of the silicon crowd: LLM ensemble prediction
    Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC11800985/
    Source snippet

    by P Schoenegger · 2024 · Cited by 97 — Our findings suggest that LLM predictions can rival the human crowd's forecasting accuracy thr...

  8. Source: alignmentforum.org
    Title: approaching human level forecasting with language models 2
    Link: https://www.alignmentforum.org/posts/K2F9g2aQubd7kwEr3/approaching-human-level-forecasting-with-language-models-2
    Source snippet

    Approaching Human-Level Forecasting with Language...29 Feb 2024 — We develop a retrieval-augmented LM system designed to automatically s...

  9. Source: lesswrong.com
    Title: approaching human level forecasting with language models 2
    Link: https://www.lesswrong.com/posts/K2F9g2aQubd7kwEr3/approaching-human-level-forecasting-with-language-models-2
    Source snippet

    Approaching Human-Level Forecasting with Language...29 Feb 2024 — We develop a retrieval-augmented LM system designed to automatically s...

  10. Source: researchgate.net
    Title: (PDF) Superforecasting: The Art and Science of Prediction
    Link: https://www.researchgate.net/publication/304924623_Superforecasting_The_Art_and_Science_of_Prediction_By_Philip_Tetlock_and_Dan_Gardner
    Source snippet

    5 Jul 2016 — This is an excellent book to read. It is not only informative, as it should be for a book on forecasting, but it is highly e...
