Methodology May 28, 2025 Principle Research Team

Forecasting Under Deep Uncertainty: Probability Calibration in Geopolitical Simulation

[Figure: abstract probability distribution fan chart showing widening uncertainty over time]

Deep uncertainty — conditions where neither the relevant probability distributions nor the causal structure of outcomes is known — presents a fundamental challenge for probabilistic geopolitical scenario analysis. This paper examines the specific challenge of probability calibration in geopolitical simulation under deep uncertainty, reviews available calibration methods from the forecasting and decision analysis literature, proposes a layered framework for selecting methods based on domain uncertainty depth, and addresses the epistemic-honesty obligations that arise when calibrated outputs are communicated to decision-makers.

Background: Three Distinct Epistemic Conditions

A necessary preliminary distinction clarifies the problem addressed here. Knight's classic separation between risk and uncertainty [1] establishes the first relevant boundary: risk applies when outcome probabilities are known or estimable from historical data; uncertainty applies when probabilities are not known but the causal structure of outcomes is understood. Geopolitical scenario analysis operates in both conditions, but is most analytically challenging when it encounters what Walker and colleagues [2] have termed "deep uncertainty" — a third condition in which neither the probability distributions nor the causal structure of outcomes is sufficiently well-characterized for standard probabilistic reasoning to apply.

The deep uncertainty condition arises in geopolitical contexts more frequently than in domains with richer historical base rates. Situations involving novel actor configurations, historically unprecedented triggering conditions, or rapid structural changes in the international system cannot be reliably mapped to historical reference classes, and that mapping is the foundational technique for moving from observed frequencies to probability estimates. The COVID-19 pandemic's geopolitical implications, the emergence of new autonomous weapons capabilities, and the rapid reorientation of supply-chain dependencies in response to political risk all represent deep uncertainty conditions for which historical base rates provide inadequate calibration guidance.

The practical implication for scenario analysis is substantial. Applying standard Bayesian probability assignment to deeply uncertain geopolitical scenarios produces outputs that carry false precision. A scenario probability of 0.35 implies a calibration standard: over many iterations of genuinely similar events, outcomes assessed at 0.35 should occur approximately 35% of the time. In deeply uncertain geopolitical domains, this correspondence cannot be established, yet numerical probability assignments are required for scenario outputs to be analytically actionable — decision-makers and risk models cannot work with purely qualitative scenario descriptions. The methodological challenge is to provide usable probability estimates while accurately communicating their epistemic status.
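The calibration standard described above can be made concrete with a reliability table. The following is a minimal sketch, assuming a record of past forecasts paired with binary outcomes is available; the function name `reliability_table` and the bin width are illustrative choices, not part of any established library.

```python
from collections import defaultdict


def reliability_table(forecasts, outcomes, bin_width=0.1):
    """Compare mean forecast probability with observed frequency per bin.

    forecasts: probabilities in [0, 1]
    outcomes:  0/1 flags, 1 meaning the forecast event occurred
    Returns a list of (mean_forecast, observed_frequency, count) tuples.
    """
    n_bins = round(1 / bin_width)
    bins = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        # Clamp so that p == 1.0 falls into the top bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    table = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        freq = sum(y for _, y in pairs) / len(pairs)
        table.append((round(mean_p, 3), round(freq, 3), len(pairs)))
    return table


# Hypothetical record: 20 events assessed at 0.35, of which 7 occurred.
rows = reliability_table([0.35] * 20, [1] * 7 + [0] * 13)
for mean_p, freq, n in rows:
    print(f"forecast {mean_p:.2f}  observed {freq:.2f}  n={n}")
```

A well-calibrated record shows observed frequencies close to the mean forecast in each bin. Under deep uncertainty the table cannot be populated with genuinely similar events, which is precisely the difficulty described above.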

Method: Calibration Approaches and Their Trade-offs

Four calibration approaches from the forecasting and decision analysis literature address the deep uncertainty problem, each with distinct trade-offs between rigor and tractability:

Reference class forecasting (RCF). Developed by Kahneman and Lovallo [3] and systematized by Flyvbjerg [4], RCF estimates probabilities for a specific situation by anchoring to the base rate of a reference class of historically similar situations. RCF trades specificity for calibration: the resulting estimates are better calibrated against observed historical frequencies than estimates derived from analyzing the specific case in detail, but they may not adequately reflect characteristics of the current situation that are materially different from the historical reference class. The validity of RCF depends critically on the appropriateness of the reference class selection — a methodological judgment that is often the most contestable step in the process.
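As an illustration of the anchoring step, the sketch below computes a base rate from a reference class and attaches a Wilson score interval to it. The use of a Wilson interval is an assumption made here for illustration; the cited RCF literature does not prescribe a specific interval method, and the function name and counts are hypothetical.

```python
import math


def base_rate_with_interval(events, cases, z=1.96):
    """Base rate of an outcome in a reference class, with a Wilson score interval.

    events: number of reference-class cases in which the outcome occurred
    cases:  total number of cases in the reference class
    z:      normal quantile (1.96 corresponds to roughly 95% coverage)
    Returns (base_rate, lower_bound, upper_bound).
    """
    if cases <= 0 or not 0 <= events <= cases:
        raise ValueError("require cases > 0 and 0 <= events <= cases")
    p_hat = events / cases
    denom = 1 + z ** 2 / cases
    center = (p_hat + z ** 2 / (2 * cases)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / cases + z ** 2 / (4 * cases ** 2)) / denom
    return p_hat, max(0.0, center - half), min(1.0, center + half)


# Hypothetical reference class: the outcome occurred in 7 of 20 comparable cases.
rate, low, high = base_rate_with_interval(7, 20)
print(f"base rate {rate:.2f}, interval {low:.2f} to {high:.2f}")
```

With a reference class this small the interval is wide, which is a reminder that a valid reference class is necessary but not sufficient for a tight estimate. The interval also says nothing about whether the class was appropriately chosen.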

Structured expert elicitation (SEE). Expert probability estimates are elicited using structured protocols designed to reduce the availability, anchoring, and overconfidence biases that systematically distort unstructured expert judgment. The Good Judgment Project research program, culminating in Tetlock and Gardner's account of "superforecasting" [5], provides the most rigorous empirical documentation of what structured elicitation protocols achieve: the best-performing forecasters in structured elicitation contexts are substantially better calibrated than domain experts providing unstructured estimates, with Brier scores approaching those of prediction markets in domains where the latter are available. SEE requires access to appropriately skilled elicitation participants and structured facilitation; it is resource-intensive but produces the most directly defensible probability estimates in domains where reference class data is unavailable.
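The Brier score mentioned above is the mean squared difference between forecast probabilities and binary outcomes; lower is better. A minimal sketch follows, with a hypothetical function name and invented forecast records used only for illustration.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    0.0 is a perfect score; always forecasting 0.5 scores 0.25.
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical comparison on the same four resolved questions.
outcomes = [1, 0, 0, 1]
hedger = [0.5, 0.5, 0.5, 0.5]       # uninformative forecaster
discriminating = [0.8, 0.2, 0.3, 0.7]  # forecaster who separates the cases

print(brier_score(hedger, outcomes))
print(brier_score(discriminating, outcomes))
```

The score rewards both calibration and discrimination, so it can only be computed once questions have resolved. This limits its use for the rare, slow-resolving events typical of deep uncertainty.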

Interval probability representation. Rather than assigning point probabilities, scenarios are assigned probability intervals that explicitly represent estimation uncertainty. A scenario assessed as having probability between 0.20 and 0.45 is analytically more honest than one assigned 0.32 — the interval representation communicates that the estimate reflects genuine uncertainty rather than spurious precision. Interval representation is more tractable than SEE but less precise; it is appropriate as a default for all scenario outputs regardless of the calibration method applied, to communicate the uncertainty surrounding the point estimates.
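One way to make intervals the default representation is to store them as a small validated type rather than as a pair of loose numbers. The class below is a sketch; the name `ProbabilityInterval` and its methods are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbabilityInterval:
    """A probability expressed as a range rather than a point value."""
    low: float
    high: float

    def __post_init__(self):
        if not 0.0 <= self.low <= self.high <= 1.0:
            raise ValueError("require 0 <= low <= high <= 1")

    @property
    def width(self):
        """Width of the interval; larger means more estimation uncertainty."""
        return self.high - self.low

    def contains(self, p):
        """True if the point probability p lies inside the interval."""
        return self.low <= p <= self.high

    def __str__(self):
        return f"{self.low:.2f}-{self.high:.2f}"


# The example from the text: an interval of 0.20 to 0.45 versus a point of 0.32.
interval = ProbabilityInterval(0.20, 0.45)
print(interval, "contains 0.32:", interval.contains(0.32))
```

Keeping the interval as the stored value, and deriving any point summary from it only on request, makes it harder for downstream consumers to discard the uncertainty by accident.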

Verbal probability tiers. For deeply uncertain domains where even a numerical interval would carry misleading precision, scenarios are instead assigned to ordered probability tiers labeled with standardized verbal expressions. The U.S. Intelligence Community's standardized verbal probability scale runs from "almost no chance" or "remote" (approximately 1–5% probability) through "very unlikely," "unlikely," "roughly even chance," "likely," and "very likely" to "almost certain" (approximately 95–99%); it provides a vocabulary calibrated to intelligence analysis practice and widely understood by policy consumers [6]. Verbal tiers sacrifice numerical precision in exchange for epistemic honesty about the limits of calibration in deeply uncertain domains.

Findings

Finding 1: The appropriate calibration method depends on domain uncertainty depth, not on analytical preference. Applying RCF to domains where no valid reference class exists does not produce better-calibrated estimates than expert judgment; it produces falsely precise estimates anchored to an inappropriate reference class. Similarly, applying verbal probability tiers to domains with adequate historical base rates discards usable calibration information. The framework proposed here selects calibration methods based on an explicit assessment of domain uncertainty depth: RCF where reference classes are valid; SEE where they are not and elicitation participants are available; interval representation as a default overlay on all numerical estimates; and verbal tiers for deeply uncertain domains lacking both valid reference classes and elicitation participants.
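The selection rule in Finding 1 can be written as a short decision function. This is a sketch of the rule as stated, with hypothetical names; in practice the two boolean inputs are themselves contestable analytical judgments rather than simple flags.

```python
def select_calibration_method(has_valid_reference_class, has_elicitation_panel):
    """Choose a calibration method from an assessment of uncertainty depth.

    Returns a dict naming the primary method and whether an interval
    overlay applies (it applies to every numerical estimate).
    """
    if has_valid_reference_class:
        # Historical base rates are usable: anchor on them.
        return {"method": "RCF", "interval_overlay": True}
    if has_elicitation_panel:
        # No valid reference class, but skilled participants are available.
        return {"method": "SEE", "interval_overlay": True}
    # Neither is available: fall back to ordered verbal tiers,
    # which carry no numerical estimate to overlay.
    return {"method": "verbal tiers", "interval_overlay": False}


print(select_calibration_method(True, True))
print(select_calibration_method(False, True))
print(select_calibration_method(False, False))
```

The ordering of the checks encodes the framework's preference: a valid reference class takes precedence even when an elicitation panel is also available.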

Finding 2: Calibration metadata is analytically necessary, not merely supplementary. Scenarios assigned probabilities without documenting the calibration basis cannot be correctly interpreted by analytical consumers. A probability of 0.28 derived from a well-validated reference class analysis carries substantially different epistemic status than a probability of 0.28 derived from structured expert elicitation under deep uncertainty — but both present identically to an analyst reading only the number. Systems that present all probability estimates with equivalent apparent precision are producing analytically misleading outputs regardless of the quality of any individual estimate. Calibration metadata — calibration method, reference class used where applicable, confidence level of the calibration itself — should be a required component of scenario output, not optional documentation.
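A scenario output record that treats calibration metadata as required might look like the sketch below. The class, field names, and example values are hypothetical; the fields mirror the three metadata items listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScenarioEstimate:
    """A scenario probability that cannot be created without its calibration basis."""
    scenario: str
    probability: float
    method: str                      # e.g. "RCF" or "SEE"
    calibration_confidence: str      # e.g. "high", "moderate", "low"
    reference_class: Optional[str] = None

    def __post_init__(self):
        if not 0.0 <= self.probability <= 1.0:
            raise ValueError("probability must lie in [0, 1]")
        if self.method == "RCF" and not self.reference_class:
            raise ValueError("RCF estimates must name their reference class")

    def render(self):
        """Render the number together with its calibration basis."""
        basis = self.method
        if self.reference_class:
            basis += f", reference class: {self.reference_class}"
        return (f"{self.scenario}: {self.probability:.2f} "
                f"[{basis}; calibration confidence: {self.calibration_confidence}]")


# Two hypothetical estimates with the same number but different epistemic status.
a = ScenarioEstimate("Scenario A", 0.28, "RCF", "high",
                     reference_class="hypothetical reference class")
b = ScenarioEstimate("Scenario B", 0.28, "SEE", "low")
print(a.render())
print(b.render())
```

Because the metadata fields have no defaults, an estimate without a stated method and calibration confidence fails at construction time rather than circulating as a bare number.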

Finding 3: Point probability presentation systematically increases overconfidence in decision-makers relative to interval presentation. Experimental findings in judgment and decision-making research consistently demonstrate that point probability presentations produce higher decision-maker confidence in scenario outcomes than interval presentations of the same scenarios — even when the intervals are explicitly described as representing estimation uncertainty and the point estimates are explicitly described as midpoints [5]. This has a practical implication for scenario communication: analysts presenting point probabilities to decision-makers should assume that the precision of the number will be read as a precision claim regardless of accompanying caveats, and should present intervals as the primary representation where decision-maker calibration matters.

Finding 4: Verbal probability tiers are less susceptible to false precision but more susceptible to interpretation variability. The Intelligence Community verbal probability scale reduces the false precision problem but introduces a different challenge: readers interpret verbal probability expressions differently, and research documents substantial variance across reader populations in the numerical probabilities associated with terms like "likely" and "possibly" [6]. Scenario outputs using verbal tiers should therefore state the numerical range associated with each term, anchoring the verbal expression to a specific quantitative range for each analytical context.
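Anchoring each verbal term to its range can be automated so the two never appear separately. The sketch below assumes the tier boundaries published in the ICD 203 expressions-of-likelihood table; the boundaries should be checked against the directive before use, and the function name and boundary-handling rule (shared boundaries go to the higher tier) are choices made here for illustration.

```python
# (lower bound, upper bound, term), assumed from the ICD 203 likelihood table.
TIERS = [
    (0.01, 0.05, "almost no chance"),
    (0.05, 0.20, "very unlikely"),
    (0.20, 0.45, "unlikely"),
    (0.45, 0.55, "roughly even chance"),
    (0.55, 0.80, "likely"),
    (0.80, 0.95, "very likely"),
    (0.95, 0.99, "almost certain"),
]


def verbal_tier(p):
    """Return the verbal term for probability p with its numerical range attached."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    for low, high, term in TIERS:
        if p < high:
            return f"{term} ({low:.0%}-{high:.0%})"
    # Values at or above the top boundary fall into the highest tier.
    low, high, term = TIERS[-1]
    return f"{term} ({low:.0%}-{high:.0%})"


print(verbal_tier(0.60))
print(verbal_tier(0.50))
```

Emitting the term and its range as a single string means a reader never sees "likely" without the quantitative anchor, which addresses the interpretation variance described above.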

Implications

For Analysts

The calibration framework presented here imposes an upfront discipline on scenario construction: before assigning probabilities, the analyst must assess the domain's uncertainty depth and select the appropriate calibration method. This discipline is not a mere methodological formality. It regularly reveals that the analyst's intended calibration approach is not valid for the domain, which is itself analytically important information. An analyst who discovers that no valid reference class exists for their scenario domain has learned something about the epistemic status of their analysis that should change how they present conclusions.

For Risk Teams

Risk models that consume geopolitical scenario probabilities as inputs should propagate calibration uncertainty through the model rather than treating scenario probabilities as fixed parameters. A risk model incorporating a scenario probability of 0.28 with a calibration interval of 0.15–0.45 should produce meaningfully different output distributions than one treating 0.28 as a precise input — and the difference in output distributions is an accurate representation of the epistemic uncertainty in the underlying analysis. Risk teams that demand point-probability inputs are implicitly requesting false precision that the scenario analysis process cannot legitimately provide.
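Propagating calibration uncertainty can be sketched with a simple Monte Carlo loop. The example below uses the figures from the paragraph above (0.28 with an interval of 0.15 to 0.45) and assumes a triangular distribution over the interval; that distributional shape, the toy expected-loss model, and the loss figure are all assumptions made for illustration.

```python
import random


def simulate_expected_losses(p_low, p_mode, p_high, loss_if_event,
                             n_draws=10_000, seed=0):
    """Draw scenario probabilities from the calibration interval and
    return the sorted expected loss for each draw."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Sample a plausible scenario probability instead of fixing it.
        p = rng.triangular(p_low, p_high, p_mode)
        draws.append(p * loss_if_event)
    draws.sort()
    return draws


def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list, q in [0, 1]."""
    idx = min(int(q * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


loss = 100.0  # hypothetical loss if the scenario occurs
fixed = 0.28 * loss
draws = simulate_expected_losses(0.15, 0.28, 0.45, loss)
print(f"point input:    expected loss {fixed:.1f}")
print(f"interval input: 5th to 95th percentile "
      f"{percentile(draws, 0.05):.1f} to {percentile(draws, 0.95):.1f}")
```

The point input yields a single expected loss, while the interval input yields a range of expected losses. That range is the model-level expression of the calibration uncertainty in the scenario estimate.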

For Policy Planners

The decision-relevance of geopolitical scenario probabilities for policy planning depends on the sensitivity of the policy decision to the probability value. Scenario probability estimates that are too uncertain to distinguish between "roughly even chance" and "likely" cannot support policy decisions that require a precise probability threshold to be crossed before action is warranted. Policy planners should assess explicitly whether their decision is sensitive to probability precision before requiring numerical estimates from their analytical teams — and should consider whether verbal probability tiers provide adequate precision for decisions that require only ordinal probability rankings rather than cardinal estimates.
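The sensitivity check described here reduces to asking whether the action threshold falls inside the probability interval. A minimal sketch, with a hypothetical function name and illustrative thresholds:

```python
def threshold_decision(low, high, threshold):
    """Classify a threshold-based decision given a probability interval.

    Returns "act" if the whole interval is at or above the threshold,
    "do not act" if the whole interval is below it, and "indeterminate"
    if the threshold falls inside the interval.
    """
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("require 0 <= low <= high <= 1")
    if low >= threshold:
        return "act"
    if high < threshold:
        return "do not act"
    return "indeterminate"


# With an interval of 0.15 to 0.45, only some thresholds can be resolved.
for t in (0.10, 0.30, 0.60):
    print(t, threshold_decision(0.15, 0.45, t))
```

An "indeterminate" result signals that the available estimate cannot support the decision as framed, so either the threshold rule or the demand for a numerical estimate needs to be revisited.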

Limitations and Known Constraints

The calibration framework described here addresses the epistemic challenges of probability assignment in geopolitical scenario analysis; it does not resolve them. Reference class forecasting requires a valid reference class, which may not exist for novel configurations. Structured expert elicitation improves on unstructured judgment but cannot overcome the fundamental knowledge limitations that produce deep uncertainty in the first place. Interval representation and verbal probability tiers communicate uncertainty honestly but at the cost of analytical precision that decision-makers may require.

We are not arguing that any of these methods produces calibrated probability estimates in the actuarial sense for deeply uncertain geopolitical domains. The appropriate standard is not perfect calibration but honest communication of epistemic status: outputs that accurately represent what is and is not known about the probability structure of the scenario space, and that give analytical consumers the information they need to correctly weight those outputs in their decision processes.

A further limitation is that this framework focuses on single-scenario probability assignment. The joint probability structure of scenario trees — the correlations between scenario branches that arise from shared causal mechanisms — introduces additional calibration challenges not addressed here. A comprehensive treatment of probability calibration in geopolitical simulation would need to address portfolio-level calibration as well as individual scenario calibration, which remains an active area of methodological development.

References

[1] Knight, F. H. (1921). Risk, Uncertainty, and Profit. Houghton Mifflin.
[2] Walker, W. E., Harremoës, P., Rotmans, J., van der Sluijs, J. P., van Asselt, M. B. A., Janssen, P., & Krayer von Krauss, M. P. (2003). Defining uncertainty: A conceptual basis for uncertainty management in model-based decision support. Integrated Assessment, 4(1), 5–17.
[3] Kahneman, D., & Lovallo, D. (1993). Timid choices and bold forecasts: A cognitive perspective on risk taking. Management Science, 39(1), 17–31.
[4] Flyvbjerg, B. (2008). Curbing optimism bias and strategic misrepresentation in planning: Reference class forecasting in practice. European Planning Studies, 16(1), 3–21.
[5] Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.
[6] Office of the Director of National Intelligence. (2015). Intelligence Community Directive 203: Analytic Standards. ODNI.