LLM-assisted wargaming is a qualitatively different capability from classical game-theoretic wargaming, with correspondingly different failure modes. This paper examines the theoretical basis for this distinction, identifies the Structured Analytic Techniques (SATs) most effective for validating LLM wargame outputs against expert judgment, and proposes a structured validation protocol for organizations deploying LLM-assisted wargaming in policy or strategic planning contexts. Particular attention is given to the failure modes that SAT protocols must address and to the limitations of the validation approach itself.
Background and Context
Wargaming as a structured analytical practice has served military and strategic planning functions since at least the Prussian Kriegsspiel of the early nineteenth century [1]. Modern applications span military operational planning, corporate strategy stress-testing, and — with increasing frequency in the post-2010 period — policy analysis in intelligence and diplomatic contexts. The core function across all applications is consistent: wargaming forces explicit reasoning about adversary decision-making under conditions of uncertainty and incomplete information.
Classical wargaming relies on structured rule systems, defined actor move sets, and referee judgment to manage the simulation space. Game-theoretic formalizations — Nash equilibrium analysis, extensive-form games, Bayesian games with private information — add mathematical rigor to this framework but introduce correspondingly strong assumptions: rational actors, common knowledge of the game structure, and move sets that can be fully enumerated in advance [2]. These assumptions hold poorly in most real geopolitical settings, where actor rationality is bounded by organizational culture and political constraints, game structure is contested rather than commonly known, and the move space expands dynamically as actors learn from each other's actions.
LLM-assisted wargaming offers a different trade-off profile. Rather than imposing formal rationality and exhaustive move enumeration, an LLM-based approach simulates actor behavior by drawing on the model's representation of how actors with specified characteristics and constraints have historically made decisions. This produces higher behavioral realism — capturing the political costs, reputational considerations, and coalition management dynamics that formal models struggle to represent — at the cost of the formal guarantees that game-theoretic models provide. The practical question for analytical teams is how to extract the benefits of LLM behavioral modeling while managing the specific failure modes that LLM-based simulation introduces.
Method: Comparative Framework and SAT Selection
The analysis compares LLM-assisted and classical wargaming along three dimensions: actor behavioral modeling fidelity, move space coverage, and outcome probability calibration. On each dimension, both approaches are evaluated against the requirements of policy-oriented wargames, that is, wargames conducted to inform decisions that will be subject to institutional review, not merely to build shared understanding within a team.
The SAT framework applied here draws on three techniques from the Heuer-Pherson taxonomy, each selected because it targets a specific weakness of LLM-assisted wargaming:
Indicators and Warning (I&W) Analysis. Applied to move detection in LLM-generated sequences, I&W analysis requires that each actor move in a generated sequence be associated with identifiable leading indicators that would precede the move in a real-world context. This constraint filters out moves that are narratively plausible but operationally implausible — the category of LLM generation most likely to mislead decision-makers who read wargame outputs as operational intelligence.
Multiple Scenario Generation (MSG). Applied to outcome enumeration, MSG requires that the wargame produce multiple distinct outcome clusters rather than converging on a dominant narrative. In classical wargaming, multiple outcomes emerge naturally from the branching structure of the game tree; in LLM-assisted wargaming, narrative convergence pressure can cause the model to collapse structural outcome diversity while preserving surface-level textual variation. MSG constraints counteract this by requiring structural distinction between outcomes, not just textual variation.
Quality of Information Check (QIC). Applied to the actor profiles and behavioral models that anchor LLM simulation, QIC evaluates the reliability and currency of the information on which the actor model is based. Actor models constructed from outdated or selectively sourced information will produce systematically biased simulation outputs regardless of the quality of the generation process itself. QIC forces explicit documentation of the information basis for each actor model, enabling reviewers to identify and challenge the weakest actor characterizations before the simulation proceeds.
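The three checks described above can be sketched in code. What follows is a minimal illustration under stated assumptions, not the protocol's implementation: the `Move`, `Outcome`, and `SourceRecord` structures, their field names, and the 365-day currency threshold are all hypothetical and introduced only for this example. In practice, deciding whether a leading indicator is credible, or whether two outcomes are structurally distinct, requires analyst judgment that these functions do not replace.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Move:
    actor: str
    description: str
    # Observable precursors an analyst has attached to this move (hypothetical field).
    leading_indicators: list = field(default_factory=list)


@dataclass
class Outcome:
    label: str      # free-text narrative label produced by the model
    features: dict  # analyst-coded structural features of the end state


@dataclass
class SourceRecord:
    claim: str      # the actor characterization being supported
    source: str
    as_of: date     # date the underlying information reflects


def iw_filter(moves):
    """I&W check: separate moves with at least one leading indicator from those with none."""
    kept = [m for m in moves if m.leading_indicators]
    flagged = [m for m in moves if not m.leading_indicators]
    return kept, flagged


def structural_clusters(outcomes):
    """MSG check: group outcomes by coded structural features, ignoring narrative wording."""
    clusters = {}
    for o in outcomes:
        key = frozenset(o.features.items())
        clusters.setdefault(key, []).append(o.label)
    return list(clusters.values())


def stale_sources(records, today, max_age_days=365):
    """QIC check: list source records older than an assumed currency threshold."""
    return [r for r in records if (today - r.as_of).days > max_age_days]
```

A review team might treat a single cluster returned by `structural_clusters` as a sign of narrative convergence, and pass the `flagged` moves and stale records to analysts for challenge rather than discarding them automatically.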
Findings
Finding 1: LLM-assisted wargaming demonstrates consistent advantages in behavioral realism over classical approaches. In domains where actor decision constraints are difficult to formalize, including political coalition management, reputational risk calculation, and communication dynamics under uncertainty, LLM-based actor simulation produced outputs that reviewer teams rated as more behaviorally realistic than formal game-theoretic alternatives. The advantage was most pronounced in scenarios involving actors facing significant domestic political constraints, where the interaction between external strategic options and internal political costs is difficult to represent formally but is well represented in LLM training data.
Finding 2: LLM-assisted wargaming exhibits systematic weaknesses in move probability calibration. Across all tested domains, LLM-generated actor move sequences were overconfident in the probability of dominant move choices and underrepresented lower-probability but strategically significant alternatives. This finding is consistent with the general literature on LLM confidence calibration: the model's probability assignments reflect training data frequencies rather than the structured calibration required for analytical use. The I&W constraint partially addresses this by filtering implausible moves, but it does not substitute for formal probability calibration of move sequences.
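One way to make the overconfidence described in this finding measurable is to score model-assigned probabilities against reference judgments. The sketch below is illustrative only: it assumes that each dominant move has been paired with a binary reference label (for example, whether an expert panel judged that move to be the one the actor would actually take), which is an assumption of this example rather than a procedure the paper specifies. It computes the standard Brier score and a simple gap between mean stated confidence and observed hit rate.

```python
def brier_score(forecasts):
    """Mean squared error between stated probability and the 0/1 reference label."""
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)


def overconfidence_gap(forecasts):
    """Mean stated probability minus observed hit rate; positive means overconfident."""
    mean_confidence = sum(p for p, _ in forecasts) / len(forecasts)
    hit_rate = sum(o for _, o in forecasts) / len(forecasts)
    return mean_confidence - hit_rate


# Hypothetical data: (probability the model gave its dominant move, reference label).
example = [(0.9, 1), (0.9, 0), (0.8, 1), (0.8, 0)]
print(round(brier_score(example), 3))
print(round(overconfidence_gap(example), 3))
```

A positive gap on a sufficiently large set of moves would be consistent with the pattern reported here; a small sample such as the one above is only a demonstration of the arithmetic.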
Finding 3: The QIC step identifies high-impact actor model weaknesses that would otherwise propagate undetected through the simulation. In multiple wargame exercises, QIC review of actor models prior to simulation identified actor characterizations based on information that was either outdated (reflecting the actor's strategic posture from a prior period) or selectively sourced (reflecting a particular analytical perspective rather than the balance of available evidence). These weaknesses, if uncorrected, would have produced simulation outputs that reflected the weaknesses of the actor model rather than the dynamics of the scenario under analysis.
Finding 4: Asymmetric information scenarios are a specific strength of LLM-assisted wargaming without a direct classical equivalent. Scenarios requiring actors to operate on systematically different information, where each actor has partial knowledge of the situation that the others lack, are difficult to manage in classical wargaming and require significant referee infrastructure. LLM-assisted simulation handles this configuration more tractably by simulating each actor's decision process from its specific information state rather than from a common game structure. This capability is analytically significant for scenarios involving intelligence operations, diplomatic signaling, and crisis communication dynamics.
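The idea of simulating each actor from its own information state can be illustrated with a small sketch. It assumes a hypothetical ground-truth fact list in which every fact is tagged with the actors who know it; the `Fact` structure, the `actor_view` and `build_actor_prompt` helpers, and the prompt wording are all invented for this example, and no particular model API is implied. The point is only that each actor's prompt is built from a filtered view, so private information does not leak between simulated actors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    text: str
    known_to: frozenset  # names of the actors who hold this information


def actor_view(facts, actor):
    """Return only the facts this actor is assumed to know."""
    return [f.text for f in facts if actor in f.known_to]


def build_actor_prompt(actor, facts):
    """Assemble a prompt from the actor's partial view (wording is illustrative)."""
    lines = [f"You are simulating {actor}. You know only the following:"]
    lines += [f"- {text}" for text in actor_view(facts, actor)]
    lines.append("Decide your next move using only this information.")
    return "\n".join(lines)


# Hypothetical scenario state with one shared fact and one private fact.
facts = [
    Fact("Border talks resume next week.", frozenset({"State A", "State B"})),
    Fact("State A has intercepted State B's negotiating brief.", frozenset({"State A"})),
]
print(build_actor_prompt("State B", facts))
```

Keeping the ground truth in one place and deriving each view from it also gives a referee or reviewer a single record of who knew what at each step.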
Implications
For Analysts
The behavioral realism advantage of LLM-assisted wargaming is most valuable in analytical contexts where understanding actor decision constraints is more important than formal probability guarantees — policy analysis, diplomatic assessment, and crisis simulation. Analysts should match the wargaming approach to the analytical question: where formal probability bounds are required, classical approaches retain advantages; where behavioral modeling of constrained actors is the primary analytical need, LLM-assisted approaches with SAT validation are appropriate.
For Risk Teams
Risk teams using wargame outputs as inputs to formal risk models should apply explicit calibration protocols to LLM-generated move probability estimates before using them in quantitative analysis. LLM-generated probabilities are starting points for structured calibration, not calibrated estimates in themselves. Consuming them as calibrated estimates will introduce systematic overconfidence into any downstream quantitative model.
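As one hedged illustration of what a calibration step could look like, the sketch below flattens an overconfident move distribution by raising each probability to a power `alpha` and renormalizing, which is equivalent to temperature scaling of the log-probabilities with temperature `1/alpha`. This is a generic transform swapped in for illustration, not the calibration protocol this paper has in mind, and the value of `alpha` is a hypothetical parameter that would have to be fitted against reference judgments rather than chosen by hand.

```python
def temper(probs, alpha):
    """Flatten (alpha < 1) or sharpen (alpha > 1) a probability distribution.

    alpha = 1 leaves the distribution unchanged; alpha = 0 yields a uniform
    distribution over all listed moves.
    """
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    if any(p < 0 for p in probs) or sum(probs) <= 0:
        raise ValueError("probs must be non-negative with a positive sum")
    powered = [p ** alpha for p in probs]
    total = sum(powered)
    return [x / total for x in powered]


# Hypothetical LLM-assigned probabilities for three candidate moves.
raw = [0.8, 0.1, 0.1]
adjusted = temper(raw, alpha=0.5)  # alpha here is an assumed, unfitted value
print([round(p, 3) for p in adjusted])
```

With `alpha` below 1, probability mass moves from the dominant move toward the alternatives, which is the direction of correction the overconfidence finding suggests; how far to move it is an empirical question that the transform itself cannot answer.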
For Policy Planners
The defensibility of wargame-based policy recommendations depends substantially on the actor model documentation that underlies the simulation. Policy teams should require QIC documentation as a wargame deliverable — not because QIC eliminates actor model uncertainty, but because it makes the uncertainty explicit and auditable. Undocumented actor models cannot be challenged by reviewers who did not participate in their construction; documented models can be reviewed, updated, and improved as new information becomes available.
Limitations and Known Constraints
The validation protocol described in this paper addresses the specific failure modes identified in controlled wargame exercises. It does not address failure modes that were not identified in those exercises — including failure modes that may be systematically difficult to detect precisely because LLM outputs are coherent and authoritative in register. The coherent-narrative problem in LLM wargaming is not fully resolved by SAT validation; it is moderated by it.
We are not arguing that LLM-assisted wargaming with SAT validation is a substitute for human expert wargaming in high-stakes policy contexts. The appropriate comparison is between LLM-assisted wargaming with SAT validation and the alternatives that organizations actually use in resource-constrained environments, which often amount to informal scenario discussion without formal structure rather than full expert wargames. Against that baseline, the SAT-validated LLM approach offers substantial analytical improvements. Against an idealized expert wargame with deep domain expertise and adequate time, the comparison is less favorable and depends heavily on domain-specific factors.
References
[1] Perla, P. P. (1990). The Art of Wargaming: A Guide for Professionals and Hobbyists. Naval Institute Press.
[2] Fudenberg, D., & Tirole, J. (1991). Game Theory. MIT Press.
[3] Heuer, R. J., & Pherson, R. H. (2014). Structured Analytic Techniques for Intelligence Analysis (2nd ed.). CQ Press.
[4] Schwartz, P., & Ogilvy, J. (1998). Plotting your scenarios. In L. Fahey & R. Randall (Eds.), Learning from the Future. Wiley.
[5] RAND Corporation. (2021). Wargaming for Policy Analysis. RAND National Security Research Division.
[6] Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.