5,000 Iterations: Why Monte Carlo Confirms the Conclusions

TLDR

A Monte Carlo simulation ran 5,000 iterations with eight Beta-distributed parameters to test whether the project's conclusions change under parameter uncertainty (PAPER TRAIL Project, 2026a). All three ACH hypotheses maintained their rank-one positions with 100% probability across every iteration, and zero leads were fragile enough to flip classification (PAPER TRAIL Project, 2026b).


What Monte Carlo Tests

Monte Carlo simulation is a method for testing whether conclusions survive uncertainty. Instead of running an analysis once with fixed parameters, Monte Carlo runs it thousands of times with parameters drawn randomly from probability distributions that represent how uncertain we are about each value. If the conclusion holds across all those random draws, it is robust. If it flips, it is fragile.

The synthesis engine uses eight parameters that carry measurement uncertainty: OCR error rate, NER precision, NER recall, entity resolution F1, wire parsing accuracy, FedEx parsing accuracy, temporal alignment accuracy, and corpus completeness (PAPER TRAIL Project, 2026a). Each parameter was modeled as a Beta distribution -- a probability distribution bounded between 0 and 1, appropriate for rates and proportions -- with shape parameters derived from measured performance on validation samples.
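As a sketch of how such draws might be implemented (the validation counts below are hypothetical placeholders, not the project's measured figures):

```python
import random

# Hypothetical (correct, total) validation counts per parameter --
# illustrative placeholders, not the project's measured values.
VALIDATION_COUNTS = {
    "ocr_accuracy": (488, 500),
    "ner_precision": (430, 470),
    "ner_recall": (410, 480),
    "entity_resolution_f1": (390, 450),
    "wire_parsing_accuracy": (95, 100),
    "fedex_parsing_accuracy": (97, 100),
    "temporal_alignment_accuracy": (88, 100),
    "corpus_completeness": (637, 1000),
}

def draw_parameters(rng: random.Random) -> dict[str, float]:
    """Draw one parameter set from Beta distributions.

    Shapes follow the standard Beta-Binomial update with a uniform
    prior: alpha = successes + 1, beta = failures + 1, so each draw
    stays in (0, 1) and concentrates around the measured rate.
    """
    return {
        name: rng.betavariate(successes + 1, total - successes + 1)
        for name, (successes, total) in VALIDATION_COUNTS.items()
    }

rng = random.Random(0)
iterations = [draw_parameters(rng) for _ in range(5_000)]
```

Each of the 5,000 iterations then feeds one such parameter set through the full analysis, so that every downstream score inherits the measurement uncertainty.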

ACH Stability

The three Analysis of Competing Hypotheses (ACH) matrices -- spoliation, willful blindness, and asset concealment -- were re-evaluated under each of the 5,000 parameter draws (PAPER TRAIL Project, 2026a).

Spoliation (FedEx Cessation). The spoliation hypothesis held rank one in 100% of iterations. The mean score was 5.20, with a 90% confidence interval of 5.04 to 5.36 (PAPER TRAIL Project, 2026a). No parameter draw produced a scenario where an alternative explanation -- innocent cessation, account closure, or data loss -- ranked higher than deliberate destruction or withholding.

Willful Blindness (Deutsche Bank Compliance). The willful blindness hypothesis held rank one in 100% of iterations, with a mean score of 6.40 and a 90% confidence interval of 6.21 to 6.58 (PAPER TRAIL Project, 2026a). This is the strongest of the three hypotheses. Even under the most adversarial parameter draws, the evidence for willful blindness outscored the alternatives by wide margins. Separate Bayesian Belief Network analysis confirmed that understaffing and industry norm hypotheses receive posterior probabilities below 0.001 (PAPER TRAIL Project, 2026c).

Asset Concealment (Butterfly Trust). The asset concealment hypothesis held rank one in 100% of iterations, with a mean score of 4.70 and a 90% confidence interval of 4.55 to 4.85 (PAPER TRAIL Project, 2026a). The legitimate estate planning alternative never approached the top rank.

The 100% rank stability across all three matrices means that no plausible combination of parameter errors can change which hypothesis the evidence favors. The scores shift slightly -- the confidence intervals show the range -- but the ordering never changes.
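The rank-stability statistics above can be computed from per-iteration hypothesis scores in a few lines; a minimal sketch, with illustrative score distributions rather than the project's real ones:

```python
import random

def rank_stability(scores: list[dict[str, float]]) -> dict[str, dict]:
    """For each hypothesis: the fraction of iterations in which it holds
    rank one, its mean score, and a 90% interval (5th-95th percentile)."""
    n = len(scores)
    summary = {}
    for name in scores[0]:
        vals = sorted(s[name] for s in scores)
        wins = sum(1 for s in scores if max(s, key=s.get) == name)
        summary[name] = {
            "rank_one_prob": wins / n,
            "mean": sum(vals) / n,
            "ci90": (vals[int(0.05 * (n - 1))], vals[int(0.95 * (n - 1))]),
        }
    return summary

# Illustrative only: one dominant hypothesis and two alternatives.
rng = random.Random(1)
iterations = [
    {"willful_blindness": rng.gauss(6.40, 0.11),
     "understaffing": rng.gauss(3.1, 0.2),
     "industry_norm": rng.gauss(2.8, 0.2)}
    for _ in range(5_000)
]
stats = rank_stability(iterations)
```

With a score gap this wide relative to the spread, the dominant hypothesis wins every draw, which is the shape of the 100% rank-one result reported above.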

Lead Robustness

Every evidence chain in the system was also re-evaluated under the 5,000 parameter draws (PAPER TRAIL Project, 2026b). The question: does any lead flip to a finding (above the 0.75 adjusted confidence threshold), or does any finding flip to a lead?

The answer is zero fragile leads. The mean number of findings across all iterations was 0.0 -- no parameter draw produced a scenario where any chain crossed the finding threshold (PAPER TRAIL Project, 2026b). This confirms that the system's classification of all chains as leads (not findings) is not an artifact of the specific parameter values chosen. Even optimistic assumptions about OCR quality, NER precision, and corpus completeness do not push any chain over the line.
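The fragility check itself is simple; a sketch, assuming each chain's adjusted confidence is recorded per iteration (the chain names and values below are made up):

```python
FINDING_THRESHOLD = 0.75

def fragile_chains(chain_confidences: dict[str, list[float]],
                   threshold: float = FINDING_THRESHOLD) -> list[str]:
    """A chain is fragile if its classification (lead vs. finding)
    differs between iterations -- i.e. its adjusted confidence crosses
    the threshold in some parameter draws but not in others."""
    fragile = []
    for chain, values in chain_confidences.items():
        above = [v >= threshold for v in values]
        if any(above) and not all(above):
            fragile.append(chain)
    return fragile

# Illustrative: one stable lead, one chain that straddles the threshold.
chains = {
    "stable_lead": [0.61, 0.63, 0.60, 0.64],
    "straddler": [0.74, 0.76, 0.73, 0.77],
}
```

A run reporting zero fragile leads means every chain stayed on the same side of the threshold in all 5,000 draws.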

Sensitivity Rankings

The error disclosure analysis identifies which parameters matter most. Corpus completeness -- the proportion of the total entity population captured in the corpus, estimated at 63.7% by the Chao1 species richness estimator -- is the most sensitive parameter, with a mean sensitivity of 0.197 across all 20 evidence chains (PAPER TRAIL Project, 2026d). NER recall (0.076) and entity resolution (0.066) rank second and third. OCR error (0.018) and FedEx parsing (0.011) are the least sensitive.
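One common way to derive such a ranking is the absolute Pearson correlation between each parameter's sampled values and the output across iterations; the project's exact measure lives in the error disclosure matrix, so the function below is an illustrative stand-in, not the project's implementation:

```python
def sensitivity(param_draws: list[float], outputs: list[float]) -> float:
    """Absolute Pearson correlation between a parameter's sampled values
    and the model output across iterations -- a simple, widely used
    sensitivity measure for Monte Carlo studies."""
    n = len(param_draws)
    mx = sum(param_draws) / n
    my = sum(outputs) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(param_draws, outputs))
    sx = sum((x - mx) ** 2 for x in param_draws) ** 0.5
    sy = sum((y - my) ** 2 for y in outputs) ** 0.5
    # Guard against zero variance (a constant parameter or output).
    return abs(cov / (sx * sy)) if sx and sy else 0.0
```

Parameters whose draws move the output most earn the highest scores, which is how corpus completeness ends up at the top of the ranking.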

This ranking has practical implications. Improving corpus completeness -- obtaining the estimated 468,000 missing entities through additional document releases or FOIA requests -- would do more to strengthen conclusions than any technical improvement to the OCR or NER pipeline. The Monte Carlo simulation quantifies what intuition suggests: the biggest limitation is not the tools; it is the data they operate on.

Why 5,000 Iterations

Standard Monte Carlo practice recommends enough iterations to stabilize the tails of the distribution. At 5,000 iterations, the 5th and 95th percentile estimates for each ACH score are stable to the second decimal place (PAPER TRAIL Project, 2026a). Doubling to 10,000 would narrow the confidence intervals marginally but would not change any conclusion. The computational cost is modest -- the full simulation completes in minutes -- so the choice of 5,000 reflects a balance between precision and the diminishing returns of additional iterations.
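The stability claim can be checked empirically by estimating the tail percentiles at increasing iteration counts and watching them converge; the distribution below is a stand-in with roughly the willful-blindness score's mean and spread, not the project's actual output:

```python
import random

def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile estimate (q in [0, 1])."""
    ordered = sorted(values)
    return ordered[min(int(q * len(ordered)), len(ordered) - 1)]

rng = random.Random(3)
# Stand-in score distribution: mean 6.40, spread chosen so the
# 90% interval is roughly the reported 6.21-6.58.
draws = [rng.gauss(6.40, 0.11) for _ in range(20_000)]

# Tail estimates tighten as the iteration count grows; by ~5,000
# draws, doubling changes the percentiles only marginally.
tails = {
    n: (percentile(draws[:n], 0.05), percentile(draws[:n], 0.95))
    for n in (1_000, 5_000, 10_000, 20_000)
}
```

The gap between the 5,000- and 10,000-draw tail estimates is small relative to the score scale, which is the diminishing-returns argument for stopping at 5,000.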


References

PAPER TRAIL Project. (2026a). Monte Carlo ACH robustness results [Data set]. _exports/synthesis/mc_ach_robustness.csv

PAPER TRAIL Project. (2026b). Monte Carlo lead robustness results [Data set]. _exports/synthesis/mc_lead_robustness.csv

PAPER TRAIL Project. (2026c). BBN posterior estimates [Data set]. _exports/synthesis/bbn_posteriors.csv

PAPER TRAIL Project. (2026d). Error disclosure matrix [Data set]. _exports/synthesis/error_disclosure_matrix.csv

PAPER TRAIL Project. (2026e). Cross-domain synthesis engine: Monte Carlo module [Software]. Script 25b, app/scripts/25_cross_domain_synthesis.py


This investigation is part of the SubThesis accountability journalism network.