TLDR
Data Set 10 contains approximately 950,000 page-level database records derived from 504,000 distinct multi-page PDFs, primarily Deutsche Bank financial documents. The multi-page structure creates a 446,891-record delta between disk files and database entries — an artifact of processing, not missing data. At 81.1 GB and 57.4% Chao1 completeness (a statistical estimate of what percentage of all discoverable entities have been found), DS10 is the financial backbone of the entire analytical pipeline.
The Numbers Do Not Match (And That Is Fine)
When the corpus audit system compared files on disk to records in the database for Data Set 10, it found a delta of negative 446,891: the database held 446,891 more records than there were files on disk (PAPER TRAIL Project, 2026a). In any other data set, this would signal a serious integrity problem. In DS10, it is an expected artifact of document structure.
Deutsche Bank financial records are multi-page PDFs. A single account statement, KYC form (a "Know Your Customer" form that banks use to verify client identity), or wire confirmation may span 5, 10, or 190 pages. The processing pipeline creates a separate database record for each page, enabling page-level search, entity extraction, and annotation. A 10-page PDF becomes 10 database records but remains 1 file on disk. Across 504,000 distinct PDF files, this page-level expansion produces approximately 950,000 records.
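The page-level expansion and the negative delta it produces can be illustrated with a minimal sketch. The record layout (`file_path` and `page` fields), the function name, and the toy file names are assumptions for illustration; they are not the project's actual schema or audit script.

```python
from collections import Counter

def audit_delta(disk_files, db_records):
    """Return (delta, pages_per_file), where delta = files on disk - database records."""
    # Count how many page-level records each file produced.
    pages_per_file = Counter(rec["file_path"] for rec in db_records)
    delta = len(set(disk_files)) - len(db_records)
    return delta, pages_per_file

# Toy data: one 10-page PDF and one 1-page PDF (hypothetical paths).
disk_files = ["a.pdf", "b.pdf"]
db_records = [{"file_path": "a.pdf", "page": p} for p in range(1, 11)]
db_records.append({"file_path": "b.pdf", "page": 1})

delta, pages_per_file = audit_delta(disk_files, db_records)
print(delta)                    # 2 files - 11 records
print(dict(pages_per_file))     # records per file
```

A negative delta on its own does not distinguish page expansion from duplicate ingestion. Checking that every record's `file_path` exists on disk, and that per-file record counts match per-file page counts, is one way to confirm the benign explanation.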
The size calculation exposed this clearly. Without deduplication on file path, summing file sizes across all 950,000 records yielded 14.3 TB — an impossibility for a data set that fits on a single drive. Applying the correction (counting each file only once) produced the actual figure: 81.1 GB (PAPER TRAIL Project, 2026a).
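The correction amounts to counting each file's size once rather than once per page. A minimal sketch follows, assuming each page-level record carries the `file_path` and `size_bytes` of its parent file; both field names and the toy sizes are hypothetical.

```python
def total_size_bytes(records, dedupe=True):
    """Sum file sizes; with dedupe=True each file_path is counted only once."""
    if not dedupe:
        # Naive sum: a 10-page file contributes its full size 10 times.
        return sum(rec["size_bytes"] for rec in records)
    size_by_path = {}
    for rec in records:
        size_by_path[rec["file_path"]] = rec["size_bytes"]
    return sum(size_by_path.values())

# Toy data: a 2 MB, 10-page PDF and a 0.5 MB, 1-page PDF.
records = [
    {"file_path": "a.pdf", "page": p, "size_bytes": 2_000_000}
    for p in range(1, 11)
]
records.append({"file_path": "b.pdf", "page": 1, "size_bytes": 500_000})

naive = total_size_bytes(records, dedupe=False)
corrected = total_size_bytes(records)
print(naive, corrected)
```

The naive total overstates most for long documents, because each file's size is multiplied by its page count.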
What DS10 Contains
DOJ described Data Set 10 as containing "180,000 images and 2,000 videos seized from Epstein properties" with heavy redactions and approximately 14 hours of Epstein's own recordings, age-gated at 18+ (U.S. Department of Justice [DOJ], 2026). That description is accurate but incomplete. DS10 also contains the Deutsche Bank financial records that form the foundation of the project's financial forensics analysis.
The key documents in DS10 fall into four groups. The first is the TD Bank Suspicious Activity Report (SAR), a confidential filing that a bank sends to the government when it suspects illegal activity, in EFTA01656524.pdf; it documents $47.3 million (forty-seven million, three hundred thousand dollars) in suspicious activity across 25 subjects (PAPER TRAIL Project, 2026b). The second is the set of Deutsche Bank portal credentials (EFTA01268275 through EFTA01268283), which show Richard Kahn's login information reproduced across eight duplicate captures. The third is the Ghislaine Maxwell UBS account statements (EFTA01275697.pdf, 190 pages), which became the source of the retracted OBS-6 "Krakow" hallucination (a false reading generated by scanning software). The fourth is approximately 16,600 pages of FedEx invoices (EFTA01312563 through EFTA01337164) that support the 2,894-shipment shipping analysis (PAPER TRAIL Project, 2026c).
In short, DS10 is where the money is. The wire transfers, the compliance failures, the KYC forms, the account statements — the documentary record of Deutsche Bank's six-year relationship with Epstein's financial network lives in this data set.
57.4% Complete
The Chao1 species richness estimator — a statistical method borrowed from ecology that estimates the total number of species (or in this case, entities) in a population based on how many have been observed only once or twice — calculates DS10 at 57.4% entity completeness (PAPER TRAIL Project, 2026d). This means the statistical model predicts that 42.6% of entities that should be discoverable from DS10 documents remain undetected. This is the largest absolute gap of any data set, which is expected given that DS10 is also the largest single entity source in the corpus.
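For readers unfamiliar with the estimator, the sketch below shows how a completeness percentage can be derived from entity observation counts. It uses the classic Chao1 form, S_est = S_obs + f1^2 / (2 * f2), where f1 and f2 are the numbers of entities seen exactly once and exactly twice. Whether the project's export uses this form or a bias-corrected variant is not stated in the source, so treat the function and its toy input as assumptions.

```python
from collections import Counter

def chao1_completeness(entity_counts):
    """entity_counts maps entity -> number of observations.

    Returns (observed, estimated_total, completeness_fraction).
    """
    s_obs = len(entity_counts)
    freq = Counter(entity_counts.values())
    f1, f2 = freq[1], freq[2]      # singletons and doubletons
    if f2 > 0:
        s_est = s_obs + (f1 * f1) / (2 * f2)
    else:
        # Bias-corrected form, used here only to avoid dividing by zero.
        s_est = s_obs + f1 * (f1 - 1) / 2
    return s_obs, s_est, s_obs / s_est

# Toy input: 4 entities seen once, 2 seen twice, 2 seen many times.
counts = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 2, "F": 2, "G": 5, "H": 7}
s_obs, s_est, completeness = chao1_completeness(counts)
print(s_obs, s_est, round(100 * completeness, 1))
```

The estimate rises with the number of singletons, so a source with many entities seen only once will score lower on completeness than a source of similar size whose entities recur.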
The gap has two primary causes. First, scanning software makes severe errors when it reads blurry or damaged text on financial documents. Deutsche Bank forms combine pre-printed labels, typed text, handwritten entries, and stamps on the same page. Scanning software struggles with this layered typography, producing both missed entities and phantom entities (as OBS-5 and OBS-6 demonstrated). Second, sheer volume: DS10 is large enough that a 42.6% shortfall translates into a large absolute number of undetected entities, even though DS10 has already contributed more entities to the corpus than any other single source.
The Financial Pipeline
Nearly every financial finding in this project traces back to DS10 (PAPER TRAIL Project, 2026e). The bank records classifier processed DS10 documents into the wire_transfers and bank_documents tables. The FedEx parser extracted shipping records from DS10 invoices. The institutional forensics analysis examined DS10-sourced compliance language for willful blindness markers (language suggesting a bank deliberately avoided learning about suspicious activity). The cross-domain synthesis fuses DS10 financial events with shipping, flight, and corporate registration data.
The wire transfer corpus — 224 parsed transactions totaling $24.1 million (twenty-four million, one hundred thousand dollars) — was extracted from DS10 bank confirmations and statement pages. The VLM (Vision Language Model) recovery passes went back to DS10 source PDFs with advanced AI-powered scanning to recover dates, originators, and beneficiaries that text-only scanning had missed, ultimately achieving 58.9% date coverage, 92.9% originator coverage, and 94.6% beneficiary coverage (PAPER TRAIL Project, 2026e).
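Coverage figures of this kind are the share of parsed transactions in which a given field is populated. The sketch below shows one way to compute them; the row layout, field names, and toy values are hypothetical and do not reflect the actual wire_transfers schema or its contents.

```python
def field_coverage(rows, fields):
    """Return {field: percent of rows where the field is non-empty}."""
    total = len(rows)
    if total == 0:
        return {f: 0.0 for f in fields}
    return {
        f: 100.0 * sum(1 for r in rows if r.get(f)) / total
        for f in fields
    }

# Toy rows: None or "" marks a field the parser could not recover.
rows = [
    {"date": "2015-03-02", "originator": "Entity A", "beneficiary": "Entity B"},
    {"date": None, "originator": "Entity A", "beneficiary": "Entity C"},
    {"date": None, "originator": "", "beneficiary": "Entity B"},
    {"date": "2016-07-19", "originator": "Entity D", "beneficiary": None},
]

coverage = field_coverage(rows, ["date", "originator", "beneficiary"])
print(coverage)
```

Computing coverage before and after each recovery pass, with the same denominator, makes it possible to attribute gains to a specific pass.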
Age-Gated and Partially Redacted
DS10 is the only data set with an age gate (18+), reflecting the presence of images and recordings seized from Epstein properties (DOJ, 2026). This age-gating is separate from the redaction issue — redactions suppress content within documents, while age-gating restricts access to entire files.
The combination of heavy redactions, multi-page PDF complexity, scanning challenges on financial documents, and age-gated content makes DS10 the most analytically demanding data set in the corpus. It is also the most valuable. The financial records it contains are the closest thing to a paper trail through the Epstein network's operational infrastructure — and paper trails, however degraded, are what forensic analysis is built to follow.
References
U.S. Department of Justice. (2026). Epstein files library: Data Set 10. Published January 30, 2026. justice.gov/epstein.
PAPER TRAIL Project. (2026a). Corpus audit and size correction. [Script: app/scripts/27_corpus_audit.py; Export: _exports/audit/corpus_inventory.csv].
PAPER TRAIL Project. (2026b). TD Bank SAR extraction. [Data analysis: research/td_bank_sar_extraction.md].
PAPER TRAIL Project. (2026c). Observation log: OBS-2, OBS-6. [Data analysis: OBSERVATIONS.md].
PAPER TRAIL Project. (2026d). Chao1 completeness estimates. [Export: _exports/validation/chao1_summary.json].
PAPER TRAIL Project. (2026e). Wire transfer processing pipeline. [Scripts 16, 16b-16g; Database table: wire_transfers, db=epstein_files].