Overview
The Epstein Files Transparency Act (P.L. 119-38) mandated the release of documents held by the Department of Justice. The resulting data drops — 12 formal data sets plus supplemental files — constitute one of the largest single releases of investigative material in U.S. history. Epstein Revealed has processed 2.1 million pages through a custom 16-script pipeline, transforming raw, unformatted government scans into structured, verifiable data.
This page provides a guide to the document releases, the processing methodology, and the key findings within each data set.
The Data Sets
December 2025: Data Sets 1–8
Released on December 19, 2025, the first batch met the 30-day statutory deadline but contained significant gaps. Our analysis of Data Sets 1–8 documents the contents, including 500+ entirely blacked-out pages and systematic redaction patterns.
January 2026: The Major Release
The January 30, 2026 release represented the bulk of the corpus:
- Data Set 9: 531,256 Emails — The first tranche of the email corpus, including personal correspondence, business communications, and legal files
- Data Set 10: 950,921 Deutsche Bank Records — Financial records spanning the entire Deutsche Bank relationship
- Data Set 11: Misidentified Emails — A subset of emails incorrectly classified in the initial release, later reclassified
- Data Set 12: 150 Supplemental Files — Additional documents released after the initial batches
The 42% Gap
Despite the Transparency Act's mandate, approximately 42% of known documents remain unreleased. The Democracy Defenders Fund FOIA litigation (Case No. 1:25-cv-02791, D.D.C.) is the primary legal challenge seeking the remainder.
Key investigation: The 42% Gap
The Email Corpus
863,000 Emails
The combined email corpus from Data Sets 9 and 11 contains approximately 863,000 individual messages. Our team used Relativity litigation processing, the industry standard for e-discovery, to ingest, deduplicate, and thread the corpus.
The DS9 vs. DS11 comparison documents the overlap and divergence between the two email data sets, identifying messages that were reclassified between releases.
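A set comparison is one simple way to picture this kind of overlap analysis. The sketch below is illustrative only and is not the Relativity workflow described above: it assumes each message carries a Message-ID header, and the function names and sample IDs are hypothetical. Lowercasing the IDs is a matching heuristic, not a rule of the email standards.

```python
from typing import Dict, Iterable, Set


def normalize_message_id(raw: str) -> str:
    """Trim whitespace and angle brackets, then lowercase (a heuristic)."""
    return raw.strip().strip("<>").lower()


def compare_data_sets(ds9_ids: Iterable[str], ds11_ids: Iterable[str]) -> Dict[str, Set[str]]:
    """Split message IDs into those unique to each set and those in both."""
    ds9 = {normalize_message_id(i) for i in ds9_ids}
    ds11 = {normalize_message_id(i) for i in ds11_ids}
    return {
        "only_ds9": ds9 - ds11,
        "only_ds11": ds11 - ds9,
        "both": ds9 & ds11,
    }


# Hypothetical IDs, for illustration only.
result = compare_data_sets(
    ["<A1@example.com>", "b2@example.com"],
    [" <a1@example.com> ", "c3@example.com"],
)
print(sorted(result["both"]))
```

Messages in "both" are overlap candidates; the two "only" sets are where a reclassification review would start.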
Processing Methodology
The 16-Script Pipeline
Converting 2.1 million pages of raw government scans into structured data required a custom 16-script pipeline addressing:
- Document type identification and routing
- Multi-format OCR (typed, handwritten, tabular)
- Entity extraction across 16 scripts and character sets
- Cross-reference linking between documents
- Quality validation and error correction
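The stages above can be pictured as a chain of small functions, with OCR routed by document type. This is a minimal sketch under stated assumptions, not the actual 16-script pipeline: the `Document` fields, the handler names, and the three type labels are hypothetical stand-ins, and the OCR handlers only record which route was taken.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    doc_id: str
    doc_type: str  # hypothetical labels: "typed", "handwritten", "tabular"
    notes: List[str] = field(default_factory=list)


Stage = Callable[[Document], Document]


def make_ocr_handler(label: str) -> Stage:
    """Build a placeholder OCR stage that only records the chosen route."""
    def handler(doc: Document) -> Document:
        doc.notes.append(f"ocr:{label}")
        return doc
    return handler


# Routing table: one placeholder handler per assumed document type.
OCR_ROUTES: Dict[str, Stage] = {
    label: make_ocr_handler(label) for label in ("typed", "handwritten", "tabular")
}


def route_ocr(doc: Document) -> Document:
    """Send the document to the handler for its type, or mark it unrouted."""
    handler = OCR_ROUTES.get(doc.doc_type)
    if handler is None:
        doc.notes.append("ocr:unrouted")
        return doc
    return handler(doc)


def validate(doc: Document) -> Document:
    """Mark unrouted documents for manual review; pass the rest."""
    status = "needs_review" if "ocr:unrouted" in doc.notes else "validated"
    doc.notes.append(status)
    return doc


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Apply each stage in order, passing the document along."""
    for stage in stages:
        doc = stage(doc)
    return doc


print(run_pipeline(Document("d1", "tabular"), [route_ocr, validate]).notes)
```

Keeping each stage as a function with the same signature means stages can be reordered or tested on their own.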
Key investigation: What 2 Million Documents Look Like
OCR Challenges and Phantom Entities
Government scans are notoriously poor in quality, with faded ink, skewed pages, and intentionally obscured file extensions. Our analysis of OCR hallucinations documents how automated text recognition generates phantom entities: names, numbers, and references that appear in the extracted text but do not exist in the original documents.
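One simple way to surface candidate phantom entities is to check each extracted entity against the text from a second, independent OCR pass of the same page. The sketch below assumes such a second pass exists, which this page does not state. The function names and sample strings are hypothetical, and a flagged entity is only a candidate for manual review, not a confirmed hallucination.

```python
import re
from typing import Iterable, List


def normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase for a tolerant comparison."""
    return re.sub(r"\s+", " ", text).strip().lower()


def flag_phantom_candidates(entities: Iterable[str], second_pass_text: str) -> List[str]:
    """Return entities that do not appear anywhere in the second OCR pass."""
    haystack = normalize(second_pass_text)
    return [e for e in entities if normalize(e) not in haystack]


# Placeholder strings, for illustration only.
entities = ["Example Person", "Sample  Holdings", "Exarnple Person"]
second_pass = "Letter from EXAMPLE PERSON to Sample Holdings"
print(flag_phantom_candidates(entities, second_pass))
```

Substring matching is deliberately crude: it will miss phantom entities that both passes produce, and it can flag real entities that the second pass misread.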
The 2.38 million NER entities analysis quantifies the scale of named entity recognition across the corpus, while the error disclosure methodology establishes the statistical framework for communicating uncertainty.
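As one possible way to express such uncertainty, an error rate measured on a manually reviewed sample can be reported with a confidence interval rather than as a single number. The sketch below uses the Wilson score interval, a standard choice for proportions. It is an assumption for illustration, not necessarily the framework used in the error disclosure methodology, and the counts in the usage line are made up.

```python
import math
from typing import Tuple


def wilson_interval(errors: int, sample_size: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for an error proportion (z=1.96 is about 95%)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= errors <= sample_size:
        raise ValueError("errors must be between 0 and sample_size")
    p = errors / sample_size
    z2 = z * z
    denom = 1 + z2 / sample_size
    center = (p + z2 / (2 * sample_size)) / denom
    half = z * math.sqrt(p * (1 - p) / sample_size + z2 / (4 * sample_size ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Made-up sample: 50 erroneous entities found in 100 reviewed.
low, high = wilson_interval(50, 100)
print(f"{low:.3f} to {high:.3f}")
```

Unlike the simpler normal approximation, the Wilson interval stays inside 0 to 1 and behaves sensibly when the observed error count is zero.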
The 2.5 Million Missing Pages
Not all documents survived the digitization process intact. Our investigation, 2.5 Million Missing Pages, examines the gap between the expected page count (based on document manifests) and the pages actually delivered, identifying patterns in what was lost.
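The underlying arithmetic is a per-document subtraction of delivered pages from manifest pages. The sketch below assumes the manifest and the delivery can each be reduced to a mapping from document ID to page count. That structure, the function name, and the sample values are hypothetical.

```python
from typing import Dict


def page_gaps(manifest: Dict[str, int], delivered: Dict[str, int]) -> Dict[str, int]:
    """Return missing page counts for documents that fall short of the manifest."""
    gaps = {}
    for doc_id, expected in manifest.items():
        # A document absent from the delivery counts as zero pages received.
        missing = expected - delivered.get(doc_id, 0)
        if missing > 0:
            gaps[doc_id] = missing
    return gaps


# Hypothetical counts, for illustration only.
manifest = {"DOC-A": 10, "DOC-B": 5, "DOC-C": 3}
delivered = {"DOC-A": 10, "DOC-B": 2}
gaps = page_gaps(manifest, delivered)
print(gaps, sum(gaps.values()))
```

Grouping the resulting gaps by data set or document type is one way to look for patterns in what was lost.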
Data Quality and Verification
The Four-Tier System
Every finding published on Epstein Revealed is classified using a four-tier verification system:
- Tier 1: Direct documentary evidence (original documents, unambiguous)
- Tier 2: Corroborated findings (multiple independent sources)
- Tier 3: Analytical outputs (computational findings requiring interpretation)
- Tier 4: Leads and hypotheses (requiring further investigation)
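In code, the four tiers map naturally onto an ordered enumeration, so findings can be filtered by evidentiary strength. This is a minimal sketch: the tier numbers and meanings come from the list above, while the class names, the `Finding` structure, and the sample findings are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, List


class Tier(IntEnum):
    """Lower number means stronger evidence."""
    DIRECT_EVIDENCE = 1   # original documents, unambiguous
    CORROBORATED = 2      # multiple independent sources
    ANALYTICAL = 3        # computational findings requiring interpretation
    LEAD = 4              # requires further investigation


@dataclass(frozen=True)
class Finding:
    summary: str
    tier: Tier


def at_least(findings: Iterable[Finding], threshold: Tier) -> List[Finding]:
    """Keep findings whose evidence is as strong as the threshold or stronger."""
    return [f for f in findings if f.tier <= threshold]


# Placeholder findings, for illustration only.
findings = [
    Finding("placeholder finding A", Tier.DIRECT_EVIDENCE),
    Finding("placeholder finding B", Tier.ANALYTICAL),
    Finding("placeholder finding C", Tier.LEAD),
]
print([f.summary for f in at_least(findings, Tier.CORROBORATED)])
```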
Download and Verify
In the spirit of open-source intelligence, our download and verify guide provides instructions for independently accessing and verifying the DOJ data sets.
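A common building block for independent verification is comparing a downloaded file's SHA-256 digest with a reference value. The sketch below assumes a reference digest is available from a trusted source, which this page does not confirm. The function names are hypothetical, and the file path and digest in the usage note are placeholders.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """Return True when the file's digest matches the reference digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

A call would look like `verify("dataset.zip", reference_digest)`, where both arguments are placeholders. A mismatch means the file differs from the reference copy, but it does not by itself say which copy is correct.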
Key investigation: The Distribution Roadmap
Key Document Deep Dives
Our most significant individual document analyses:
- The 29-Page SAR — TD Bank's Suspicious Activity Report documenting $47.3 million
- The Approval Email — The uncorroborated Deutsche Bank compliance claim
- The Birthday Book — 238 pages of organized contacts
- The Non-Prosecution Agreement — Nine pages that shaped twelve years
- Banking Credentials Released: OBS-2 — Financial credentials in the public release
- The Last Will: August 8, 2019 — Signed two days before death
Where to Start
If you're new to the document releases, we recommend:
- What 2 Million Documents Look Like — Visual overview of the corpus
- Data Sets 1–8 — The first release
- The 16-Script Pipeline — How we process the data
- The Four-Tier Verification System — How we verify findings