The Document Releases: 2.1 Million Pages and the Fight for Transparency

How 2.1 million pages of government documents were released, processed, and analyzed

Overview

The Epstein Files Transparency Act (P.L. 119-38) mandated the release of documents held by the Department of Justice. The resulting data drops — 12 formal data sets plus supplemental files — constitute one of the largest single releases of investigative material in U.S. history. Epstein Revealed has processed 2.1 million pages through a custom 16-script pipeline, transforming raw, unformatted government scans into structured, verifiable data.

This page provides a guide to the document releases, the processing methodology, and the key findings within each data set.

The Data Sets

December 2025: Data Sets 1–8

Released on December 19, 2025, the first batch met the 30-day statutory deadline but contained significant gaps. Our analysis of Data Sets 1–8 documents the contents, including 500+ entirely blacked-out pages and systematic redaction patterns.
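
As a rough illustration of how entirely blacked-out pages can be flagged at this scale, the sketch below measures the share of dark pixels in each page image. This is not the project's actual tooling; the directory layout, thresholds, and use of Pillow are assumptions.

```python
# Minimal sketch, not the project's actual tooling: flag pages whose scans are
# almost entirely dark pixels. Directory layout and thresholds are assumptions.
from pathlib import Path

from PIL import Image

DARK_THRESHOLD = 40    # grayscale value below which a pixel counts as dark
BLACKOUT_RATIO = 0.90  # fraction of dark pixels that marks a blacked-out page

def is_blacked_out(page_path: Path) -> bool:
    """Return True when a page image is almost entirely dark."""
    with Image.open(page_path) as img:
        pixels = list(img.convert("L").getdata())  # 8-bit grayscale values
    dark = sum(1 for value in pixels if value < DARK_THRESHOLD)
    return dark / len(pixels) >= BLACKOUT_RATIO

def count_blacked_out(pages_dir: str) -> int:
    """Count blacked-out pages among the PNG scans in a directory."""
    return sum(is_blacked_out(path) for path in Path(pages_dir).glob("*.png"))
```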

January 2026: The Major Release

The January 30, 2026 release contained the bulk of the corpus, including the email-heavy Data Sets 9 and 11 analyzed below.

The 42% Gap

Despite the Transparency Act's mandate, approximately 42% of known documents remain unreleased. The Democracy Defenders Fund FOIA litigation (Case No. 1:25-cv-02791, D.D.C.) is the primary legal challenge seeking the remainder.

Key investigation: The 42% Gap

The Email Corpus

863,000 Emails

The combined email corpus from Data Sets 9 and 11 contains approximately 863,000 individual messages. Our team utilized Relativity litigation processing — the industry standard for e-discovery — to ingest, deduplicate, and thread the corpus.

The DS9 vs. DS11 comparison documents the overlap and divergence between the two email data sets, identifying messages that were reclassified between releases.
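
Relativity's internal processing is proprietary, so the sketch below only illustrates the general idea behind deduplication and cross-set comparison: fingerprint each message on normalized header and body fields, then compare the fingerprint sets of the two releases. The message dictionaries and their field names are hypothetical stand-ins.

```python
# Illustrative only: fingerprint-based deduplication and DS9/DS11 overlap
# counting. Message dicts and their field names are hypothetical stand-ins.
import hashlib
import re

def fingerprint(msg: dict) -> str:
    """Hash sender, date, and whitespace-normalized body into one fingerprint."""
    body = re.sub(r"\s+", " ", msg["body"]).strip().lower()
    return hashlib.sha256((msg["from"] + msg["date"] + body).encode()).hexdigest()

def dedupe(messages: list[dict]) -> list[dict]:
    """Keep the first copy of each fingerprint."""
    seen: set[str] = set()
    unique = []
    for msg in messages:
        key = fingerprint(msg)
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique

def compare_sets(ds9: list[dict], ds11: list[dict]) -> dict[str, int]:
    """Count messages shared between the two releases and unique to each."""
    f9 = {fingerprint(m) for m in ds9}
    f11 = {fingerprint(m) for m in ds11}
    return {
        "in_both": len(f9 & f11),
        "only_ds9": len(f9 - f11),
        "only_ds11": len(f11 - f9),
    }
```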

Processing Methodology

The 16-Script Pipeline

Converting 2.1 million raw government scans into structured data required a custom 16-script pipeline addressing the following stages (a simplified sketch follows the list):

  • Document type identification and routing
  • Multi-format OCR (typed, handwritten, tabular)
  • Entity extraction across 16 writing scripts and character sets
  • Cross-reference linking between documents
  • Quality validation and error correction
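
The pipeline itself has not been published as code; the skeleton below is a hypothetical sketch of how such stage-by-stage routing could be organized, with placeholder functions standing in for the real OCR, entity-extraction, and validation scripts.

```python
# Hypothetical skeleton only; the real 16-script pipeline is not public in code
# form. Placeholder stages stand in for OCR, entity extraction, and validation.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Document:
    path: str
    doc_type: str = "unknown"
    text: str = ""
    entities: list[str] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)

def identify_type(doc: Document) -> Document:
    """Route by filename here; real routing would inspect page content."""
    doc.doc_type = "email" if doc.path.endswith(".eml") else "scan"
    return doc

def run_ocr(doc: Document) -> Document:
    """Placeholder for the typed, handwritten, and tabular OCR engines."""
    doc.text = f"<ocr output for {doc.path}>"
    return doc

def extract_entities(doc: Document) -> Document:
    """Placeholder for multilingual named-entity extraction."""
    doc.entities = [token for token in doc.text.split() if token.istitle()]
    return doc

def validate(doc: Document) -> Document:
    """Flag obviously empty output for manual review."""
    if not doc.text.strip():
        doc.errors.append("empty OCR output")
    return doc

# Cross-reference linking is omitted: it needs the whole corpus, not one page.
STAGES: list[Callable[[Document], Document]] = [
    identify_type, run_ocr, extract_entities, validate,
]

def process(path: str) -> Document:
    doc = Document(path)
    for stage in STAGES:
        doc = stage(doc)
    return doc
```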

Key investigation: What 2 Million Documents Look Like

OCR Challenges and Phantom Entities

Government scans are notoriously poor quality — faded ink, skewed pages, intentionally obscured text. Our analysis of OCR hallucinations documents how automated text recognition generates phantom entities: names, numbers, and references that appear in the extracted text but don't exist in the original documents.
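
One way to surface likely phantoms, sketched below, is to flag entities that occur exactly once in the corpus and only inside low-confidence OCR spans. The confidence field and threshold are assumptions for illustration, not the project's published method.

```python
# Heuristic sketch, not the project's method: treat an extracted entity as a
# possible OCR phantom if it appears only once in the corpus and only inside
# low-confidence OCR spans. Field names and the cutoff are assumptions.
from collections import Counter

def phantom_candidates(extractions: list[dict], min_conf: float = 0.80) -> set[str]:
    """
    extractions: [{"entity": "John Doe", "ocr_confidence": 0.42}, ...]
    Returns entities seen exactly once, always below the confidence cutoff.
    """
    counts = Counter(e["entity"] for e in extractions)
    low_conf_only = {
        e["entity"] for e in extractions if e["ocr_confidence"] < min_conf
    }
    return {ent for ent, n in counts.items() if n == 1 and ent in low_conf_only}
```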

The 2.38 million NER entities analysis quantifies the scale of named entity recognition across the corpus, while the error disclosure methodology establishes the statistical framework for communicating uncertainty.
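
The statistical framework itself is set out in the linked methodology; as a generic illustration of how uncertainty from a manual audit sample can be reported, the sketch below computes a standard Wilson score interval around a sampled error rate. The sample numbers are invented.

```python
# Generic illustration of reporting uncertainty from a manual audit sample,
# using the standard Wilson score interval. Sample numbers are invented.
import math

def wilson_interval(errors: int, sample_size: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an error proportion."""
    if sample_size == 0:
        return (0.0, 0.0)
    p = errors / sample_size
    denom = 1 + z**2 / sample_size
    center = (p + z**2 / (2 * sample_size)) / denom
    margin = (z * math.sqrt(p * (1 - p) / sample_size
                            + z**2 / (4 * sample_size**2))) / denom
    return (max(0.0, center - margin), min(1.0, center + margin))

# e.g. 37 phantom entities found in a hand-checked sample of 1,000 extractions
low, high = wilson_interval(37, 1000)
print(f"estimated error rate: 3.7% (95% CI {low:.1%} to {high:.1%})")
```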

The 2.5 Million Missing Pages

Not all documents survived the digitization process intact. 2.5 Million Missing Pages investigates the gap between the expected page count (based on document manifests) and the actual pages delivered, identifying patterns in what was lost.
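
A simplified version of that manifest-versus-delivery comparison looks like the sketch below; the manifest columns, the page-file naming scheme, and the TIFF format are hypothetical stand-ins for the actual DOJ releases.

```python
# Sketch of a manifest-vs-delivery comparison. Column names, the page-file
# naming scheme, and the TIFF format are hypothetical stand-ins.
import csv
from collections import Counter
from pathlib import Path

def missing_pages_report(manifest_csv: str, pages_dir: str) -> dict[str, int]:
    """Map document IDs to how many manifest-listed pages were not delivered."""
    delivered = Counter(
        path.stem.rsplit("_page_", 1)[0] for path in Path(pages_dir).glob("*.tif")
    )
    report: dict[str, int] = {}
    with open(manifest_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            gap = int(row["page_count"]) - delivered.get(row["document_id"], 0)
            if gap > 0:
                report[row["document_id"]] = gap
    return report
```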

Data Quality and Verification

The Four-Tier System

Every finding published on Epstein Revealed is classified using a four-tier verification system (a minimal schema sketch follows the list):

  • Tier 1: Direct documentary evidence (original documents, unambiguous)
  • Tier 2: Corroborated findings (multiple independent sources)
  • Tier 3: Analytical outputs (computational findings requiring interpretation)
  • Tier 4: Leads and hypotheses (requiring further investigation)
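
For illustration only, a published finding might carry its tier as structured data along the lines of the sketch below; the field names are invented, not the site's internal schema.

```python
# Minimal schema sketch of how a finding could carry its verification tier.
# Field names are illustrative, not the site's internal schema.
from dataclasses import dataclass
from enum import IntEnum

class Tier(IntEnum):
    DIRECT_DOCUMENTARY = 1  # original documents, unambiguous
    CORROBORATED = 2        # multiple independent sources
    ANALYTICAL = 3          # computational findings requiring interpretation
    LEAD = 4                # hypotheses requiring further investigation

@dataclass
class Finding:
    summary: str
    tier: Tier
    source_documents: list[str]

example = Finding(
    summary="Example finding text",
    tier=Tier.ANALYTICAL,
    source_documents=["DS9-000001"],  # hypothetical document ID
)
```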

Download and Verify

In the spirit of open-source intelligence, our download and verify guide provides instructions for independently accessing and verifying the DOJ data sets.
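
The guide covers the specifics; as a generic illustration of the verification step, the sketch below streams each downloaded file through SHA-256 and compares it against a published checksum list. Whether checksums are distributed in exactly this format is an assumption.

```python
# Generic checksum verification sketch. The checksum-list format is an
# assumption; adapt the parsing to whatever the download guide specifies.
import hashlib
from pathlib import Path

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a large file through SHA-256 without loading it into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(downloads_dir: str, checksums_file: str) -> list[str]:
    """Return the names of files whose hashes do not match the published list."""
    mismatches = []
    for line in Path(checksums_file).read_text().splitlines():
        expected, name = line.split()
        if sha256_of(str(Path(downloads_dir) / name)) != expected:
            mismatches.append(name)
    return mismatches
```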

Key investigation: The Distribution Roadmap

Key Document Deep Dives

Our most significant individual document analyses are collected in this section.

Where to Start

If you're new to the document releases, we recommend:

  1. What 2 Million Documents Look Like — Visual overview of the corpus
  2. Data Sets 1–8 — The first release
  3. The 16-Script Pipeline — How we process the data
  4. The Four-Tier Verification System — How we verify findings