The Document Releases: 2.1 Million Pages and the Fight for Transparency

How 2.1 million pages of government documents were released, processed, and analyzed

Overview

The Epstein Files Transparency Act (P.L. 119-38) mandated the release of documents held by the Department of Justice. The resulting data drops — 12 formal data sets plus supplemental files — constitute one of the largest single releases of investigative material in U.S. history. Epstein Revealed has processed 2.1 million pages through a custom 16-script pipeline, transforming raw, unformatted government scans into structured, verifiable data.

This page provides a guide to the document releases, the processing methodology, and the key findings within each data set.

The Data Sets

December 2025: Data Sets 1–8

Released on December 19, 2025, the first batch met the 30-day statutory deadline but contained significant gaps. Our analysis of Data Sets 1–8 documents the contents, including 500+ entirely blacked-out pages and systematic redaction patterns.
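
As a rough illustration of how entirely blacked-out pages can be flagged at this scale, the sketch below measures the share of dark pixels in each page image. This is not the project's actual tooling; the directory layout, thresholds, and use of Pillow are assumptions.

```python
# Minimal sketch, not the project's actual tooling: flag pages whose scans are
# almost entirely dark pixels. Directory layout and thresholds are assumptions.
from pathlib import Path

from PIL import Image

DARK_THRESHOLD = 40    # grayscale value below which a pixel counts as dark
BLACKOUT_RATIO = 0.90  # fraction of dark pixels that marks a blacked-out page

def is_blacked_out(page_path: Path) -> bool:
    """Return True when a page image is almost entirely dark."""
    with Image.open(page_path) as img:
        pixels = list(img.convert("L").getdata())  # 8-bit grayscale values
    dark = sum(1 for value in pixels if value < DARK_THRESHOLD)
    return dark / len(pixels) >= BLACKOUT_RATIO

def count_blacked_out(pages_dir: str) -> int:
    """Count blacked-out pages among the PNG scans in a directory."""
    return sum(is_blacked_out(path) for path in Path(pages_dir).glob("*.png"))
```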

January 2026: The Major Release

The January 30, 2026 release contained the bulk of the corpus, including the email-heavy Data Sets 9 and 11 analyzed below.

The 42% Gap

Despite the Transparency Act's mandate, approximately 42% of known documents remain unreleased. The Democracy Defenders Fund FOIA litigation (Case No. 1:25-cv-02791, D.D.C.) is the primary legal challenge seeking the remainder.

Key investigation: The 42% Gap

The Email Corpus

863,000 Emails

The combined email corpus from Data Sets 9 and 11 contains approximately 863,000 individual messages. Our team utilized Relativity litigation processing — the industry standard for e-discovery — to ingest, deduplicate, and thread the corpus.

The DS9 vs. DS11 comparison documents the overlap and divergence between the two email data sets, identifying messages that were reclassified between releases.
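
Relativity's internal processing is proprietary, so the sketch below only illustrates the general idea behind deduplication and cross-set comparison: fingerprint each message on normalized header and body fields, then compare the fingerprint sets of the two releases. The message dictionaries and their field names are hypothetical stand-ins.

```python
# Illustrative only: fingerprint-based deduplication and DS9/DS11 overlap
# counting. Message dicts and their field names are hypothetical stand-ins.
import hashlib
import re

def fingerprint(msg: dict) -> str:
    """Hash sender, date, and whitespace-normalized body into one fingerprint."""
    body = re.sub(r"\s+", " ", msg["body"]).strip().lower()
    return hashlib.sha256((msg["from"] + msg["date"] + body).encode()).hexdigest()

def dedupe(messages: list[dict]) -> list[dict]:
    """Keep the first copy of each fingerprint."""
    seen: set[str] = set()
    unique = []
    for msg in messages:
        key = fingerprint(msg)
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique

def compare_sets(ds9: list[dict], ds11: list[dict]) -> dict[str, int]:
    """Count messages shared between the two releases and unique to each."""
    f9 = {fingerprint(m) for m in ds9}
    f11 = {fingerprint(m) for m in ds11}
    return {
        "in_both": len(f9 & f11),
        "only_ds9": len(f9 - f11),
        "only_ds11": len(f11 - f9),
    }
```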

Processing Methodology

The 16-Script Pipeline

Converting 2.1 million raw government scans into structured data required a custom 16-script pipeline addressing the following stages (a simplified sketch follows the list):

  • Document type identification and routing
  • Multi-format OCR (typed, handwritten, tabular)
  • Entity extraction across 16 writing scripts and character sets
  • Cross-reference linking between documents
  • Quality validation and error correction
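
The pipeline itself has not been published as code; the skeleton below is a hypothetical sketch of how such stage-by-stage routing could be organized, with placeholder functions standing in for the real OCR, entity-extraction, and validation scripts.

```python
# Hypothetical skeleton only; the real 16-script pipeline is not public in code
# form. Placeholder stages stand in for OCR, entity extraction, and validation.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Document:
    path: str
    doc_type: str = "unknown"
    text: str = ""
    entities: list[str] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)

def identify_type(doc: Document) -> Document:
    """Route by filename here; real routing would inspect page content."""
    doc.doc_type = "email" if doc.path.endswith(".eml") else "scan"
    return doc

def run_ocr(doc: Document) -> Document:
    """Placeholder for the typed, handwritten, and tabular OCR engines."""
    doc.text = f"<ocr output for {doc.path}>"
    return doc

def extract_entities(doc: Document) -> Document:
    """Placeholder for multilingual named-entity extraction."""
    doc.entities = [token for token in doc.text.split() if token.istitle()]
    return doc

def validate(doc: Document) -> Document:
    """Flag obviously empty output for manual review."""
    if not doc.text.strip():
        doc.errors.append("empty OCR output")
    return doc

# Cross-reference linking is omitted: it needs the whole corpus, not one page.
STAGES: list[Callable[[Document], Document]] = [
    identify_type, run_ocr, extract_entities, validate,
]

def process(path: str) -> Document:
    doc = Document(path)
    for stage in STAGES:
        doc = stage(doc)
    return doc
```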

Key investigation: What 2 Million Documents Look Like

OCR Challenges and Phantom Entities

Government scans are notoriously poor quality — faded ink, skewed pages, intentionally obscured text. Our analysis of OCR hallucinations documents how automated text recognition generates phantom entities: names, numbers, and references that appear in the extracted text but don't exist in the original documents.
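
One way to surface likely phantoms, sketched below, is to flag entities that occur exactly once in the corpus and only inside low-confidence OCR spans. The confidence field and threshold are assumptions for illustration, not the project's published method.

```python
# Heuristic sketch, not the project's method: treat an extracted entity as a
# possible OCR phantom if it appears only once in the corpus and only inside
# low-confidence OCR spans. Field names and the cutoff are assumptions.
from collections import Counter

def phantom_candidates(extractions: list[dict], min_conf: float = 0.80) -> set[str]:
    """
    extractions: [{"entity": "John Doe", "ocr_confidence": 0.42}, ...]
    Returns entities seen exactly once, always below the confidence cutoff.
    """
    counts = Counter(e["entity"] for e in extractions)
    low_conf_only = {
        e["entity"] for e in extractions if e["ocr_confidence"] < min_conf
    }
    return {ent for ent, n in counts.items() if n == 1 and ent in low_conf_only}
```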

The 2.38 million NER entities analysis quantifies the scale of named entity recognition across the corpus, while the error disclosure methodology establishes the statistical framework for communicating uncertainty.
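
The statistical framework itself is set out in the linked methodology; as a generic illustration of how uncertainty from a manual audit sample can be reported, the sketch below computes a standard Wilson score interval around a sampled error rate. The sample numbers are invented.

```python
# Generic illustration of reporting uncertainty from a manual audit sample,
# using the standard Wilson score interval. Sample numbers are invented.
import math

def wilson_interval(errors: int, sample_size: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an error proportion."""
    if sample_size == 0:
        return (0.0, 0.0)
    p = errors / sample_size
    denom = 1 + z**2 / sample_size
    center = (p + z**2 / (2 * sample_size)) / denom
    margin = (z * math.sqrt(p * (1 - p) / sample_size
                            + z**2 / (4 * sample_size**2))) / denom
    return (max(0.0, center - margin), min(1.0, center + margin))

# e.g. 37 phantom entities found in a hand-checked sample of 1,000 extractions
low, high = wilson_interval(37, 1000)
print(f"estimated error rate: 3.7% (95% CI {low:.1%} to {high:.1%})")
```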

The 2.5 Million Missing Pages

Not all documents survived the digitization process intact. 2.5 Million Missing Pages investigates the gap between the expected page count (based on document manifests) and the actual pages delivered, identifying patterns in what was lost.
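
A simplified version of that manifest-versus-delivery comparison looks like the sketch below; the manifest columns, the page-file naming scheme, and the TIFF format are hypothetical stand-ins for the actual DOJ releases.

```python
# Sketch of a manifest-vs-delivery comparison. Column names, the page-file
# naming scheme, and the TIFF format are hypothetical stand-ins.
import csv
from collections import Counter
from pathlib import Path

def missing_pages_report(manifest_csv: str, pages_dir: str) -> dict[str, int]:
    """Map document IDs to how many manifest-listed pages were not delivered."""
    delivered = Counter(
        path.stem.rsplit("_page_", 1)[0] for path in Path(pages_dir).glob("*.tif")
    )
    report: dict[str, int] = {}
    with open(manifest_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            gap = int(row["page_count"]) - delivered.get(row["document_id"], 0)
            if gap > 0:
                report[row["document_id"]] = gap
    return report
```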

Data Quality and Verification

The Four-Tier System

Every finding published on Epstein Revealed is classified using a four-tier verification system (a minimal schema sketch follows the list):

  • Tier 1: Direct documentary evidence (original documents, unambiguous)
  • Tier 2: Corroborated findings (multiple independent sources)
  • Tier 3: Analytical outputs (computational findings requiring interpretation)
  • Tier 4: Leads and hypotheses (requiring further investigation)
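
For illustration only, a published finding might carry its tier as structured data along the lines of the sketch below; the field names are invented, not the site's internal schema.

```python
# Minimal schema sketch of how a finding could carry its verification tier.
# Field names are illustrative, not the site's internal schema.
from dataclasses import dataclass
from enum import IntEnum

class Tier(IntEnum):
    DIRECT_DOCUMENTARY = 1  # original documents, unambiguous
    CORROBORATED = 2        # multiple independent sources
    ANALYTICAL = 3          # computational findings requiring interpretation
    LEAD = 4                # hypotheses requiring further investigation

@dataclass
class Finding:
    summary: str
    tier: Tier
    source_documents: list[str]

example = Finding(
    summary="Example finding text",
    tier=Tier.ANALYTICAL,
    source_documents=["DS9-000001"],  # hypothetical document ID
)
```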

Download and Verify

In the spirit of open-source intelligence, our download and verify guide provides instructions for independently accessing and verifying the DOJ data sets.
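
The guide covers the specifics; as a generic illustration of the verification step, the sketch below streams each downloaded file through SHA-256 and compares it against a published checksum list. Whether checksums are distributed in exactly this format is an assumption.

```python
# Generic checksum verification sketch. The checksum-list format is an
# assumption; adapt the parsing to whatever the download guide specifies.
import hashlib
from pathlib import Path

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a large file through SHA-256 without loading it into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(downloads_dir: str, checksums_file: str) -> list[str]:
    """Return the names of files whose hashes do not match the published list."""
    mismatches = []
    for line in Path(checksums_file).read_text().splitlines():
        expected, name = line.split()
        if sha256_of(str(Path(downloads_dir) / name)) != expected:
            mismatches.append(name)
    return mismatches
```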

Key investigation: The Distribution Roadmap

Key Document Deep Dives

Our most significant individual document analyses are collected in this section.

Where to Start

If you're new to the document releases, we recommend:

  1. What 2 Million Documents Look Like — Visual overview of the corpus
  2. Data Sets 1–8 — The first release
  3. The 16-Script Pipeline — How we process the data
  4. The Four-Tier Verification System — How we verify findings