What 2.1 Million Documents Look Like | Epstein Revealed

TLDR

The Jeffrey Epstein document corpus contains 2,100,266 files across 12 DOJ data sets and 6 source directories, totaling approximately 331 GB. It spans financial records, 863,000 emails, flight logs, FBI files, civil litigation records, and House Oversight materials — the largest publicly released collection of documents related to a single criminal network in U.S. history.

The Scale of What Was Released

When the Department of Justice published the first eight Epstein data sets on December 19, 2025, it marked the beginning of the largest single document release under the Epstein Files Transparency Act (Epstein Files Transparency Act, Pub. L. No. 119-38, 2025). Data Sets 9 through 12 followed on January 30, 2026 (U.S. Department of Justice [DOJ], 2026a). Together they contain 2,100,266 files organized into six root directories on disk: DOJ releases, FBI Vault FOIA materials, Giuffre v. Maxwell civil litigation records, House Oversight Committee documents, flight logs, and a GitHub mirror index (PAPER TRAIL Project, 2026a).

The numbers alone are difficult to grasp. The raw DOJ data totals approximately 221 GB. When you add House Oversight materials, FBI Vault releases, and other sources, the full corpus reaches 331 GB across 17 database sources (PAPER TRAIL Project, 2026b). Reading one file per minute, eight hours a day with no days off, would take about 12 years.
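The reading-time figure can be reproduced with a few lines of arithmetic. The sketch below assumes one item per minute, eight hours a day, and no days off (the 365-day year is an assumption, not something the audit states). It runs the estimate twice: once treating each of the 2,100,266 files as a single item, and once for the approximately 3.5 million pages in the DOJ release.

```python
# Reading-time estimate under stated assumptions (not figures from the audit).
FILES = 2_100_266            # total files in the corpus (from the article)
RELEASED_PAGES = 3_500_000   # approximate pages released by DOJ (from the article)
MINUTES_PER_DAY = 8 * 60     # eight hours of reading per day
DAYS_PER_YEAR = 365          # assumption: no weekends or holidays off


def years_to_read(items: int, items_per_minute: float = 1.0) -> float:
    """Return the years needed to read `items` at the given pace."""
    minutes = items / items_per_minute
    return minutes / MINUTES_PER_DAY / DAYS_PER_YEAR


file_years = years_to_read(FILES)
page_years = years_to_read(RELEASED_PAGES)
print(f"one file per minute: {file_years:.1f} years")
print(f"one page per minute: {page_years:.1f} years")
```

Under these assumptions the file-based estimate comes out at about 12 years and the page-based estimate at about 20.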

What the Data Sets Contain

The corpus is not one monolithic block. Each data set has a distinct character.

Data Sets 9 and 11 together constitute the largest sub-corpus: 863,000 email documents. DS9 alone contains 531,256 email PDFs weighing 94.5 GB, with a perfect match between files on disk and records in the database (PAPER TRAIL Project, 2026b). DS11 was initially described by DOJ as "financial ledgers and USVI flight manifests," but inspection revealed it actually contains 331,655 PDFs of seized email correspondence — a misidentification that went unnoticed until systematic analysis of the file contents (PAPER TRAIL Project, 2026c).

Data Set 10 is the financial backbone: 950,921 page-level records derived from 504,000 distinct multi-page PDFs of Deutsche Bank records (PAPER TRAIL Project, 2026b). These are the documents that produced the wire transfer analysis, the compliance failure reconstruction, and the institutional forensics that run through the financial chapters of this series. At 81.1 GB, it is the third-largest data set by disk size.

House Oversight records contribute 59,694 documents, including 33,655 files with DOJ-OGR identifiers and 26,034 with HOUSE_OVERSIGHT identifiers (PAPER TRAIL Project, 2026b). These include estate records, the Non-Prosecution Agreement, the birthday book, and the last will and testament.

How Files Are Organized

Files in the DOJ data sets follow the DOJ Epstein Library naming convention: an EFTA prefix followed by an eight-digit identifier (for example, EFTA00039025.pdf). The email data sets use a variant format with Bates-stamped Relativity load files (EFTA_R1_xxxxxxxx), indicating the emails were processed through litigation review software before release (PAPER TRAIL Project, 2026a).
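As an illustration of the naming convention, the sketch below classifies a file name as either the standard or the Relativity variant form. It is not the project's own parser. The function name `classify_identifier` is hypothetical, and the pattern for the variant assumes that the eight x's in EFTA_R1_xxxxxxxx stand for eight digits, which the source does not state.

```python
import re
from typing import Optional

# Standard form from the article: "EFTA" + eight digits, e.g. EFTA00039025.pdf
STANDARD_ID = re.compile(r"^EFTA(\d{8})$")
# Variant form from the article: EFTA_R1_xxxxxxxx.
# Assumption: the eight x's are eight digits; the article does not say so.
RELATIVITY_ID = re.compile(r"^EFTA_R1_(\d{8})$")


def classify_identifier(filename: str) -> Optional[tuple[str, int]]:
    """Return (kind, number) for a recognized file name, else None."""
    # Drop a trailing extension such as ".pdf" if one is present.
    stem = filename.rsplit(".", 1)[0] if "." in filename else filename
    for kind, pattern in (("standard", STANDARD_ID), ("relativity", RELATIVITY_ID)):
        match = pattern.match(stem)
        if match:
            return kind, int(match.group(1))
    return None


print(classify_identifier("EFTA00039025.pdf"))
```

Names that match neither pattern, such as files carrying DOJ-OGR or HOUSE_OVERSIGHT identifiers, return `None` and would need their own patterns.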

The six root directories on disk mirror the institutional sources: 01_DOJ_DataSets/ holds the 12 data sets, 02_FBI_Vault/ contains FOIA releases, 03_Giuffre_v_Maxwell/ stores civil litigation materials, 04_House_Oversight/ has congressional records, 05_Flight_Logs/ contains aircraft logs, and 06_GitHub_Mirror_Index/ tracks mirrored sources.
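A minimal sketch of how this layout could be inventoried is shown below. The directory names come from the list above; the `count_files` helper and the idea of passing in a local corpus root are assumptions for illustration, not part of the project's published tooling.

```python
from pathlib import Path

# Directory names as listed in the article, with a short note on each.
ROOT_DIRECTORIES = {
    "01_DOJ_DataSets": "the 12 DOJ data sets",
    "02_FBI_Vault": "FOIA releases",
    "03_Giuffre_v_Maxwell": "civil litigation materials",
    "04_House_Oversight": "congressional records",
    "05_Flight_Logs": "aircraft logs",
    "06_GitHub_Mirror_Index": "index of mirrored sources",
}


def count_files(corpus_root: Path) -> dict[str, int]:
    """Count regular files under each root directory (0 if it is absent)."""
    counts: dict[str, int] = {}
    for name in ROOT_DIRECTORIES:
        directory = corpus_root / name
        if directory.is_dir():
            # rglob("*") walks all nested subdirectories.
            counts[name] = sum(1 for p in directory.rglob("*") if p.is_file())
        else:
            counts[name] = 0
    return counts
```

Calling `count_files(Path("/path/to/corpus"))` on a local copy returns one count per directory, which can then be compared against the database totals.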

Duplication and Integrity

One immediate concern with any document collection this large is duplication. A 50,000-file sample found 1,957 exact content matches, a duplication rate of about 3.9% (PAPER TRAIL Project, 2026b). Across the full corpus, 4,238 duplicate identifiers were found out of 2,100,266 total, or about 0.2%. Both figures are remarkably low for a collection assembled from multiple agencies, legal proceedings, and FOIA responses over decades.
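The two duplication figures can be checked directly from the counts given above. The helper name `rate_percent` is hypothetical; the inputs are the article's own numbers.

```python
def rate_percent(matches: int, total: int) -> float:
    """Return matches as a percentage of total."""
    return 100.0 * matches / total


# Exact content matches in the 50,000-file sample.
sample_rate = rate_percent(1_957, 50_000)
# Duplicate identifiers across the full corpus.
corpus_id_rate = rate_percent(4_238, 2_100_266)

print(f"sample content duplication: {sample_rate:.2f}%")
print(f"corpus identifier duplication: {corpus_id_rate:.2f}%")
```

The sample rate comes out at about 3.9% and the identifier rate at about 0.2%. The two are not directly comparable: the first counts files with identical contents in a sample, while the second counts repeated identifiers across the whole corpus.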

The corpus audit system checks database records against disk files across all 17 sources, detecting overlaps at four tiers: identifier matching, filename matching, size-plus-type matching, and SHA-256 cryptographic hash verification, which computes an effectively unique digital fingerprint from each file's contents (PAPER TRAIL Project, 2026d). This four-tier approach confirmed that newly downloaded House Oversight materials (estate records, the NPA) partially overlapped with existing database entries while also containing genuinely new documents.
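The four tiers can be sketched as follows. This is an illustration of the general technique, not the contents of the project's audit script: the `Record` fields, the `sha256_of` helper, and the `matching_tiers` function are all hypothetical names. Only `hashlib.sha256` is a real standard-library call.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Record:
    identifier: str   # e.g. "EFTA00039025"
    filename: str
    size: int         # size in bytes
    file_type: str    # e.g. "pdf"
    sha256: str       # hex digest of the file contents


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matching_tiers(a: Record, b: Record) -> list[str]:
    """List every tier at which two records overlap, in the article's order."""
    tiers = []
    if a.identifier == b.identifier:
        tiers.append("identifier")
    if a.filename == b.filename:
        tiers.append("filename")
    if (a.size, a.file_type) == (b.size, b.file_type):
        tiers.append("size+type")
    if a.sha256 == b.sha256:
        tiers.append("sha256")
    return tiers
```

Reporting every matching tier, rather than stopping at the first, makes partial overlaps visible. A file that arrives under a new identifier and file name but with the same hash is still flagged as a content duplicate, while a match on size and type alone is only a weak hint.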

What Is Not Here

For all its size, the corpus is incomplete. The DOJ identified "more than six million pages as potentially responsive" but released approximately 3.5 million — a 42% gap (DOJ, 2026a). Senator Wyden's Treasury investigation documented $1.08 billion in 4,725 wire transfers through Epstein accounts (PAPER TRAIL Project, 2026e), but those Treasury records are not in the public corpus. The complete FBI 302 interview summaries, many Suspicious Activity Report filings, and an estimated 2.5 million pages remain unreleased (PAPER TRAIL Project, 2026f).
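The gap figure follows from the two page counts. The sketch below treats "more than six million" as exactly six million, which is an assumption that makes the result a lower bound, and uses the approximate 3.5 million released pages.

```python
IDENTIFIED_PAGES = 6_000_000   # "more than six million": treated as a floor
RELEASED_PAGES = 3_500_000     # approximate figure from the DOJ release

unreleased_pages = IDENTIFIED_PAGES - RELEASED_PAGES
gap_percent = 100.0 * unreleased_pages / IDENTIFIED_PAGES

print(f"unreleased pages: {unreleased_pages:,}")
print(f"gap: {gap_percent:.0f}%")
```

This gives 2.5 million unreleased pages, or about 42% of the pages identified. If the true number of identified pages is higher than six million, the gap is larger.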

What we have is vast. What we are missing may be larger. The 2.1 million documents in this corpus are not the full story — they are the fraction of the story that the government chose to make visible.

References

PAPER TRAIL Project. (2026a). DOJ Epstein Library portal structure and data set breakdown [Research document]. research/doj_release_index.md

PAPER TRAIL Project. (2026b). Corpus audit: Disk and database reconciliation, duplication rates [Data set]. _exports/audit/corpus_inventory.csv

PAPER TRAIL Project. (2026c). DS11 content resolution (OBS-7) [Observation log]. OBSERVATIONS.md

PAPER TRAIL Project. (2026d). Corpus audit and gap analysis script [Computer software]. app/scripts/27_corpus_audit.py

PAPER TRAIL Project. (2026e). External government sources inventory [Research document]. research/external_government_sources.md

PAPER TRAIL Project. (2026f). DOJ compliance analysis [Research document]. research/doj_compliance_status.md

Epstein Files Transparency Act, Pub. L. No. 119-38 (2025).

U.S. Department of Justice. (2026a, January 30). Department of Justice publishes 3.5 million responsive pages in compliance with Epstein Files [Press release]. https://www.justice.gov/opa/pr/department-justice-publishes-35-million-responsive-pages-compliance-epstein-files

U.S. Department of Justice. (2026b, January 30). DAG Todd Blanche press conference transcript. Rev. https://rev.com/transcripts/doj-epstein-files-press-conference
