Data Set 9: 531,256 Emails and a Perfect Match

TLDR

Data Set 9 is the largest single data set by file count: 531,256 email PDFs occupying 94.5 GB. It achieved a perfect disk-to-database match (zero delta), complete Named Entity Recognition (NER) processing, and — combined with Data Set 11 — forms an 863,000-document email corpus that is the largest sub-corpus in the Epstein collection.

The Largest Single Data Set

When DOJ published Data Sets 9 through 12 on January 30, 2026, Data Set 9 immediately stood out by volume (U.S. Department of Justice [DOJ], 2026). Its 531,256 files make it the single largest data set by count. At 94.5 GB, it is the second-largest by disk size, trailing only the House Oversight materials.

The contents are email correspondence — Epstein-related emails rendered as individual PDF files, one page per document. DOJ described the set as containing email evidence including Epstein correspondence with high-profile individuals and internal DOJ correspondence regarding the 2008 Non-Prosecution Agreement (an agreement in which prosecutors decline to file charges in exchange for certain conditions). DOJ also acknowledged that DS9 is "known incomplete at source level," a caveat that signals even the agency releasing the files recognizes gaps in its own production (PAPER TRAIL Project, 2026a).

Perfect Integrity

In a corpus of 2.1 million documents assembled from multiple agencies and legal proceedings, data integrity is a constant concern. Every processing step — metadata ingestion, scanning, entity extraction — depends on the assumption that what is on disk matches what is in the database.

Data Set 9 provides the cleanest integrity signal in the entire corpus. The corpus audit system found exactly 531,256 files on disk and exactly 531,256 records in the PostgreSQL database (PAPER TRAIL Project, 2026b). The delta is zero. No missing files, no orphaned records, no phantom entries. For a data set approaching 100 GB, this level of alignment is unusual and valuable. It means every analytical result derived from DS9 rests on verified file-to-record correspondence.
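As an illustration of what a zero-delta check involves, the following is a minimal sketch, not the project's audit script. The function names and the in-memory list standing in for database records are assumptions made for the example; a real audit would read the record list from PostgreSQL. Equal counts alone do not prove a match, so the sketch also compares the two sets of names in both directions.

```python
from pathlib import Path
import tempfile


def list_pdfs(root):
    """Return the relative paths of all PDF files under root, recursively."""
    root = Path(root)
    return [p.relative_to(root).as_posix() for p in root.rglob("*.pdf")]


def reconcile(disk_names, db_names):
    """Compare file names found on disk with file names recorded in the database."""
    disk, db = set(disk_names), set(db_names)
    return {
        "on_disk": len(disk),
        "in_db": len(db),
        "delta": len(disk) - len(db),
        "missing_from_db": sorted(disk - db),   # files with no database record
        "orphaned_records": sorted(db - disk),  # records with no file on disk
    }


# Hypothetical demo: two placeholder files and two matching "database" records.
with tempfile.TemporaryDirectory() as root:
    for name in ("0001.pdf", "0002.pdf"):
        (Path(root) / name).touch()
    report = reconcile(list_pdfs(root), ["0001.pdf", "0002.pdf"])
    print(report["delta"], report["missing_from_db"], report["orphaned_records"])
```

A perfect match in this sketch means a delta of zero and two empty lists: nothing on disk lacks a record, and no record lacks a file.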

Relativity Load Files

The emails were not dumped as raw .eml files. They were processed through Relativity, a software platform widely used by legal teams to organize large document collections for review (a process called "litigation review" or "e-discovery"). DS9 uses the DAT/OPT load file format: plain-text files that carry document metadata (DAT) and page-to-file cross-references (OPT) into legal document review software. The pages are numbered sequentially, meaning every email went through a structured legal review workflow before being rendered to PDF and released (PAPER TRAIL Project, 2026c).

This matters for two reasons. First, it means someone — likely a DOJ contractor or internal review team — made page-level decisions about what to include, exclude, and redact in the email corpus. The Relativity processing is a filter between the raw seized emails and what the public received. Second, the sequential numbering creates an identifier space where gaps in the sequence can indicate withheld documents, though they can also reflect standard legal review decisions about relevance and legal privilege (the right to withhold certain communications between attorneys and clients).
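The gap reasoning above can be sketched in a few lines. This is an illustrative example only: the "DOC-" prefix and the sample identifiers are hypothetical and are not taken from the DS9 load files. A gap found this way shows only that a number is absent from the release, not why it is absent.

```python
import re


def find_gaps(identifiers):
    """Return (first_missing, last_missing) ranges in a numbered sequence.

    Each identifier is expected to end in digits, e.g. "DOC-0000123".
    Identifiers without trailing digits are ignored.
    """
    numbers = set()
    for ident in identifiers:
        match = re.search(r"(\d+)$", ident)
        if match:
            numbers.add(int(match.group(1)))
    ordered = sorted(numbers)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps


# Hypothetical page identifiers with two holes in the sequence.
pages = ["DOC-0000001", "DOC-0000002", "DOC-0000005", "DOC-0000006", "DOC-0000009"]
print(find_gaps(pages))  # [(3, 4), (7, 8)]
```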

NER Processing Complete

Named Entity Recognition (NER), the automated process of identifying and categorizing names of people, organizations, and locations in text, is now complete for DS9 as part of the pipeline's processing of 2,046,260 documents (PAPER TRAIL Project, 2026d). The large English model from the spaCy natural language processing library extracted person names, organization names, and location references from every email PDF.

The NER results feed into the entity resolution pipeline (which links different mentions of the same person or organization), the co-occurrence graph (which maps who appears alongside whom in documents), the temporal analysis (which detects patterns over time), and ultimately the cross-domain synthesis. DS9's complete NER coverage means the email corpus is fully integrated into the analytical framework — every entity extracted from an email can be linked to the same entity appearing in financial records, flight logs, or FedEx shipments.
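To make the co-occurrence idea concrete, here is a minimal sketch that counts how often two entities appear in the same document. It is not the project's graph code. The document IDs and entity names are placeholders, and the sketch assumes entity resolution has already collapsed name variants into one label per entity.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(doc_entities):
    """Count, for each pair of entities, the number of documents naming both."""
    counts = Counter()
    for entities in doc_entities.values():
        # Deduplicate within a document and sort so each pair has one canonical order.
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical NER output: document ID -> entities found in that document.
docs = {
    "email_001": ["Person A", "Org X", "Person B"],
    "email_002": ["Person A", "Person B"],
    "email_003": ["Org X"],
}
pairs = cooccurrence(docs)
print(pairs.most_common(1))  # [(('Person A', 'Person B'), 2)]
```

Each counted pair corresponds to a weighted edge in a graph. Counting a pair once per document, rather than once per mention, keeps one long email from dominating the weights.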

The 863K Email Corpus

DS9 does not exist in isolation. Combined with Data Set 11's 331,655 email PDFs, it forms an email corpus of 862,911 documents (roughly 863,000), the single largest analytical domain in the project (PAPER TRAIL Project, 2026d). This combined corpus contains Epstein's personal correspondence (including emails from jeevacation@gmail.com), travel coordination, scheduling, financial discussion, and legal communication spanning years of operations.

The email corpus is where operational patterns live. Financial records show money moving. Flight logs show people moving. But emails show decision-making: who authorized what, who knew what, and when they knew it. The 863,000 email documents are not just the largest sub-corpus in the collection; they are potentially the most revealing one for understanding how the network functioned day to day.

What "Known Incomplete" Means

DOJ's acknowledgment that DS9 is "known incomplete at source level" is worth dwelling on (PAPER TRAIL Project, 2026a). The agency that released the files is stating, on the record, that the email collection it published does not represent the full set of responsive emails in its possession.

This is not speculation about what might be missing. It is the government's own disclosure that material is missing. The question is how much. Figures from Congress offer context for the wider record: Senator Wyden's Treasury investigation documented 4,725 wire transfers (Wyden, 2025), and Representative Raskin cited 200,000 withheld pages (Raskin, 2026). The gap between the 531,256 emails released and the total number of seized Epstein emails is unknown, but DOJ itself has told us that a gap exists.

Data Set 9 is the best-verified, most complete data set in the corpus by every technical measure. It is also, by the government's own admission, incomplete. That paradox runs through every layer of this project.

References

U.S. Department of Justice. (2026). Epstein files library: Data Sets 9-12. Published January 30, 2026. justice.gov/epstein.

PAPER TRAIL Project. (2026a). DOJ release index. [Data analysis: research/doj_release_index.md].

PAPER TRAIL Project. (2026b). Corpus audit. [Script: app/scripts/27_corpus_audit.py; Export: _exports/audit/corpus_inventory.csv].

PAPER TRAIL Project. (2026c). Observation log: OBS-7 (Relativity load file format). [Data analysis: OBSERVATIONS.md].

PAPER TRAIL Project. (2026d). NER processing status. [Script: app/scripts/04_extract_entities.py].

Wyden, R. (2025). Treasury investigation into Epstein financial accounts. U.S. Senate Finance Committee.

Raskin, J. (2026, January 31). Letter to DAG Todd Blanche regarding Epstein document production. democrats-judiciary.house.gov.