DS9 vs. DS11: 863,000 Emails the DOJ Mislabeled | Epstein Revealed

TLDR

Data Set 9 (531,256 files) and Data Set 11 (331,655 files) together hold 862,911 email documents, roughly 863,000, making email the largest sub-corpus in the release. DS11 was publicly described as "financial ledgers and USVI flight manifests," but systematic inspection showed that it actually contains seized email correspondence. The misidentification went uncorrected until the file contents themselves were examined.

The Mislabeled Data Set

When Data Sets 9 through 12 were released on January 30, 2026, the DOJ's descriptions guided initial understanding of what each contained (PAPER TRAIL Project, 2026a). Data Set 9 was identified as email records. Data Set 11 was described by multiple outlets as "financial ledgers, USVI flight manifests, property seizure records." Community indexing sites described it as approximately 180,000 images and 2,000 videos.

None of that was accurate.

Inspection of sample PDFs from Data Set 11 showed that the set consists of 331,655 files of seized email correspondence plus 4 .m4v video files (PAPER TRAIL Project, 2026a). The emails are organized as a Relativity production, the load-file structure used by litigation review software that law firms and government investigators rely on to manage large document collections. The files include DAT/OPT metadata and sequentially numbered pages (a legal numbering system for document tracking, commonly known as "Bates numbering") under the EFTA_R1_ prefix.
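To illustrate how load-file metadata ties page numbers to documents, here is a minimal sketch that groups page rows from an Opticon-style OPT file into documents. It assumes the conventional seven-column layout (page ID, volume, image path, document-break flag, folder break, box break, page count). The article does not show the actual DS11 load files, so the layout, the `summarize_opt` helper, and the `DEMO` sample rows are all illustrative assumptions.

```python
import csv
import io

def summarize_opt(opt_text: str) -> list:
    """Return (first_page_id, page_count) for each document.

    Assumes one row per page and a "Y" in the fourth column on the
    first page of every document (conventional Opticon layout).
    """
    docs = []
    for row in csv.reader(io.StringIO(opt_text)):
        if not row:
            continue  # skip blank lines
        page_id = row[0].strip()
        starts_doc = len(row) > 3 and row[3].strip().upper() == "Y"
        if starts_doc or not docs:
            docs.append([page_id, 0])
        docs[-1][1] += 1  # count this page toward the current document
    return [(first_id, pages) for first_id, pages in docs]

# Invented rows for illustration only; not taken from the corpus.
sample_opt = """DEMO0000001,VOL001,IMAGES/0001/DEMO0000001.tif,Y,,,2
DEMO0000002,VOL001,IMAGES/0001/DEMO0000002.tif,,,,
DEMO0000003,VOL001,IMAGES/0001/DEMO0000003.tif,Y,,,1
"""

print(summarize_opt(sample_opt))
```

Counting rows rather than trusting the page-count column keeps the sketch usable even when that column is left blank.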

The misidentification was not a minor labeling error. It meant that for weeks after release, the second-largest email corpus in the collection was being described as financial records. Researchers looking for email evidence were directed to DS9 alone. The 332,000 additional email documents in DS11 sat unexamined under a wrong label.

What Each Data Set Contains

Data Set 9 holds 531,256 email PDFs totaling 94.5 GB (PAPER TRAIL Project, 2026b). It has a perfect match between files on disk and records in the database — every file is accounted for, and entity extraction processing is complete. It is the largest data set by file count.

Data Set 11 holds 331,655 email PDFs totaling approximately 28 GB (PAPER TRAIL Project, 2026a). Its emails date to at least April-May 2017, placing them in the post-NPA (Non-Prosecution Agreement — the deal that granted federal immunity), post-conviction period when Epstein was a registered sex offender. The email samples document continued operations: scheduling, travel coordination, catering requests, and AmEx Centurion Travel bookings for international associates.

Together, the two data sets contain 862,911 email documents — approximately 41% of the entire 2.1 million file corpus. By volume, the email records are the dominant document type in the collection.
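The totals above can be checked with a few lines of arithmetic. This is a minimal sketch using only the figures stated in the article; the one assumption is that "2.1 million" is treated as exactly 2,100,000 files.

```python
# File counts as stated in the article.
DS9_FILES = 531_256
DS11_FILES = 331_655

# Assumption: "2.1 million" taken as exactly 2,100,000 for illustration.
CORPUS_FILES_APPROX = 2_100_000

combined = DS9_FILES + DS11_FILES
share_of_corpus = combined / CORPUS_FILES_APPROX
ds11_share_of_email = DS11_FILES / combined

print(f"Combined email files: {combined:,}")
print(f"Share of full corpus: {share_of_corpus:.1%}")
print(f"DS11 share of email files: {ds11_share_of_email:.1%}")
```

The combined count is 862,911, about 41% of the approximate corpus size, and DS11 alone accounts for a little over 38% of the email files.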

The Samples That Revealed the Truth

Two sample PDFs from DS11 settled the question (PAPER TRAIL Project, 2026a).

EFTA02212883 (2 pages) contains a Lesley Groff email chain dated May 30, 2017. "jeffrey E." requests "cookies or small cakes. tea" for a noon appointment with "Maxim and his mom." The email footer includes the address jeevacation@gmail.com — Epstein's vacation email, which serves as a searchable anchor across the email corpus.

EFTA02212885 (4 pages) contains travel coordination from April 24, 2017. It shows an American Express Centurion Travel booking for a Russian Federation citizen flying from St. Thomas to JFK on American Airlines flight 936. The recipient asks Lesley Groff for a "new ticket back to Moscow." Russian-language text is present in the email body.

These are not financial ledgers. They are not flight manifests. They are emails, printed to PDF and stamped page by page with sequential tracking numbers through Relativity litigation review software.
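The identifiers of the two samples are consistent with one tracking number per page: a 2-page document starting at EFTA02212883 would be followed by a document starting at EFTA02212885. The sketch below works through that arithmetic. The `next_bates` helper is hypothetical, and it assumes the two samples are adjacent in the production with no gaps in the numbering, which the article does not state outright.

```python
import re

# Split an identifier into a prefix and its trailing run of digits.
BATES_RE = re.compile(r"^(.*?)(\d+)$")

def next_bates(first_page_id: str, page_count: int) -> str:
    """Return the ID expected on the first page of the next document,
    assuming each page consumes exactly one sequential number."""
    match = BATES_RE.match(first_page_id)
    if match is None:
        raise ValueError(f"No trailing digits in {first_page_id!r}")
    prefix, digits = match.groups()
    following = int(digits) + page_count
    # Preserve the original zero-padded width.
    return f"{prefix}{following:0{len(digits)}d}"

print(next_bates("EFTA02212883", 2))  # EFTA02212885
```

A check like this, run across a whole data set, would flag gaps or overlaps in the numbering that deserve a closer look.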

What the Comparison Reveals

The two email data sets share a format and a numbering scheme but differ in scale and density.

Attribute	DS9	DS11
File count	531,256	331,655
Size on disk	94.5 GB	~28 GB
Entity extraction status	Complete	Complete
Format	Relativity PDF	Relativity PDF
Page numbering prefix	EFTA_R1_	EFTA_R1_
Date range	TBD (pending full analysis)	At least 2017
Disk/DB match	Perfect	Perfect

DS9 is larger and denser: its average file size of around 183 KB suggests more content per email (PAPER TRAIL Project, 2026b). DS11 averages around 87 KB per file, consistent with short email printouts of a few pages each. Both use the same page numbering scheme, which suggests they were processed through the same litigation review pipeline, likely at different times or from different email sources.
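The per-file averages follow from dividing total size by file count. A minimal sketch, assuming the stated totals (94.5 GB and roughly 28 GB) are rounded and that it is unknown whether they use decimal or binary units; the `avg_kb_per_file` helper is hypothetical.

```python
def avg_kb_per_file(total_gb: float, file_count: int, binary: bool = False) -> float:
    """Average size per file, in KB (decimal) or KiB (binary)."""
    base = 1024 if binary else 1000
    # GB -> KB is two steps of the base (GB -> MB -> KB).
    return total_gb * base**2 / file_count

# Sizes and counts as stated in the article; 28.0 stands in for "~28 GB".
for name, gb, count in [("DS9", 94.5, 531_256), ("DS11", 28.0, 331_655)]:
    decimal_kb = avg_kb_per_file(gb, count)
    binary_kib = avg_kb_per_file(gb, count, binary=True)
    print(f"{name}: {decimal_kb:.0f} KB (decimal) / {binary_kib:.0f} KiB (binary)")
```

Under either convention the results bracket the quoted averages (roughly 178 to 187 for DS9 and 84 to 89 for DS11), and DS9 files come out about twice the size of DS11 files.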

The combined email corpus of 863,000 documents is the foundation for the Email Network Analysis methodology described in the research framework. Header extraction, thread reconstruction, communication network mapping, temporal burst detection, and content classification all apply to both data sets (PAPER TRAIL Project, 2026c). The misidentification of DS11 meant that for the first weeks after release, nearly 40% of the available email evidence was invisible to researchers who took the published descriptions at face value.
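Header extraction, the first step in that list, also doubles as a quick test of whether a file is an email at all. The sketch below is an illustration rather than the project's actual method: it assumes text has already been extracted from the PDF, and the header list, the `extract_headers` and `looks_like_email` helpers, and both sample strings are invented for the example.

```python
import re

# Header names commonly seen at the top of printed emails (an assumed list).
HEADER_RE = re.compile(
    r"^(From|To|Cc|Bcc|Sent|Date|Subject):[ \t]*(.*)$",
    re.IGNORECASE | re.MULTILINE,
)

def extract_headers(page_text: str) -> dict:
    """Map each header name to the first value found for it."""
    headers = {}
    for match in HEADER_RE.finditer(page_text):
        name = match.group(1).capitalize()
        headers.setdefault(name, match.group(2).strip())
    return headers

def looks_like_email(page_text: str, min_fields: int = 3) -> bool:
    """Heuristic: treat text as an email if enough distinct headers appear."""
    return len(extract_headers(page_text)) >= min_fields

# Invented sample text, not taken from the corpus.
sample_email = """From: Alice Example <alice@example.com>
Sent: Monday, January 1, 2001 9:00 AM
To: Bob Example <bob@example.com>
Subject: Lunch order

Please order sandwiches for noon.
"""
sample_ledger = "Date       Amount\n2001-01-01 100.00\n"

print(extract_headers(sample_email)["Subject"])
print(looks_like_email(sample_email), looks_like_email(sample_ledger))
```

Run over a random sample of files from each data set, a heuristic like this gives an early signal when a published description and the actual contents disagree.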

The Lesson

The DS11 misidentification is a reminder that official descriptions of released documents are not always accurate. Whether the mislabeling was intentional, careless, or simply an artifact of how the DOJ categorized files in bulk, the effect was the same: a third of a million email documents were hidden behind a wrong label. It took systematic inspection — actually opening the files and reading their contents — to discover what DS11 really was.

This is the same principle that applies across the corpus. Labels, descriptions, and metadata are starting points. The documents themselves are the evidence. When the two disagree, the documents win.

References

PAPER TRAIL Project. (2026a). DS11 content resolution, sample emails EFTA02212883 and EFTA02212885 [Observation]. OBSERVATIONS.md, OBS-7.

PAPER TRAIL Project. (2026b). DS9 file count (531,256) and entity extraction status [Project documentation]. CLAUDE.md; Script 04.

PAPER TRAIL Project. (2026c). Email network analysis methodology [Research file]. EMAIL_NETWORK_ANALYSIS.md.

PAPER TRAIL Project. (2026d). Corpus audit: Disk/DB reconciliation [Data set]. _exports/audit/corpus_inventory.csv.
