TLDR
Episode 10 of PAPER TRAIL covers the 863,000-email sub-corpus spanning Data Sets 9 and 11: scanned printouts processed through Relativity litigation software. The emails extend the operational timeline to 2017 and show that Epstein's network continued functioning after his 2008 conviction. NER coverage of this sub-corpus rose from 0% (EP02) to 100%, adding 19.2 million entity mentions from 1,048,234 documents (PAPER TRAIL Project, 2026a).
What the Episode Covers
EP10 returns to the largest blind spot identified in EP02: Data Sets 9 and 11 sitting at 0% NER coverage. DS9 contains 531,256 email PDFs (94.5 GB). DS11, originally described by the DOJ as "financial ledgers and USVI flight manifests," actually contains 331,655 PDFs of seized email correspondence, a misidentification resolved as OBS-7. Together the two sets hold 862,911 PDFs (the roughly 863,000 emails cited above), making them the largest sub-corpus by document count (PAPER TRAIL Project, 2026a).
These are not native email files. They are scanned printouts — physical pages that were printed, collected during search warrants, and later digitized. Each PDF is an island: no Message-ID headers, no In-Reply-To fields, no threading metadata. The Relativity load file format (Bates-stamped EFTA_R1_xxxxxxxx identifiers) confirms they were processed through litigation review software before release, meaning someone — the FBI, prosecutors, or a contractor — reviewed them before they became public (PAPER TRAIL Project, 2026b).
Thread Reconstruction Without Headers
The episode's technical centerpiece is the thread reconstruction methodology. Without native email headers, the pipeline uses four heuristic dimensions to rebuild conversation trees from unlinked documents. First, temporal sequence: a reply must postdate the message it replies to (hard constraint). Second, subject normalization: strip Re:, Fwd:, FW:, AW: prefixes and match on the base subject line. Third, participant intersection: Jaccard similarity of sender/recipient arrays measures whether the same people are involved. Fourth, body text quotation: TF-IDF cosine similarity (tau = 0.85) detects quoted text in replies, with the high threshold accounting for OCR degradation of quoted passages (PAPER TRAIL Project, 2026c).
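The four heuristics above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's pipeline: the record fields (`id`, `date`, `subject`, `people`, `body`), the function names, and the rule for combining the signals (hard filter on time and subject, then rank by body similarity and participant overlap) are all hypothetical. The TF-IDF weighting is a hand-rolled, smoothed variant computed only over the candidate pool. Only the prefix list and tau = 0.85 come from the description above.

```python
import math
import re
from collections import Counter

# One or more stacked reply/forward prefixes (Re:, Fwd:, FW:, AW:) at the start.
PREFIX_RE = re.compile(r"^\s*(?:(?:re|fwd|fw|aw)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject):
    """Strip reply/forward prefixes, collapse whitespace, lowercase."""
    return " ".join(PREFIX_RE.sub("", subject).split()).lower()


def jaccard(a, b):
    """Jaccard similarity of two participant collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tfidf_vectors(docs):
    """One sparse TF-IDF dict per document, using a smoothed IDF (assumption)."""
    tokenized = [re.findall(r"[a-z0-9]+", d.lower()) for d in docs]
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenized]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def candidate_parents(reply, corpus, tau=0.85):
    """Rank earlier, same-subject documents as possible parents of `reply`."""
    base = normalize_subject(reply["subject"])
    pool = [d for d in corpus
            if d["date"] < reply["date"]                 # hard temporal constraint
            and normalize_subject(d["subject"]) == base]  # subject match
    vectors = tfidf_vectors([reply["body"]] + [d["body"] for d in pool])
    scored = []
    for doc, vec in zip(pool, vectors[1:]):
        sim = cosine(vectors[0], vec)
        scored.append({
            "id": doc["id"],
            "participants": jaccard(reply["people"], doc["people"]),
            "body_sim": sim,
            "quoted": sim >= tau,  # tau = 0.85 as stated in the methodology
        })
    # Hypothetical ranking rule: body similarity first, then participant overlap.
    return sorted(scored, key=lambda s: (s["body_sim"], s["participants"]),
                  reverse=True)
```

Filtering before scoring keeps the expensive text comparison confined to documents that already satisfy the hard temporal constraint and share a base subject. At the scale of 863,000 documents, a real run would presumably compute IDF over the whole corpus and index by normalized subject instead of scanning a list.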
The methodology is designed but not yet executed at full scale across all 863,000 emails. The episode is transparent about this distinction: NER entity extraction is complete, but structured thread reconstruction and communication network analysis remain pipeline steps to be run.
The Communication Network Design
EP10 presents the framework for converting the sender and recipient fields extracted from each email into a weighted directed graph. Four social network analysis metrics are defined: degree centrality (communication volume), betweenness centrality (information flow control), Burt's constraint (brokerage capacity, where lower constraint means more opportunity to broker between otherwise unconnected contacts), and Louvain community detection (functional sub-groups). The episode predicts actor patterns: Lesley Groff as administrative hub (high degree, high constraint), Darren Indyke as strategic gatekeeper (high betweenness, low constraint), Richard Kahn as financial gatekeeper, and Epstein as central node (PAPER TRAIL Project, 2026c).
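Two of the four metrics can be sketched with the standard library alone. The code below is an illustrative sketch, not the project's implementation: the `(sender, recipients)` input shape and the function names are assumptions, and edge weight is taken to be a simple message count. Constraint follows Burt's standard definition, c_i = sum over contacts j of (p_ij + sum over q of p_iq * p_qj)^2, where p_ij is the share of i's total tie weight (both directions) that involves j. Betweenness centrality and Louvain community detection are omitted for brevity.

```python
from collections import defaultdict


def build_graph(messages):
    """Weighted directed graph: weights[sender][recipient] = message count."""
    weights = defaultdict(lambda: defaultdict(int))
    for sender, recipients in messages:
        for recipient in recipients:
            if recipient != sender:  # ignore self-addressed copies
                weights[sender][recipient] += 1
    return weights


def weighted_degree(weights):
    """Messages sent plus messages received, per node."""
    degree = defaultdict(int)
    for sender, targets in weights.items():
        for recipient, w in targets.items():
            degree[sender] += w
            degree[recipient] += w
    return dict(degree)


def burt_constraint(weights):
    """Burt's constraint per node, computed on mutual (in + out) tie weights."""
    mutual = defaultdict(lambda: defaultdict(float))
    for sender, targets in weights.items():
        for recipient, w in targets.items():
            mutual[sender][recipient] += w
            mutual[recipient][sender] += w

    def p(i, j):
        # Share of i's total tie weight that is invested in j.
        total = sum(mutual[i].values())
        return mutual[i].get(j, 0.0) / total if total else 0.0

    result = {}
    for i in list(mutual):
        total = 0.0
        for j in mutual[i]:
            # Indirect investment in j via i's other contacts q.
            indirect = sum(p(i, q) * p(q, j) for q in mutual[i] if q != j)
            total += (p(i, j) + indirect) ** 2
        result[i] = total
    return result
```

In a star-shaped toy graph, where one node writes to three contacts who never write to each other, the center gets a low constraint and each outer node gets a constraint of 1.0. That is the sense in which low constraint signals brokerage. Whether any actual actor in the corpus matches either pattern is a question for the executed analysis, not for this sketch.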
These are predictions stated before results. They belong to the methodology design phase that EP12 later frames as essential to the Daubert standard. The actual network structure will confirm, refine, or contradict them.
Why This Episode Matters
EP10 closes the largest data gap in the series: email NER coverage for DS9 and DS11 goes from 0% to 100%. The jeevacation@gmail.com handle anchors Epstein's personal communications. The post-conviction timeline (extending to 2017) documents continued network operations during the period when public attention had faded. The thread reconstruction methodology, even in its designed but unexecuted state, establishes how scanned printouts can be converted back into conversation structures, a technique applicable to any large-scale litigation document production.
References
PAPER TRAIL Project. (2026a). EP10 slide content: Email corpus analysis, NER completion, DS9/DS11 composition [Presentation]. communications/ep10_slides/
PAPER TRAIL Project. (2026b). DS11 content resolution (OBS-7) [Observation log]. OBSERVATIONS.md
PAPER TRAIL Project. (2026c). Email network analysis methodology: Thread reconstruction and communication network design [Research document]. research/EMAIL_NETWORK_ANALYSIS.md
This research is sponsored by Subthesis.