#NER — Epstein Revealed

EP02: The Pipeline — 1.18 Million Entities and the Missing Third

TLDR Episode 2 of PAPER TRAIL introduces the 16-script processing pipeline that converted 2.1 million raw government documents into structured, searchable data...

March 23, 2026 4 min read

Documentary Production

EP10: 863,000 Emails — The Post-Conviction Network

TLDR Episode 10 of PAPER TRAIL covers the 863,000-email sub-corpus spanning Data Sets 9 and 11 — scanned printouts processed through Relativity litigation...

March 23, 2026 4 min read

Comparisons

DS9 vs. DS11: 863,000 Emails the DOJ Mislabeled

TLDR Data Set 9 (531,256 files) and Data Set 11 (331,655 files) together contain 862,911 email documents (roughly 863,000), the largest sub-corpus. DS11 was publicly...

March 10, 2026 6 min read

Corpus & Data

The 16-Script Pipeline

TLDR A pipeline of 27+ Python scripts transforms 2.1 million raw government documents into a searchable PostgreSQL database with 2.38 million extracted...

March 10, 2026 7 min read

DOJ Datasets

Data Set 9: 531,256 Emails and a Perfect Match

TLDR Data Set 9 is the largest single data set by file count: 531,256 email PDFs occupying 94.5 GB. It achieved a perfect disk-to-database match (zero delta),...

March 10, 2026 7 min read

Documentary Production

Seven Scalable Vector Graphics at 1920x1080: Visualizing the Pipeline

TLDR Seven production Scalable Vector Graphics (SVG) images — a type of image that stays sharp at any size — were created at 1920x1080 resolution with a...

March 10, 2026 6 min read

Machine Intelligence

2.38 Million NER Entities

TLDR Automated name detection software extracted 2,383,898 entities from 2.05 million documents — 57% organizations, 37% persons, 6% locations. Nearly one in...

March 10, 2026 9 min read

Methodology

519K Splink Clusters: Turning 2.38 Million Fragments Into One Namespace

TLDR Probabilistic entity resolution using a statistical matching method reduced 2.38 million raw Named Entity Recognition entries to 519,000 unified clusters...

March 10, 2026 7 min read