TLDR
Splink, a software tool that uses statistics to decide which database records refer to the same real-world entity, merged eight or more scanning variants of "David Rodgers" (Epstein's pilot) into a single cluster, as part of a run that reduced 2.38 million raw entity records to 519,000 unified clusters. The process demonstrates both the power and the limits of probabilistic record linkage: the same model that correctly merges real name variants can also incorrectly merge unreadable scanning noise, requiring a cleanup pass that dissolved 8 clusters containing 3,576 false matches (PAPER TRAIL Project, 2026a; PAPER TRAIL Project, 2026b).
The Name Problem
When you run automated name extraction across 2.1 million documents that have been scanned from paper originals, you do not get clean data. You get 2,383,751 entity records where the same person, organization, or location appears under multiple spellings (PAPER TRAIL Project, 2026c). David Rodgers, Epstein's primary pilot, is instructive. His name appears in handwritten flight logs, typed legal documents, financial records, and email correspondence. Each document type introduces its own corruption patterns.
Handwritten flight logs produce the worst variants. Pilot handwriting is fast and abbreviated. Photocopying degrades it further. Automated text scanning attempts to read the degraded copies and produces outputs like "Davld Rodgers," "David Rogers," "David Rodgrs," "D. Rodgers," and several others — eight or more distinct strings in the entity database, all referring to the same individual (PAPER TRAIL Project, 2026c).
Without entity resolution (the process of determining which records refer to the same real-world entity), each variant is treated as a separate person. Cross-referencing becomes impossible. David Rodgers in a flight log and David Rogers in a legal document appear as two different entities even though they are the same pilot who flew the same aircraft for the same employer.
How the Entity Resolution Tool Works
Splink implements a well-established statistical model for record linkage called the Fellegi-Sunter model (Fellegi & Sunter, 1969). The model compares pairs of records and calculates a match score based on field-by-field agreement and disagreement patterns.
The process starts with blocking — a step to reduce the number of comparisons to a manageable level. Comparing every entity against every other entity would require checking approximately 2.84 trillion pairs. That is computationally infeasible. Blocking rules partition the entity space into smaller groups — by entity type (person, organization, location) and by initial character — so that only entities within the same group are compared. This reduces the comparison space by orders of magnitude (Linster, 2022).
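The effect of blocking can be sketched in a few lines. The records, field names, and blocking key below are invented for illustration; the project's actual blocking rules (by entity type and initial character) live in the implementation specification, and Splink expresses them as SQL conditions rather than Python.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records; the field names are illustrative, not the project's schema.
entities = [
    {"id": 1, "type": "person", "name": "David Rodgers"},
    {"id": 2, "type": "person", "name": "Davld Rodgers"},
    {"id": 3, "type": "person", "name": "Maria Chen"},
    {"id": 4, "type": "org", "name": "Southern Trust Company"},
    {"id": 5, "type": "org", "name": "Southern Trust Co"},
]

def blocked_pairs(records):
    """Yield only pairs sharing (entity type, first letter of name)."""
    blocks = defaultdict(list)
    for r in records:
        blocks[(r["type"], r["name"][0].upper())].append(r)
    for group in blocks.values():
        yield from combinations(group, 2)

all_pairs = len(list(combinations(entities, 2)))  # 10 pairs without blocking
kept = list(blocked_pairs(entities))              # only within-block pairs survive
```

Here 5 records produce 10 unrestricted pairs but only 2 blocked pairs; at 2.38 million records the same principle turns 2.84 trillion pairs into a tractable number.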
Within each group, the model examines field agreement. For person entities, the primary field is the name string. The model calculates match weights using a training technique called Expectation-Maximization: it learns, from the data itself, how likely two records are to match given various levels of name similarity. A perfect string match gets a high weight. A close match (one or two character differences) gets a moderate weight. A complete disagreement gets a negative weight (Linster, 2022).
The cumulative weight determines whether a pair is classified as a match, a non-match, or an uncertain case requiring human review. Thresholds differ by entity type: person entities require higher match scores than organizations because person names are more ambiguous ("John Smith" is common; "Southern Trust Company" is distinctive) (PAPER TRAIL Project, 2026d).
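A minimal sketch of the Fellegi-Sunter scoring step follows. The m- and u-probabilities, similarity cutoffs, and the 4.0 threshold are hand-picked assumptions (in Splink the probabilities are learned by Expectation-Maximization, and string similarity is computed in SQL), but the weight formula — the log2 Bayes factor log2(m/u) — is the model's standard form (Fellegi & Sunter, 1969).

```python
import math
from difflib import SequenceMatcher

# Illustrative per-level probabilities; in the real pipeline these are learned
# by Expectation-Maximization, here they are hand-picked assumptions.
LEVELS = [
    # (min similarity ratio, m = P(level | match), u = P(level | non-match))
    (1.00, 0.70, 0.001),  # exact string match -> large positive weight
    (0.85, 0.25, 0.010),  # near match (a character or two off) -> moderate weight
    (0.00, 0.05, 0.989),  # everything else -> negative weight
]

def match_weight(name_a, name_b):
    """Log2 Bayes factor for the name field, per Fellegi-Sunter."""
    sim = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    for threshold, m, u in LEVELS:
        if sim >= threshold:
            return math.log2(m / u)

def classify(weight, threshold):
    """Entity-type-specific threshold: persons would use a higher value than orgs."""
    return "match" if weight >= threshold else "non-match"

w_close = match_weight("David Rodgers", "Davld Rodgers")  # near-match level
w_far = match_weight("David Rodgers", "Southern Trust")   # disagreement level
```

With these numbers, "Davld Rodgers" scores roughly log2(0.25/0.01) ≈ +4.6 against "David Rodgers", while an unrelated string scores about -4.3, so a person-level threshold of 4.0 keeps the first and rejects the second.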
Why a Specialized Database Is Used for Comparisons
Splink uses DuckDB (an in-process analytical database) rather than the main PostgreSQL database for the comparison phase. DuckDB is optimized for exactly this type of workload: billions of row-level comparisons with aggregate scoring. Running entity resolution against 2.38 million entities in PostgreSQL would require creating comparison tables that exceed available memory. DuckDB handles the same workload in a streaming fashion, processing groups incrementally (PAPER TRAIL Project, 2026a).
The output is written back to PostgreSQL as cluster assignments on the entities table: 519,000 clusters from 2,383,751 raw entities, a 78.2% reduction. Each cluster represents a resolved entity — one person, one organization, or one location — regardless of how many scanning variants contributed to it (PAPER TRAIL Project, 2026a).
What Goes Wrong: Noise Matching Noise
The statistical model is trained on the assumption that compared records contain real names with occasional corruption. This assumption breaks when the input is unreadable text that the scanning software produced from blurred or damaged pages — what we call scanning garbage (PAPER TRAIL Project, 2026e).
The entity resolution process produced 8 clusters where scanning garbage had been matched to other scanning garbage. A string like "f5.53't PasigottlIornterronersiner" (from a blank KYC form; see PAPER TRAIL Project, 2026e) has character-level similarity to other garbage strings. The model, which knows nothing about whether a string represents a real entity, assigns match weights based purely on character overlap. Two garbage strings from different documents can achieve a match score above threshold simply because they share enough random characters (PAPER TRAIL Project, 2026b).
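The character-overlap failure is easy to reproduce. The first string below is the documented garbage example; the second is an invented near-copy of the kind a second blurred page might yield. Neither names a real entity, yet a purely character-level comparison scores them as a strong match.

```python
from difflib import SequenceMatcher

# One documented garbage string and one invented variant (assumption, for
# illustration): neither refers to a real entity, but they overlap heavily.
garbage_a = "f5.53't PasigottlIornterronersiner"
garbage_b = "f5.53'f PasigottlIomterronersiner"

ratio = SequenceMatcher(None, garbage_a, garbage_b).ratio()  # > 0.9
```

A similarity above 0.9 would clear the "near match" level in any character-based model, which is exactly how garbage clusters form: the model has no notion of whether a string is a name at all.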
These 8 garbage clusters contained 3,576 entities. A cleanup script (Script 19b) dissolved them by resetting the cluster assignments, freeing the affected entities to remain as isolated records or be re-clustered with improved input data (PAPER TRAIL Project, 2026b).
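The dissolution logic can be simulated in miniature. The dict-based table and the singleton naming below are assumptions for illustration; the real Script 19b resets cluster assignments on the PostgreSQL entities table.

```python
# Hypothetical cluster table as {entity_id: cluster_id}; the real cleanup
# operates on the database, not an in-memory dict.
assignments = {101: "c1", 102: "c1", 103: "c2", 104: "c2", 105: "c3"}
garbage_clusters = {"c2"}  # clusters flagged as garbage-on-garbage matches

def dissolve(assignments, garbage_clusters):
    """Reset members of garbage clusters to singletons; leave good clusters intact."""
    return {
        eid: (f"singleton-{eid}" if cid in garbage_clusters else cid)
        for eid, cid in assignments.items()
    }

cleaned = dissolve(assignments, garbage_clusters)
```

Freed entities become singletons rather than being deleted, so they remain available for re-clustering against improved input data.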
The garbage cluster problem is the entity resolution counterpart to the scanning hallucination problem. Just as automated scanning creates phantom entities from blank forms and repeated page headers, entity resolution can create phantom relationships between those phantom entities. The two failure modes compound: bad scanning produces false entities, and entity resolution connects them into false clusters.
The Quality Target
The research framework specifies quality targets for entity resolution: an accuracy score (F1) above 0.84, with human review conducted through blinded labelling at a consistency measure (Krippendorff's alpha) of at least 0.80 (PAPER TRAIL Project, 2026f). These targets ensure that the model's merge decisions are correct at a rate sufficient for downstream analysis — network construction, co-occurrence counting, and temporal correlation all depend on entities being correctly resolved.
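For concreteness, F1 is the harmonic mean of precision and recall over the model's pairwise merge decisions. The counts below are invented review-sample numbers, not project results; they only show how the 0.84 target is evaluated.

```python
# Hypothetical labelled review sample (numbers invented for illustration).
tp, fp, fn = 430, 60, 80  # correct merges, false merges, missed merges

precision = tp / (tp + fp)  # share of proposed merges that are correct
recall = tp / (tp + fn)     # share of true merges the model found
f1 = 2 * precision * recall / (precision + recall)
```

With these counts, precision ≈ 0.878 and recall ≈ 0.843 give F1 ≈ 0.86, which would clear the > 0.84 target.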
The David Rodgers case demonstrates the value of getting this right. With eight or more variants merged into a single cluster, every flight log entry, legal reference, and corporate filing mentioning the pilot resolves to one entity. His co-occurrence network, his temporal activity pattern, and his relationships to other entities become visible as a unified picture rather than eight fragmented ones.
Conversely, the garbage cluster dissolution demonstrates the cost of getting it wrong. False merges create false connections in the network graph, inflate co-occurrence counts, and inject noise into every downstream analysis. The 3,576 freed entities from the 8 dissolved clusters represent potential false connections that were removed before they could corrupt the analytical output (PAPER TRAIL Project, 2026b).
References
Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64(328), 1183-1210. https://doi.org/10.1080/01621459.1969.10501049
Linster, R. (2022). Splink: Fast, accurate and scalable probabilistic data linkage [Software]. https://github.com/moj-analytical-services/splink
PAPER TRAIL Project. (2026a). Entity resolution (Script 19) [Software]. app/scripts/19_entity_resolution.py
PAPER TRAIL Project. (2026b). Garbage cluster dissolution (Script 19b) [Data]. Referenced in CLAUDE.md
PAPER TRAIL Project. (2026c). Named entity recognition — 2,383,751 entities (Script 04) [Data]. Database: epstein_files
PAPER TRAIL Project. (2026d). Implementation specification — Fellegi-Sunter model, blocking rules, and thresholds [Technical report]. research/IMPLEMENTATION.md
PAPER TRAIL Project. (2026e). Observations — OBS-5 scanning artifacts from blank KYC forms [Data]. OBSERVATIONS.md
PAPER TRAIL Project. (2026f). Calibration specification — entity resolution quality targets [Technical report]. research/CALIBRATION.md