TLDR
Splink, a software tool that uses statistics to decide which database records refer to the same real-world entity, merged eight or more scanning variants of "David Rodgers" (Epstein's pilot) into a single cluster, as part of a run that reduced 2,383,751 raw entity records to 519,000 unified clusters. The process demonstrates both the power and the limits of probabilistic record linkage: the same model that correctly merges real name variants can also incorrectly merge unreadable scanning noise. A cleanup pass was needed to dissolve 8 clusters containing 3,576 falsely matched entities (PAPER TRAIL Project, 2026a; PAPER TRAIL Project, 2026b).
The Name Problem
Automated name extraction across 2.1 million scanned documents produced 2,383,751 entity records, with the same person appearing under multiple spellings. David Rodgers, Epstein's primary pilot, appears in flight logs, legal documents, financial records, and emails, and the handwritten flight logs produce the worst variants. The entity database holds eight or more distinct strings referring to him: "Davld Rodgers," "David Rogers," "David Rodgrs," "D. Rodgers," and others. Without entity resolution, each variant is treated as a separate person.
When you run automated name extraction across 2.1 million documents that have been scanned from paper originals, you do not get clean data. You get 2,383,751 entity records where the same person, organization, or location appears under multiple spellings (PAPER TRAIL Project, 2026c). David Rodgers, Epstein's primary pilot, is instructive. His name appears in handwritten flight logs, typed legal documents, financial records, and email correspondence. Each document type introduces its own corruption patterns.
Handwritten flight logs produce the worst variants. Pilot handwriting is fast and abbreviated. Photocopying degrades it further. Automated text scanning attempts to read the degraded copies and produces outputs like "Davld Rodgers," "David Rogers," "David Rodgrs," "D. Rodgers," and several others — eight or more distinct strings in the entity database, all referring to the same individual (PAPER TRAIL Project, 2026c).
Without entity resolution (the process of determining which records refer to the same real-world entity), each variant is treated as a separate person. Cross-referencing becomes impossible. David Rodgers in a flight log and David Rogers in a legal document appear as two different entities even though they are the same pilot who flew the same aircraft for the same employer.
How the Entity Resolution Tool Works
Splink implements the Fellegi-Sunter model (1969), comparing record pairs and calculating match scores from field-by-field agreement. Blocking partitions the entity space by type and initial character, reducing comparisons from approximately 2.84 trillion pairs to a feasible workload. Within each block, an Expectation-Maximization training technique learns match weights from the data: perfect string matches get high weights, close matches moderate weights, disagreements negative weights. Person entities require higher thresholds than organizations because names like "John Smith" are ambiguous.
Splink implements a well-established statistical model for record linkage called the Fellegi-Sunter model (Fellegi & Sunter, 1969). The model compares pairs of records and calculates a match score based on field-by-field agreement and disagreement patterns.
The process starts with blocking — a step to reduce the number of comparisons to a manageable level. Comparing every entity against every other entity would require checking approximately 2.84 trillion pairs. That is computationally infeasible. Blocking rules partition the entity space into smaller groups — by entity type (person, organization, location) and by initial character — so that only entities within the same group are compared. This reduces the comparison space by orders of magnitude (Linster, 2022).
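The pair-count arithmetic and the blocking rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's code: the `blocking_key` function, the field names `entity_type` and `name`, and the sample records are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import comb

# Unordered pairs without blocking, using the entity count from the text:
# n * (n - 1) / 2, which is roughly 2.84 trillion.
TOTAL_ENTITIES = 2_383_751
all_pairs = comb(TOTAL_ENTITIES, 2)


def blocking_key(record):
    """Hypothetical blocking rule: entity type plus first character of the name."""
    return (record["entity_type"], record["name"][:1].upper())


def candidate_pairs(records):
    """Yield only those pairs of records that share a blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)


# Illustrative records, not rows from the project database.
sample = [
    {"entity_type": "person", "name": "David Rodgers"},
    {"entity_type": "person", "name": "Davld Rodgers"},
    {"entity_type": "person", "name": "D. Rodgers"},
    {"entity_type": "organization", "name": "Southern Trust Company"},
    {"entity_type": "person", "name": "John Smith"},
]
pairs = list(candidate_pairs(sample))
unblocked = comb(len(sample), 2)
```

One consequence of blocking on the initial character is worth noting: a variant whose first character was itself corrupted by scanning would land in a different block and never be compared with the correct spelling.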
Within each group, the model examines field agreement. For person entities, the primary field is the name string. The model calculates match weights using a training technique called Expectation-Maximization: it learns, from the data itself, how likely two records are to match given various levels of name similarity. A perfect string match gets a high weight. A close match (one or two character differences) gets a moderate weight. A complete disagreement gets a negative weight (Linster, 2022).
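To make the weighting concrete: in the Fellegi-Sunter model, the match weight for a comparison level is the base-2 logarithm of m/u, where m is the probability of observing that level among true matches and u is the probability of observing it among non-matches. The sketch below uses invented m and u values purely for illustration; in the pipeline described here they would be estimated by Expectation-Maximization, and the level names are assumptions.

```python
from math import log2

# Hypothetical probabilities for three name-comparison levels.
#   m = P(level observed | the two records truly match)
#   u = P(level observed | the two records do not match)
# These numbers are invented for illustration, not learned from project data.
LEVELS = {
    "exact": {"m": 0.70, "u": 0.001},
    "close": {"m": 0.25, "u": 0.010},
    "different": {"m": 0.05, "u": 0.989},
}


def match_weight(level):
    """Fellegi-Sunter match weight for one comparison level: log2(m / u)."""
    probs = LEVELS[level]
    return log2(probs["m"] / probs["u"])


weights = {level: match_weight(level) for level in LEVELS}
```

With these assumed values the exact level receives the largest positive weight, the close level a smaller positive weight, and the disagreement level a negative weight, matching the ordering described above.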
The cumulative weight determines whether a pair is classified as a match, a non-match, or an uncertain case requiring human review. Thresholds differ by entity type: person entities require higher match scores than organizations because person names are more ambiguous ("John Smith" is common; "Southern Trust Company" is distinctive) (PAPER TRAIL Project, 2026d).
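The three-way decision can be sketched as a function of the cumulative weight and the entity type. The threshold values below are hypothetical placeholders chosen only to show person entities needing a higher score than organizations; the project's actual thresholds are defined in its implementation specification and are not reproduced here.

```python
# Hypothetical (lower, upper) thresholds on the cumulative match weight.
# At or below `lower` is a non-match, at or above `upper` is a match,
# and anything in between goes to human review.
THRESHOLDS = {
    "person": (2.0, 8.0),
    "organization": (1.0, 5.0),
}


def classify(total_weight, entity_type):
    """Classify a record pair as 'match', 'non-match', or 'review'."""
    lower, upper = THRESHOLDS[entity_type]
    if total_weight >= upper:
        return "match"
    if total_weight <= lower:
        return "non-match"
    return "review"


# The same cumulative weight can yield different outcomes per entity type.
person_result = classify(6.0, "person")
organization_result = classify(6.0, "organization")
```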
Why a Specialized Database Is Used for Comparisons
Splink uses DuckDB rather than PostgreSQL for the comparison phase because DuckDB is optimized for billions of row-level comparisons with aggregate scoring. Running entity resolution against 2.38 million entities in PostgreSQL would require comparison tables exceeding available memory. DuckDB handles the workload in streaming fashion, processing groups incrementally. Output writes back to PostgreSQL as cluster assignments: 519,000 clusters from 2,383,751 raw entities, a 78.2% reduction, with each cluster representing one resolved entity.
Splink uses DuckDB (an in-process analytical database) rather than the main PostgreSQL database for the comparison phase. DuckDB is optimized for exactly this type of workload: billions of row-level comparisons with aggregate scoring. Running entity resolution against 2.38 million entities in PostgreSQL would require creating comparison tables that exceed available memory. DuckDB handles the same workload in a streaming fashion, processing groups incrementally (PAPER TRAIL Project, 2026a).
The output is written back to PostgreSQL as cluster assignments on the entities table: 519,000 clusters from 2,383,751 raw entities, a 78.2% reduction. Each cluster represents a resolved entity — one person, one organization, or one location — regardless of how many scanning variants contributed to it (PAPER TRAIL Project, 2026a).
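As a rough sketch of how pairwise match decisions become cluster assignments, the example below groups records by connected components using a small union-find structure, then checks the reduction figure quoted above. Connected-components grouping is an assumption made for illustration (the text does not state how Script 19 forms clusters), and the matched pairs are invented.

```python
def cluster(record_ids, matched_pairs):
    """Assign each record a cluster label via connected components."""
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Walk up to the root, shortening the path along the way.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    return {rid: find(rid) for rid in record_ids}


# Illustrative input: name variants from the text plus one unrelated name.
ids = ["Davld Rodgers", "David Rogers", "David Rodgrs", "D. Rodgers", "John Smith"]
matches = [
    ("Davld Rodgers", "David Rogers"),
    ("David Rogers", "David Rodgrs"),
    ("D. Rodgers", "Davld Rodgers"),
]
assignments = cluster(ids, matches)
n_clusters = len(set(assignments.values()))

# Reduction arithmetic using the counts reported in the text.
RAW_ENTITIES, CLUSTERS = 2_383_751, 519_000
reduction_pct = round((1 - CLUSTERS / RAW_ENTITIES) * 100, 1)
```

Because membership is transitive under this scheme, two variants that never matched each other directly still end up in the same cluster if a chain of matches connects them.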
What Goes Wrong: Noise Matching Noise
The Fellegi-Sunter model assumes inputs are real names with occasional corruption, an assumption that breaks for unreadable scanning output. The entity resolution process produced 8 clusters where scanning garbage matched other scanning garbage. A string like "f5.53't PasigottlIornterronersiner" (from a blank KYC form) has character-level overlap with other garbage strings. Two garbage strings can exceed the match threshold purely through character overlap. These 8 garbage clusters contained 3,576 entities, which Script 19b dissolved.
The statistical model is trained on the assumption that compared records contain real names with occasional corruption. This assumption breaks when the input is unreadable text that the scanning software produced from blurred or damaged pages — what we call scanning garbage (PAPER TRAIL Project, 2026e).
The entity resolution process produced 8 clusters where scanning garbage had been matched to other scanning garbage. A string like "f5.53't PasigottlIornterronersiner" (from a blank KYC form; see PAPER TRAIL Project, 2026e) has character-level similarity to other garbage strings. The model, which knows nothing about whether a string represents a real entity, assigns match weights based purely on character overlap. Two garbage strings from different documents can achieve a match score above threshold simply because they share enough random characters (PAPER TRAIL Project, 2026b).
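The point that a character-level comparison cannot tell a real name from noise can be illustrated with Python's standard-library `difflib`. This is only a stand-in for whatever string comparison the pipeline actually uses, and the second garbage string is invented for the example (only the first appears in the project's observations).

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1]; knows nothing about meaning."""
    return SequenceMatcher(None, a, b).ratio()


# A genuine variant pair from the text.
real_pair = similarity("David Rodgers", "Davld Rodgers")

# The garbage string quoted in the text, compared with a hypothetical
# second garbage string invented here for illustration.
garbage_pair = similarity(
    "f5.53't PasigottlIornterronersiner",
    "f5.58't PasigottlIomterronersiner",
)
```

Both pairs score highly under this measure, so a threshold applied to character similarity alone would merge the garbage pair just as readily as the real one.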
These 8 garbage clusters contained 3,576 entities. A cleanup script (Script 19b) dissolved them by resetting the cluster assignments, freeing the affected entities to remain as isolated records or be re-clustered with improved input data (PAPER TRAIL Project, 2026b).
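Dissolving a cluster by resetting its assignments might look like the sketch below. It uses an in-memory SQLite database as a stand-in for PostgreSQL so the example is self-contained; the table layout, the `cluster_id` column name, the `dissolve_clusters` helper, and the sample rows are assumptions, not the contents of Script 19b.

```python
import sqlite3

# In-memory stand-in for the entities table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT, cluster_id INTEGER)"
)
conn.executemany(
    "INSERT INTO entities (name, cluster_id) VALUES (?, ?)",
    [
        ("David Rodgers", 1),
        ("Davld Rodgers", 1),
        ("f5.53't PasigottlIornterronersiner", 2),
        ("hypothetical garbage string", 2),
    ],
)


def dissolve_clusters(connection, garbage_cluster_ids):
    """Reset cluster assignments for the given clusters; return rows freed."""
    ids = list(garbage_cluster_ids)
    if not ids:
        return 0
    placeholders = ",".join("?" for _ in ids)
    cursor = connection.execute(
        f"UPDATE entities SET cluster_id = NULL WHERE cluster_id IN ({placeholders})",
        ids,
    )
    connection.commit()
    return cursor.rowcount


freed = dissolve_clusters(conn, [2])
kept = conn.execute(
    "SELECT COUNT(*) FROM entities WHERE cluster_id = 1"
).fetchone()[0]
unassigned = conn.execute(
    "SELECT COUNT(*) FROM entities WHERE cluster_id IS NULL"
).fetchone()[0]
```

Setting the assignment to NULL rather than deleting rows keeps the freed entities available as isolated records, which matches the behaviour described above.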
The garbage cluster problem is the entity resolution counterpart to the scanning hallucination problem. Just as automated scanning creates phantom entities from blank forms and repeated page headers, entity resolution can create phantom relationships between those phantom entities. The two failure modes compound: bad scanning produces false entities, and entity resolution connects them into false clusters.
The Quality Target
The framework specifies F1 above 0.84 with human review at Krippendorff's alpha of at least 0.80. The David Rodgers case demonstrates the value: with eight or more variants merged into a single cluster, every flight log entry, legal reference, and corporate filing mentioning the pilot resolves to one entity, producing a unified co-occurrence network. Conversely, garbage cluster dissolution demonstrates the cost of failure — the 3,576 freed entities from 8 dissolved clusters represent potential false connections removed before they could corrupt downstream analysis.
The research framework specifies quality targets for entity resolution: an F1 score (the harmonic mean of precision and recall) above 0.84, with human review conducted through blinded labelling at an inter-rater agreement measure (Krippendorff's alpha) of at least 0.80 (PAPER TRAIL Project, 2026f). These targets ensure that the model's merge decisions are correct at a rate sufficient for downstream analysis: network construction, co-occurrence counting, and temporal correlation all depend on entities being correctly resolved.
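For reference, an F1 check against the 0.84 target can be sketched as follows. The counts are hypothetical and stand for pairwise merge decisions judged against a human-labelled sample; they are not results from the project.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall over pairwise merge decisions."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


F1_TARGET = 0.84  # target stated in the text

# Hypothetical evaluation counts, invented for illustration:
#   true positives  = pairs merged that reviewers agree are the same entity
#   false positives = pairs merged that reviewers say are different entities
#   false negatives = pairs left unmerged that reviewers say are the same
score = f1_score(true_positives=850, false_positives=120, false_negatives=150)
meets_target = score > F1_TARGET
```

False merges such as the garbage clusters count as false positives here, so they lower precision and therefore F1.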
The David Rodgers case demonstrates the value of getting this right. With eight or more variants merged into a single cluster, every flight log entry, legal reference, and corporate filing mentioning the pilot resolves to one entity. His co-occurrence network, his temporal activity pattern, and his relationships to other entities become visible as a unified picture rather than eight fragmented ones.
Conversely, the garbage cluster dissolution demonstrates the cost of getting it wrong. False merges create false connections in the network graph, inflate co-occurrence counts, and inject noise into every downstream analysis. The 3,576 freed entities from the 8 dissolved clusters represent potential false connections that were removed before they could corrupt the analytical output (PAPER TRAIL Project, 2026b).
References
Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64(328), 1183-1210. https://doi.org/10.1080/01621459.1969.10501049
Linster, R. (2022). Splink: Fast, accurate and scalable probabilistic data linkage [Software]. https://github.com/moj-analytical-services/splink
PAPER TRAIL Project. (2026a). Entity resolution (Script 19) [Software]. app/scripts/19_entity_resolution.py
PAPER TRAIL Project. (2026b). Garbage cluster dissolution (Script 19b) [Data]. Referenced in CLAUDE.md
PAPER TRAIL Project. (2026c). Named entity recognition — 2,383,751 entities (Script 04) [Data]. Database: epstein_files
PAPER TRAIL Project. (2026d). Implementation specification — Fellegi-Sunter model, blocking rules, and thresholds [Technical report]. research/IMPLEMENTATION.md
PAPER TRAIL Project. (2026e). Observations — OBS-5 scanning artifacts from blank KYC forms [Data]. OBSERVATIONS.md
PAPER TRAIL Project. (2026f). Calibration specification — entity resolution quality targets [Technical report]. research/CALIBRATION.md