TLDR
Probabilistic entity resolution using a statistical matching method reduced 2.38 million raw Named Entity Recognition entries to 519,000 unified clusters -- a 78% reduction (PAPER TRAIL Project, 2026a). This created the first single namespace across all twelve DOJ data sets, merging text-extraction variants like eight different spellings of "Nadia Marcinkova" into one identity. It also created false connections that had to be manually dissolved.
The Name Fragmentation Problem
When you run named entity recognition across 2.1 million documents that have been scanned, processed through automated text extraction, and occasionally mangled by aging photocopiers, you do not get clean data. You get 2,383,751 raw entity records: 1.34 million organizations (57%), 911,000 persons (37%), and 129,000 locations (6%) (PAPER TRAIL Project, 2026a). Many of these are duplicates wearing different costumes.
Nadia Marcinkova appears in the corpus as "Nadia Marcinkova," "Nadia Marcinko," "Nadia Marcinkvo," and at least five other degraded variants -- eight in total, scattered across dozens of documents (PAPER TRAIL Project, 2026a). Without entity resolution, each variant is a separate node in the database. Queries about Marcinkova return partial results. Co-occurrence analysis underestimates her connections. The fragmentation is not a minor inconvenience. It is a structural barrier to analysis.
How the Statistical Matching Works
The solution is probabilistic record linkage, a statistical matching method developed by Ivan Fellegi and Alan Sunter in 1969 (Fellegi & Sunter, 1969). The core idea: for each pair of entity records, compute a match weight based on how similar they are across multiple comparison dimensions, then classify the pair as a match, non-match, or uncertain case based on the aggregate weight.
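The weight arithmetic can be sketched in a few lines. The m- and u-probabilities below are invented for illustration (the project's trained values are not published here); the log-likelihood-ratio sum is the standard Fellegi-Sunter formulation.

```python
from math import log2

# Hypothetical parameters for two comparison dimensions (illustration only):
# m = P(agreement | true match), u = P(agreement | non-match).
PARAMS = {
    "name_exact":   {"m": 0.90, "u": 0.001},
    "same_doc_set": {"m": 0.70, "u": 0.10},
}

def match_weight(agreements: dict) -> float:
    """Sum log-likelihood ratios across comparison dimensions.

    Agreement on a dimension contributes log2(m/u); disagreement
    contributes log2((1-m)/(1-u)), which is negative evidence.
    """
    total = 0.0
    for field, agrees in agreements.items():
        m, u = PARAMS[field]["m"], PARAMS[field]["u"]
        total += log2(m / u) if agrees else log2((1 - m) / (1 - u))
    return total

# A pair agreeing on both dimensions accumulates strong positive evidence;
# a pair disagreeing on both accumulates negative evidence.
w_match = match_weight({"name_exact": True, "same_doc_set": True})
w_miss = match_weight({"name_exact": False, "same_doc_set": False})
```

The aggregate weight is then compared against thresholds to classify the pair as a match, non-match, or uncertain case.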
We implemented this using Splink, an open-source Python library that puts the Fellegi-Sunter method into practice at scale (Ministry of Justice Analytical Services, n.d.). Each entity pair is compared across multiple levels of name similarity: exact match, high similarity (Jaro-Winkler score at 0.95 or above), moderate similarity (Jaro-Winkler at 0.88 or above), low edit distance (two or fewer character changes), and everything else (PAPER TRAIL Project, 2026b). Each level contributes a different match weight, and the weights are summed to produce an overall score.
The model's parameters -- the probability that true matches agree at each comparison level and the probability that non-matches agree by coincidence -- are estimated through an iterative training process (PAPER TRAIL Project, 2026b). The coincidence probabilities come from random sampling of 2 million pairs (overwhelmingly non-matches). The true match probabilities come from rules that identify near-certain matches, giving the model reliable examples to learn from.
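The random-sampling half of that training process can be sketched directly. The toy name pool and sample size below are placeholders; the point is that random pairs are almost all non-matches, so the observed agreement rate approximates the coincidence (u) probability for a comparison level.

```python
import random

# Toy name pool standing in for the entity table (illustration only).
NAMES = ["john smith", "john smyth", "nadia marcinkova", "nadia marcinko",
         "acme corp", "acme corporation", "jane doe", "j. doe"]

def estimate_u(agree, n_pairs=2000, seed=7):
    """Estimate P(agreement | non-match) by sampling random pairs.

    Because randomly drawn pairs are overwhelmingly non-matches, the
    fraction that happen to agree at a level approximates that level's
    u-probability.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_pairs):
        a, b = rng.sample(NAMES, 2)  # two distinct records
        hits += agree(a, b)
    return hits / n_pairs

# Coincidence rate for a coarse level: "first name token agrees".
u_first_token = estimate_u(lambda a, b: a.split()[0] == b.split()[0])
```

The true-match (m) probabilities cannot be estimated this way; as described above, they come from deterministic rules that supply near-certain match examples.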
Thresholds That Matter
Not all entities deserve the same match threshold. A person named "John Smith" appearing in two documents is far more likely to be two different people than "Nadia Marcinkova" appearing with slightly different spellings. Organizations, meanwhile, have legitimate variant forms ("Corporation" versus "Corp" versus "Inc.") that should merge aggressively.
The pipeline uses type-specific thresholds: 0.92 for persons (high, because common names create false positives) and 0.85 for organizations (lower, to accommodate legitimate abbreviation variants) (PAPER TRAIL Project, 2026b). These thresholds were calibrated against human review across three confidence bands: high-confidence matches above 0.90, boundary cases between 0.45 and 0.55, and low-confidence pairs below 0.10. The quality gate requires an F1 score above 0.84 across 400 reviewed predictions per pipeline stage (PAPER TRAIL Project, 2026c).
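The threshold and review-band logic reduces to a few lines. The numbers come from the pipeline description above; the helper functions themselves are an illustrative sketch, not project code.

```python
# Type-specific acceptance thresholds from the pipeline description.
THRESHOLDS = {"PERSON": 0.92, "ORG": 0.85}

def classify(entity_type: str, match_probability: float) -> str:
    """Accept a candidate pair only if it clears its type's threshold."""
    return "match" if match_probability >= THRESHOLDS[entity_type] else "non-match"

def review_band(p: float):
    """Route a prediction to the human-review bands used for calibration."""
    if p > 0.90:
        return "high_confidence"
    if 0.45 <= p <= 0.55:
        return "boundary"
    if p < 0.10:
        return "low_confidence"
    return None  # not sampled for review
```

Note the asymmetry: a score of 0.90 merges two organizations but keeps two persons apart.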
The DuckDB Acceleration
The computational challenge is scale. Comparing 2.38 million entities against each other produces trillions of potential pairs. Even with blocking rules -- using substring hashing and fuzzy matching to limit comparisons to plausible candidates -- the comparison space is enormous.
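Blocking itself is simple to illustrate. The toy records and the naive first-four-characters key below are assumptions for the sketch (the real pipeline uses substring hashing and fuzzy keys); the principle is that pairs are only generated within a block, never across the full cross product.

```python
from collections import defaultdict
from itertools import combinations

# Toy entity names standing in for the 2.38M-record table.
ENTITIES = ["nadia marcinkova", "nadia marcinko", "john smith",
            "john smyth", "acme corp", "acme corporation"]

def block_key(name: str) -> str:
    """Cheap blocking key: first four characters of the normalized name."""
    return name.lower().replace(" ", "")[:4]

def candidate_pairs(names):
    """Yield comparison pairs only within blocks, not all-vs-all."""
    blocks = defaultdict(list)
    for n in names:
        blocks[block_key(n)].append(n)
    for members in blocks.values():
        yield from combinations(members, 2)

pairs = list(candidate_pairs(ENTITIES))
# All-vs-all would generate C(6,2) = 15 pairs; blocking leaves 3 candidates.
```

At corpus scale the same idea cuts trillions of potential pairs down to a comparison space that a single machine can score.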
PostgreSQL cannot handle this efficiently using standard table joins. Instead, the pipeline exports entity data to columnar data files and loads them into DuckDB, a high-speed analytical database engine (PAPER TRAIL Project, 2026a). DuckDB's column-based processing handles blocked comparisons orders of magnitude faster than row-oriented PostgreSQL, making it possible to resolve millions of entities on a single machine without cluster computing.
The result: 2,383,751 raw entities collapsed to 519,000 resolved clusters (PAPER TRAIL Project, 2026a). Each cluster receives a unique identifier stored back in PostgreSQL, creating a unified namespace that works across all twelve DOJ data sets.
When Resolution Goes Wrong
Entity resolution can also create false connections. A follow-up quality check discovered eight garbage clusters containing 3,576 entities that had been incorrectly merged (PAPER TRAIL Project, 2026d). The failure mode was subtle: text-extraction garbage matching text-extraction garbage. When a scanner produces the same kind of meaningless character sequence from two different documents, the statistical model sees high similarity between two strings that are both nonsense. The model correctly identifies them as similar -- they are similar, because they are both garbled versions of nothing.
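One defense against this failure mode is a sanity check on the strings themselves, independent of pairwise similarity. The heuristic below is hypothetical (the project's actual garbage-detection rules are not specified here), but it shows the idea: two garbled strings can be highly similar to each other while both failing a character-makeup check.

```python
def looks_garbled(name: str) -> bool:
    """Hypothetical heuristic: flag strings whose character makeup
    suggests text-extraction noise rather than a real name."""
    stripped = name.replace(" ", "")
    if not stripped:
        return True
    alpha = sum(c.isalpha() for c in stripped) / len(stripped)
    vowels = sum(c.lower() in "aeiou" for c in stripped) / len(stripped)
    # Real names are mostly letters with a healthy vowel ratio;
    # scanner noise tends to violate one or both.
    return alpha < 0.7 or vowels < 0.15
```

Running a check like this over cluster members before accepting a merge would catch nonsense-matching-nonsense even when the match weight is high.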
These eight clusters were dissolved, freeing the trapped entities. The episode demonstrates that entity resolution is not a fire-and-forget operation. Post-processing quality control is essential, and the system must be designed to permit surgical corrections without destabilizing the broader resolution.
What 519K Clusters Enable
With a unified entity namespace, every downstream analysis operates on resolved identities rather than text-extraction fragments. Document-based networks connect actual people rather than spelling variants. Cross-dataset queries return complete results. The nine cross-domain leads identified by the synthesis engine depend entirely on entity resolution -- without it, Darren Indyke appearing in wire transfers, corporate registrations, and FedEx records would be three separate database entries with no connection (PAPER TRAIL Project, 2026e).
The singleton rate -- 24.1% of entities appearing in only one document -- remains a known limitation. These 197,945 singletons inflate the species richness estimate and signal either genuine rare entities or text-extraction fragments too degraded for statistical matching to rescue (PAPER TRAIL Project, 2026f). They represent the boundary of what automated resolution can achieve.
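The Chao1 estimator cited above has a standard closed form: observed richness plus a correction driven by singletons (entities seen once) and doubletons (entities seen twice). The cluster and singleton counts below are from the text; the doubleton count is a made-up placeholder, since it is not reported here.

```python
def chao1(s_obs: int, f1: int, f2: int) -> float:
    """Chao1 lower-bound richness estimate.

    s_obs: observed number of distinct entities (clusters)
    f1:    singletons (seen in exactly one document)
    f2:    doubletons (seen in exactly two documents)
    """
    if f2 == 0:
        # Bias-corrected form when no doubletons are observed.
        return s_obs + f1 * (f1 - 1) / 2
    return s_obs + f1 * f1 / (2 * f2)

# Reported counts: 519,000 clusters, 197,945 singletons.
# The doubleton count (90,000) is hypothetical.
estimate = chao1(s_obs=519_000, f1=197_945, f2=90_000)
```

The mechanism behind the "singletons inflate the estimate" point is visible in the formula: the correction term grows with the square of f1, so a large singleton count pushes the estimated number of unseen entities up sharply.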
References
Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64(328), 1183-1210. https://doi.org/10.1080/01621459.1969.10501049
Ministry of Justice Analytical Services. (n.d.). Splink: Probabilistic record linkage at scale [Software]. https://github.com/moj-analytical-services/splink
PAPER TRAIL Project. (2026a). Entity resolution pipeline [Script/Data set]. Script 19, _exports/entity_resolution/
PAPER TRAIL Project. (2026b). Implementation methodology: Splink configuration [Data set]. research/IMPLEMENTATION.md
PAPER TRAIL Project. (2026c). Process architecture: Quality gates [Data set]. research/PROCESS_ARCHITECTURE.md
PAPER TRAIL Project. (2026d). Garbage cluster dissolution [Script]. Script 19b
PAPER TRAIL Project. (2026e). Cross-domain synthesis: Leads queue [Data set]. _exports/synthesis/leads_queue.csv
PAPER TRAIL Project. (2026f). Chao1 completeness estimates [Data set]. _exports/validation/chao1_summary.json