TLDR
Automated name detection software extracted 2,383,898 entities from 2.05 million documents — 57% organizations, 37% persons, 6% locations. Nearly one in four entities appears in only one document, and known duplicates like Nadia Marcinkova's 8 OCR variants demonstrate the scale of fragmentation that probabilistic entity resolution must address.
What Automated Name Detection Does
Named entity recognition, or NER, is the process of reading text and identifying the names within it: which strings are people, which are organizations, which are locations. The pipeline uses spaCy (an open-source language processing tool) with its large English model, which was trained on millions of annotated English sentences (Honnibal & Montani, 2017). The model does not understand what it reads. It recognizes patterns — capitalization, context words like "Mr." or "Inc.," position within sentences — and assigns labels.
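The kind of surface evidence involved can be sketched with a toy labeler. This is emphatically not spaCy's statistical model, which learns such patterns from millions of annotated sentences; it is a hand-written illustration of the cues (titles, capitalization runs, corporate suffixes) that the labels rest on.

```python
# Toy illustration of the surface cues an NER model exploits:
# capitalization, person titles like "Mr.", corporate suffixes like
# "Inc." NOT spaCy's model -- just the kind of evidence it learns.

PERSON_TITLES = {"Mr.", "Ms.", "Mrs.", "Dr."}
ORG_SUFFIXES = {"Inc", "LLC", "Ltd", "Corp"}

def toy_ner(text):
    """Label capitalized token runs as PERSON or ORG from local cues."""
    tokens = text.split()
    entities = []
    i = 0
    while i < len(tokens):
        if tokens[i] in PERSON_TITLES and i + 1 < len(tokens):
            # A title signals that the following capitalized run is a name.
            j = i + 1
            run = []
            while j < len(tokens) and tokens[j][0].isupper():
                run.append(tokens[j].strip(",."))
                j += 1
            if run:
                entities.append((" ".join(run), "PERSON"))
            i = max(j, i + 1)
        elif tokens[i][0].isupper():
            # A corporate suffix at the end of a capitalized run marks an ORG.
            j = i
            run = []
            while j < len(tokens) and tokens[j][0].isupper():
                run.append(tokens[j].strip(","))
                j += 1
            if run and run[-1].strip(".") in ORG_SUFFIXES:
                entities.append((" ".join(run), "ORG"))
            i = j
        else:
            i += 1
    return entities
```

A real model generalizes far beyond rules like these, which is exactly why it also mislabels OCR fragments that happen to look name-shaped.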
Applied to 2,046,260 documents using 8 parallel workers, the model extracted 2,383,898 entities (PAPER TRAIL Project, 2026a). This is 97.4% coverage of the 2,100,266 files in the corpus. The remaining 2.6% includes 53,000 House Oversight images that failed visual processing and 16 FBI Vault documents (PAPER TRAIL Project, 2026a).
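The 8-worker fan-out can be sketched with the standard library. The `extract_entities` function below is a hypothetical stand-in for the real per-document NER call in `04_extract_entities.py`, not the project's actual worker.

```python
from multiprocessing import Pool

# Sketch of the fan-out pattern: N worker processes, each running the
# extraction over a share of the corpus. `extract_entities` is a
# placeholder for the real per-document spaCy call.

def extract_entities(doc_text):
    # Placeholder: the real worker returns (text, label) pairs.
    return [tok for tok in doc_text.split() if tok[0].isupper()]

def run_pipeline(documents, workers=8):
    with Pool(processes=workers) as pool:
        # One result list per document, in input order.
        return pool.map(extract_entities, documents)
```

Because each document is independent, the work parallelizes cleanly; the only shared state is the model each worker loads once at startup.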
The Distribution
The entity type breakdown reveals the corpus's character. Organizations dominate at approximately 1,342,000 entities (57%). Persons follow at approximately 911,000 (37%). Locations are the smallest category at approximately 129,000 (6%). The organization-to-person ratio is 1.47 to 1 (PAPER TRAIL Project, 2026b).
This distribution makes sense given the corpus composition. Deutsche Bank records are dense with corporate entity names — account holders, counterparties, intermediaries. Legal documents reference firms, trusts, and LLCs. The email corpus contains business correspondence with organizational signatures. In a financial crime corpus, organizations outnumber people because the criminal infrastructure is built from corporate shells (PAPER TRAIL Project, 2026a).
The Singleton Problem
Of the 2.38 million entities, 197,945 appear in exactly one document; these are called singletons. That is 24.1% of the entire entity population — nearly one in four (PAPER TRAIL Project, 2026c). Singletons drive the Chao1 estimate upward. Chao1 is a statistical method borrowed from ecology: it compares the number of names seen exactly once against the number seen exactly twice to estimate how many names were missed entirely. Here it suggests 468,000 missing entities, and the singletons themselves form a long tail of unverifiable names.
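The Chao1 arithmetic is simple enough to show directly. The function below is the textbook estimator; the toy counts in the usage note are invented for illustration, since the corpus's doubleton count lives in the Chao1 summary export rather than in this section.

```python
# Chao1 completeness estimate: singletons (names seen once, f1) and
# doubletons (names seen twice, f2) bound how many names were missed.
# Counts passed in are the caller's; none are hard-coded corpus figures.

def chao1_missing(f1, f2):
    """Estimated number of unseen entities given singleton/doubleton counts."""
    if f2 > 0:
        return f1 * f1 / (2 * f2)      # classic Chao1 lower bound
    return f1 * (f1 - 1) / 2           # bias-corrected form when f2 == 0
```

With toy counts, `chao1_missing(200, 50)` gives 400 unseen names. The estimate grows with the square of the singleton count, which is why a corpus with 197,945 singletons looks so incomplete.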
Some singletons are genuine: a person mentioned once in a single wire transfer, a company that appears on one FedEx invoice, a location referenced in one email. These are real entities that happen to be rare in the corpus. Others are OCR artifacts — fragments of mangled text that the name detection software interpreted as entity names but that correspond to nothing real (PAPER TRAIL Project, 2026c).
Organizations have the highest singleton ratio at 24.6% (119,312 singletons out of 485,045 organization entities) (PAPER TRAIL Project, 2026c). This is where OCR fragmentation hits hardest. Corporate names contain punctuation, abbreviations, and formatting that OCR engines frequently corrupt. "HBRK Associates Inc." might appear as "HBRK Associates, Inc," "HBRK ASSOCIATES INC," and "H.B.R.K. Associates" — each a separate entity in the raw extraction.
The Deduplication Layers
The pipeline attacks this problem at multiple levels. Script 04b runs rule-based deduplication immediately after extraction, merging obvious variants (case normalization, punctuation stripping, common abbreviation expansion). This is the first pass — fast but shallow (PAPER TRAIL Project, 2026d).
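A minimal sketch of that first pass, assuming the three rule families named above (case normalization, punctuation stripping, abbreviation expansion). The real script applies more rules; the abbreviation table here is illustrative.

```python
import string

# Rule-based dedup key: lowercase, drop punctuation, expand common
# corporate abbreviations. Names that reduce to the same key are
# merged. The ABBREVIATIONS table is a small illustrative subset.

ABBREVIATIONS = {"inc": "incorporated", "corp": "corporation", "co": "company"}

def dedup_key(name):
    """Canonical key under case/punctuation/abbreviation rules."""
    cleaned = name.lower().translate(str.maketrans("", "", string.punctuation))
    words = [ABBREVIATIONS.get(w, w) for w in cleaned.split()]
    return " ".join(words)
```

Rules like these collapse "HBRK Associates, Inc" and "H.B.R.K. Associates Inc." into one key, but a truncated variant such as "H.B.R.K. Associates" (no suffix at all) still escapes. That residue is what the probabilistic layer exists for.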
Script 19 performs probabilistic entity resolution using Splink, a Python library that implements a statistical matching method originally developed by Fellegi and Sunter (1969). Rather than comparing exact strings, Splink computes the probability that two entity records refer to the same real-world entity based on name similarity, entity type, co-occurrence patterns, and document context. Splink runs here on a DuckDB backend (a fast in-process database engine) for performance. The result: 2.38 million raw entities reduced to 519,000 clusters (PAPER TRAIL Project, 2026e).
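Splink's configuration is its own topic, but the Fellegi-Sunter statistic underneath can be shown with toy numbers. Each compared field contributes an odds factor built from an m-probability (chance the field agrees when the records truly match) and a u-probability (chance it agrees by coincidence). The field names, m/u values, and prior below are invented for illustration; they are not the pipeline's fitted parameters or Splink's API.

```python
# Toy Fellegi-Sunter scoring. m = P(field agrees | same entity),
# u = P(field agrees | different entities). All values illustrative.

FIELDS = {
    "name_similar": (0.95, 0.01),   # names rarely agree by chance
    "same_type":    (0.99, 0.40),   # entity type agrees often regardless
}

def match_probability(agreements, prior_odds=1 / 1000):
    """Posterior probability that two records refer to the same entity."""
    odds = prior_odds
    for field, agrees in agreements.items():
        m, u = FIELDS[field]
        # Agreement multiplies the odds by m/u; disagreement by (1-m)/(1-u).
        odds *= (m / u) if agrees else ((1 - m) / (1 - u))
    return odds / (1 + odds)
```

The key property is visible even in the toy: a name agreement moves the odds far more than a type agreement, because names almost never agree by coincidence.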
But entity resolution can also create false connections. Script 19b dissolved 8 garbage clusters containing 3,576 entities that had been incorrectly merged (PAPER TRAIL Project, 2026f). These were cases where OCR garbage from one document matched OCR garbage from another document — meaningless strings that happened to be similar. The dissolution freed these entities from false clusters, restoring the integrity of the resolution.
The Nadia Marcinkova Problem
The known duplicate that best illustrates the challenge is Nadia Marcinkova — a named co-conspirator who appears across flight logs, legal documents, bank records, and correspondence. In the raw entity extraction, she has 8 distinct OCR variants (PAPER TRAIL Project, 2026a). Some differ by punctuation. Some by character substitution. Some by truncation.
Splink correctly identified most of these as the same person. But the fact that a prominent, frequently mentioned individual has 8 variants gives a sense of the fragmentation affecting less prominent names. If Marcinkova has 8 variants, how many does a person mentioned in only two documents have? How many singletons in the entity database are actually the second mention of someone already counted?
The Relationship Graph
From the 2.38 million entities, Script 05 built a co-occurrence graph (a network where entities are connected when they appear in the same document). This produced 29.5 million unique relationship pairs — the raw material for every network analysis downstream (PAPER TRAIL Project, 2026g).
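The co-occurrence pass itself is conceptually a pair count. A sketch, assuming a mapping from document id to the entities found in it (the input shape is an assumption for illustration, not Script 05's actual interface):

```python
from collections import Counter
from itertools import combinations

# Co-occurrence graph: every pair of entities sharing a document gets
# an edge, weighted by the number of documents they share.

def build_cooccurrence(doc_entities):
    """Return a Counter mapping sorted entity pairs to shared-doc counts."""
    edges = Counter()
    for entities in doc_entities.values():
        # sorted() + combinations() yields each unique pair exactly once
        # per document, regardless of mention order or repetition.
        for pair in combinations(sorted(set(entities)), 2):
            edges[pair] += 1
    return edges
```

The quadratic blowup is visible in the pair loop: a document mentioning n entities contributes n(n-1)/2 edges, which is how 2.05 million documents yield 29.5 million unique pairs.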
The relationship count is dominated by high-frequency entities. Jeffrey Epstein co-occurs with tens of thousands of other entities. Deutsche Bank co-occurs with every account holder in the financial records. These high-frequency hubs create dense subgraphs that the community detection algorithm must untangle into meaningful clusters (PAPER TRAIL Project, 2026h).
The 125,620 communities found by the Leiden algorithm (a method for identifying groups of entities that appear together more frequently with each other than with the rest of the network) represent groupings of tightly connected entities (PAPER TRAIL Project, 2026h). The 535,318 structural hole brokers are entities that bridge otherwise disconnected communities: potential gatekeepers, intermediaries, or links between separate parts of the network. They were identified using Burt's constraint measure, a metric that quantifies how much an entity depends on its direct connections versus how much it bridges between groups (Burt, 1992; PAPER TRAIL Project, 2026h).
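Burt's constraint can be computed from scratch on a small unweighted graph. This sketch assumes equal tie weights (the pipeline's weighting may differ); the measure itself is the point: a node whose contacts are all interconnected is highly constrained, while a broker spanning disconnected groups scores low.

```python
# Burt's constraint for an unweighted graph, from the definition:
# c_i = sum over neighbors j of (p_ij + sum_q p_iq * p_qj)^2,
# where p_xy is the share of x's ties invested in y. Low constraint
# means the node bridges otherwise-disconnected contacts (a broker).

def constraint(adj, i):
    nbrs = adj[i]
    p_i = {j: 1 / len(nbrs) for j in nbrs}   # equal tie weights (assumed)
    total = 0.0
    for j in nbrs:
        # Indirect investment in j through shared contacts q.
        indirect = sum(
            p_i[q] * (1 / len(adj[q]) if j in adj[q] else 0.0)
            for q in nbrs if q != j
        )
        total += (p_i[j] + indirect) ** 2
    return total
```

A star center, the purest broker, scores 1/k for k spokes; a node in a triangle, fully enclosed by its two contacts, scores 1.125. Ranking 2.38 million entities by this measure is what surfaces the 535,318 brokers.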
What 2.38 Million Entities Means
This is not 2.38 million unique real-world entities. It is 2.38 million entity mentions extracted by an automated system from degraded documents, reduced to 519,000 clusters by probabilistic matching, with an estimated 468,000 additional entities never detected (PAPER TRAIL Project, 2026a). The number is simultaneously too large (inflated by OCR artifacts and duplication) and too small (missing a third of the estimated true population).
Every downstream analysis — network topology, temporal change-points, cross-domain synthesis — is built on this imperfect foundation. The pipeline does not pretend otherwise. It measures the imperfection (using the Chao1 estimator), corrects what it can (through Splink resolution and garbage cluster dissolution), and adjusts confidence scores for what it cannot (PAPER TRAIL Project, 2026i).
The 2.38 million is not the answer. It is the starting point.
References
Burt, R. S. (1992). Structural holes: The social structure of competition. Harvard University Press.
Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64(328), 1183–1210.
Honnibal, M., & Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing [Computer software]. https://spacy.io
PAPER TRAIL Project. (2026a). Entity extraction script (spaCy, 8 workers) [Computer software]. app/scripts/04_extract_entities.py
PAPER TRAIL Project. (2026b). Entity type distribution visualization [Visualization]. visualizations/ppt_ep02_slide10_entity_donut.svg
PAPER TRAIL Project. (2026c). Singleton analysis [Data set]. _exports/validation/singleton_analysis.csv
PAPER TRAIL Project. (2026d). Entity deduplication script [Computer software]. app/scripts/04b_deduplicate_entities.py
PAPER TRAIL Project. (2026e). Splink entity resolution results [Data set]. _exports/entity_resolution/, 519K clusters.
PAPER TRAIL Project. (2026f). Garbage cluster dissolution (Script 19b) [Data set]. 8 clusters, 3,576 entities freed.
PAPER TRAIL Project. (2026g). Entity relationship graph [Database table]. PostgreSQL entity_relationships, 29.5M unique pairs, db=epstein_files.
PAPER TRAIL Project. (2026h). Leiden community detection and Burt's structural holes [Data set]. _exports/network/, 125,620 communities, 535,318 brokers.
PAPER TRAIL Project. (2026i). Chao1 completeness summary [Data set]. _exports/validation/chao1_summary.json
PAPER TRAIL Project. (2026j). NER coverage by data set visualization [Visualization]. visualizations/ppt_ep02_slide15_ner_coverage.svg