Garbage Cluster Dissolution: When Entity Resolution Connects Noise to Noise

TLDR

A cleanup script (Script 19b) dissolved 8 entity groups containing 3,576 records. The groups had been merged incorrectly because scanning garbage (unreadable text that the scanning software produced from blurred or damaged pages) in one document matched scanning garbage in another. The dissolution demonstrates a fundamental limitation: a tool that uses statistics to decide which database records refer to the same person cannot distinguish corrupted real names from random character sequences, and when noise matches noise, the model creates phantom relationships (PAPER TRAIL Project, 2026a).

The Discovery

After the entity resolution tool (Splink) reduced 2.38 million raw entity records to 519,000 groups using its statistical matching model, quality review of the largest and highest-connectivity groups revealed a problem. Eight groups contained entities whose member strings were not corrupted versions of a real name but were unintelligible scanning output — strings like "f5.53't PasigottlIornterronersiner" from blank Deutsche Bank Know Your Customer forms (PAPER TRAIL Project, 2026a; PAPER TRAIL Project, 2026b).

These garbage strings had been assigned to the same group because the statistical model evaluates character-level similarity between compared records. Two garbage strings from different documents, each produced by automated scanning attempting to read blank or damaged page areas, can share enough random characters to exceed the match threshold. The model has no way to know that neither string represents a real entity. It sees character overlap, computes a match weight, and merges (Fellegi & Sunter, 1969; PAPER TRAIL Project, 2026c).
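A minimal sketch of that failure mode, using Python's standard-library `difflib` as a stand-in for the string comparators (Jaro-Winkler, Levenshtein) a linkage tool like Splink typically uses. The first string is the one quoted above; the second is a hypothetical garbage string from another document, and the threshold value is illustrative, not the pipeline's actual configuration:

```python
from difflib import SequenceMatcher

# First string is quoted in the text; second is a hypothetical garbage
# string from a different scan of the same blank form template.
a = "f5.53't PasigottlIornterronersiner"
b = "f5.53't PasigottlIornterronersner"

# Character-level similarity ratio: 2 * matched_chars / total_chars.
ratio = SequenceMatcher(None, a, b).ratio()

THRESHOLD = 0.8  # illustrative match threshold
print(f"similarity = {ratio:.2f}, merged = {ratio >= THRESHOLD}")
```

Neither string names a real entity, but their character overlap alone clears the threshold, so the comparator reports exactly what it would report for "David Rodgers" versus "Davld Rodgers".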

The 8 garbage groups contained a total of 3,576 entities. Some groups had hundreds of members — all garbage, all from different documents, all connected by the coincidence of similar-looking scanning noise (PAPER TRAIL Project, 2026a).

Why This Happens

The statistical model works on a probabilistic assumption: compared records contain real entity names with occasional character-level corruption (Fellegi & Sunter, 1969). Under this assumption, character similarity is evidence of shared identity. "David Rodgers" and "Davld Rodgers" share most characters because they refer to the same person. The model correctly assigns a high match weight and merges them.
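The Fellegi-Sunter weight calculation behind this behavior can be sketched as follows. The `m` and `u` values here are illustrative, not the pipeline's trained parameters:

```python
import math

# Fellegi-Sunter match weight for one comparison field (sketch).
#   m = P(field agrees | records refer to the same entity)
#   u = P(field agrees | records refer to different entities)
m, u = 0.95, 0.01  # illustrative values

agreement_weight = math.log2(m / u)                # evidence FOR a match
disagreement_weight = math.log2((1 - m) / (1 - u)) # evidence AGAINST a match

print(f"agreement: {agreement_weight:+.2f} bits")
print(f"disagreement: {disagreement_weight:+.2f} bits")
```

The weight depends only on whether the fields agree, not on whether the agreeing strings are plausible names. Two agreeing garbage strings earn the same positive evidence as two agreeing real names.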

But the assumption breaks when the input contains strings that are not corrupted versions of any real name. Scanning garbage is generated by several mechanisms: the scanning engine attempts to read blank form fields and produces random characters; it reads background noise, watermarks, or stains as text; it processes damaged areas where the original characters are irrecoverable (PAPER TRAIL Project, 2026b).

The resulting strings are not uniformly random — they tend to share certain character patterns because scanning engines have systematic biases in how they misinterpret blank space. Similar page layouts produce similar garbage. This means garbage strings from similar document types (such as multiple pages of the same Deutsche Bank form template) will have higher mutual similarity than true random strings. The statistical model interprets this similarity as evidence of shared identity (PAPER TRAIL Project, 2026c).

The Dissolution

The 8 garbage groups were identified through manual inspection. The detection criterion was straightforward: groups where the majority of member entities were unintelligible strings rather than recognizable name variants. The cleanup script (Script 19b) then reset the group assignment for all 3,576 affected entities, returning them to isolated-record status (PAPER TRAIL Project, 2026a).
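The reset step can be sketched with a toy in-memory store. The record layout and field names below are assumptions for illustration, not the actual schema Script 19b operates on:

```python
# Toy stand-in for the entity store; field names are assumed, and the
# second entity string is hypothetical.
entities = [
    {"id": 1, "name": "f5.53't PasigottlIornterronersiner", "group_id": 101},
    {"id": 2, "name": "xq#rrt lomser", "group_id": 101},
    {"id": 3, "name": "David Rodgers", "group_id": 7},
]
garbage_group_ids = {101}  # the flagged groups, here one toy ID

# Dissolve: members of flagged groups revert to isolated-record status.
# The records survive (available for re-grouping after rescanning) but
# no longer connect to each other.
for entity in entities:
    if entity["group_id"] in garbage_group_ids:
        entity["group_id"] = None

print([e["group_id"] for e in entities])  # [None, None, 7]
```

The legitimate group (the real name) is untouched; only membership in the flagged groups is cleared.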

The dissolved entities are not lost. They remain in the database as individual records, available for re-grouping if improved scanning (through vision-language model reprocessing, for example) produces intelligible text. But they are no longer connected to each other through false match relationships (PAPER TRAIL Project, 2026a).

The net effect on the group count was minimal — approximately 519,000 groups before and after, because the freed entities became isolated records rather than forming new groups. The significant effect was on the network graph: 8 false connection hubs were removed, preventing corrupted entities from contaminating the co-occurrence analysis, the community detection, and the cross-domain synthesis (PAPER TRAIL Project, 2026d).
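The hub-removal effect can be shown with a toy co-occurrence check, assuming (as a simplification of the actual graph construction) that two documents are connected when they mention a shared entity group:

```python
# Toy data: each document maps to the entity groups it mentions.
# "garbage_hub" plays the role of a dissolved garbage group.
doc_groups = {
    "doc_A": {"garbage_hub", "real_1"},
    "doc_B": {"garbage_hub", "real_2"},
}

def connected(x: str, y: str, groups: dict) -> bool:
    """Two documents are linked if they share any entity group."""
    return bool(groups[x] & groups[y])

print(connected("doc_A", "doc_B", doc_groups))  # True: phantom connection

# After dissolution, the garbage group no longer appears in either document.
dissolved = {doc: g - {"garbage_hub"} for doc, g in doc_groups.items()}
print(connected("doc_A", "doc_B", dissolved))   # False: no real relationship
```

With hundreds of members per garbage group, a single hub of this kind falsely connects hundreds of unrelated documents, which is why removing 8 hubs matters more than the unchanged group count suggests.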

The Compound Failure Mode

The garbage group problem is the second stage of a compound failure that begins with scanning hallucination (PAPER TRAIL Project, 2026b).

Stage one: Automated scanning reads a blank form, a damaged region, or a repeated page header and produces a phantom entity string. This is the phenomenon documented in OBS-5 and OBS-6 — "Poland" hallucinated from blank KYC field labels, "Krakow" hallucinated from UBS account statement headers (PAPER TRAIL Project, 2026b).

Stage two: Entity resolution takes the phantom entity and, finding character-level similarity to another phantom entity from a different document, merges them into a group. The group now connects documents that have no actual relationship — they share only the coincidence of similar scanning noise (PAPER TRAIL Project, 2026a).

Stage three (prevented by the cleanup script): If left in place, the garbage group feeds into downstream analysis. Co-occurrence calculations would find that these phantom entities co-occur with real entities in the same documents, creating false connections in the relationship graph. Community detection would place the garbage group alongside real entity communities. Cross-domain synthesis would attempt to correlate phantom patterns across analytical domains (PAPER TRAIL Project, 2026d).

Each stage amplifies the error. A single scanning hallucination on a blank form becomes a phantom entity. Entity resolution connects it to other phantoms. Network analysis propagates the false connections. The garbage group dissolution at stage two prevents the cascade from reaching stage three.

What It Teaches

The dissolution of 8 groups containing 3,576 entities is quantitatively small — less than 1% of the 519,000 total groups. But it is qualitatively important because it reveals the boundary condition of statistical record linkage (PAPER TRAIL Project, 2026a).

Entity resolution is powerful when the input contains real names with noise. It is dangerous when the input contains noise without real names. The model cannot tell the difference. It treats every string as a potentially corrupted version of some real entity and looks for matches accordingly. When the string is genuine noise, the model finds matches that are also noise, and produces a group that looks structurally identical to a legitimate merge (Fellegi & Sunter, 1969).

The only detection method available was manual inspection — reading the group members and recognizing that none of them were intelligible. This does not scale. With 519,000 groups, inspecting every one is impractical. The 8 dissolved groups were found because they were among the largest, which drew attention during quality review. Smaller garbage groups, containing only 2-3 members, may still exist undetected (PAPER TRAIL Project, 2026a).
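One direction for scaling beyond manual review would be a character-statistics screen. The heuristic below is an assumption, not part of Script 19b: it flags strings whose tokens mix digits with letters, carry stray symbols, or contain internal capitals, none of which occur in ordinary personal names:

```python
import re

# Hypothetical garbage screen (not part of the pipeline): flag name
# strings whose tokens look unlike real names. Thresholds and rules
# are illustrative assumptions.
def looks_like_garbage(name: str) -> bool:
    for token in name.split():
        if re.search(r"\d", token):             # digits inside a name token
            return True
        if re.search(r"[^A-Za-z'\-.]", token):  # stray symbols like # or !
            return True
        if re.search(r"[a-z][A-Z]", token):     # internal capital: "ttlIorn"
            return True
    return False

print(looks_like_garbage("David Rodgers"))                       # False
print(looks_like_garbage("f5.53't PasigottlIornterronersiner"))  # True
```

Ranking groups by the fraction of members this screen flags could surface small garbage groups that manual review of only the largest groups would miss, though any such heuristic would itself need validation against corrupted-but-real names.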

This is an honest limitation of the pipeline. Entity resolution reduced 2.38 million entities to 519,000 groups, enabling cross-dataset entity tracking for the first time. But it also introduced false connections that required a cleanup pass, and an unknown number of smaller false connections may remain. The pipeline reports both results — the reduction and the residual uncertainty.

References

Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64(328), 1183-1210. https://doi.org/10.1080/01621459.1969.10501049

PAPER TRAIL Project. (2026a). Garbage cluster dissolution — 8 clusters, 3,576 entities (Script 19b) [Data]. Referenced in CLAUDE.md

PAPER TRAIL Project. (2026b). Observations — OBS-5 and OBS-6 scanning artifacts [Data]. OBSERVATIONS.md

PAPER TRAIL Project. (2026c). Implementation specification — Fellegi-Sunter model and match thresholds [Technical report]. research/IMPLEMENTATION.md

PAPER TRAIL Project. (2026d). Entity resolution — 519,000 clusters from 2.38 million entities (Script 19) [Software]. app/scripts/19_entity_resolution.py