OBS-5/6: When Scanning Software Hallucinates Countries From Blank Forms

TLDR

Two observations were retracted after visual inspection revealed that scanning software (OCR, or optical character recognition: software that reads text from scanned images) hallucinated geographic references from blank form fields and repeating document headers. "Poland" came from the pre-printed label "U.S. Persons/ Non-U.S. Persons" on an empty KYC form (a "Know Your Customer" form that banks use to verify client identity), and "Krakow" was manufactured from the header "Resource Management Account" repeated across 190 pages of UBS statements. Both serve as methodological lessons about systematic error generation at scale.

The Poland That Was Not There

OBS-5 reported a potentially significant finding: a Polish national listed as a beneficial owner on an Epstein Deutsche Bank account. In a corpus connected to international trafficking networks, a Polish connection would have been investigatively significant — particularly given Poland's active Investigation Team No. 5 targeting trafficking and Russian intelligence links to the Epstein network (PAPER TRAIL Project, 2026a).

The source was EFTA01268833.pdf, page 17, a Deutsche Bank KYC form (U.S. Department of Justice [DOJ], 2025a). The scanning software had extracted text that appeared to reference Poland as a nationality for a beneficial owner.

Visual inspection of the actual document revealed the truth. The "Second Beneficial Owner," "Third Beneficial Owner," and "Fourth Beneficial Owner" sections were completely blank. No names. No addresses. No nationalities. The page bears a "CLOSED" stamp dated December 17, 2019.

The scanning software had misread a pre-printed form label. The actual text reads: "14. SSN (U.S. Persons/ Non-U.S. Persons)." The scanner rendered this as: "14. UN (US Peon/ Stolle. Poland." The word "Poland" was manufactured from "Persons/Non" through character-level misrecognition on a blank government form (PAPER TRAIL Project, 2026b).

This single hallucination ("hallucination" is the term for scanning software generating text that does not exist in the original document) cast immediate doubt on the 6,876 "Poland" entity mentions across 690 documents in four data sets. A significant fraction of those mentions may be identical scanning artifacts generated from the same blank KYC form template used across hundreds of Deutsche Bank documents.
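
A first-pass triage of those 6,876 mentions can be automated before anyone opens a PDF: if the OCR context around a "Poland" hit contains fragments of the mangled SSN field label, the hit is almost certainly the same blank-form artifact. The sketch below assumes each mention is available as a record with a document ID and its surrounding extracted text; the field names and marker list are illustrative, not the project's actual schema.

```python
import re
from collections import Counter

# Fragments the scanner produced from the label "14. SSN (U.S. Persons/ Non-U.S. Persons)"
ARTIFACT_MARKERS = re.compile(r"UN \(US|Peon|Stolle|Persons/\s*Non", re.IGNORECASE)

def triage_poland_mentions(mentions):
    """Split "Poland" mentions into likely form-label artifacts and hits needing review.

    `mentions` is an iterable of dicts such as
    {"doc_id": "EFTA01268833", "page": 17, "context": "14. UN (US Peon/ Stolle. Poland"}.
    """
    verdicts = Counter()
    suspect_docs = set()
    for m in mentions:
        if ARTIFACT_MARKERS.search(m["context"]):
            verdicts["likely_form_artifact"] += 1
            suspect_docs.add(m["doc_id"])
        else:
            verdicts["needs_visual_review"] += 1
    return verdicts, suspect_docs
```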

The Krakow That Never Existed

OBS-6 reported "AA Resource Krakow" as a financial entity appearing across 190 pages of Deutsche Bank documents. A Krakow-based financial entity in Epstein records would have been a direct investigative lead for Polish authorities (PAPER TRAIL Project, 2026b).

The source document, EFTA01275697.pdf, is a 27.2 MB, 190-page file (DOJ, 2025b). But it is not a Deutsche Bank document at all. It is a Ghislaine Maxwell UBS Financial Services account statement package — specifically, a Resource Management Account from February 2014. The financial advisors listed are Scott Stackman and Lyle Casriel. The account value went from $0.00 on January 31 to $27,047.74 on February 28, 2014, funded by a wire from JPMorgan Chase. The seizure reference numbers run from SDNY_GM_00023306 through SDNY_GM_00023496.

The scanning software processed the repeating header "Resource Management Account" across all 190 pages and produced "Resource Krakow &moult" — transforming a financial product name into a phantom Polish city reference. The mangling was consistent across every page, generating 190 instances of a non-existent geographic entity.

A search for any other "Krakow" references in the entire 2.1-million-document corpus found exactly four matches — all containing "KRAKOWER, JUDITH R," a surname in House Oversight records. There are zero verified Krakow geographic references in the corpus. The city does not appear in any legitimate capacity.
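
The distinction that search relied on is mechanical: match "Krakow" only as a standalone word and treat "Krakower" as a surname, never as the city. A minimal sketch of that check, assuming pages of extracted text are available, follows; only the regex logic reflects the result described above.

```python
import re

CITY = re.compile(r"\bKrakow\b", re.IGNORECASE)       # standalone word only
SURNAME = re.compile(r"\bKrakower\b", re.IGNORECASE)  # "KRAKOWER, JUDITH R"

def categorize_page(text):
    """Classify one page of extracted text by the kind of 'Krakow' string it contains."""
    if SURNAME.search(text):
        return "surname"     # the four House Oversight matches fall here
    if CITY.search(text):
        return "geographic"  # zero verified matches of this kind in the corpus
    return "none"
```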

Why Both Were Retracted

Both observations followed the same failure pattern. A scanning engine processed document features that contain no information — blank fields, repeating headers — and generated entity-like text that matched investigatively significant patterns (PAPER TRAIL Project, 2026b). The extraction pipeline, designed to find entities, dutifully recorded these phantom references alongside genuine data.

The retraction was straightforward once visual inspection was applied. But the observations persisted through multiple processing stages before being caught, because the text outputs were plausible enough to survive automated quality checks. "Poland" is a real country. "Krakow" is a real city. Neither triggered the kind of obvious error that automated filters catch.
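
One check that could have flagged both phantoms, sketched here as a hypothetical addition rather than a description of the existing pipeline, is to score how much of the text surrounding an extracted place name is recognizable vocabulary: a real country name embedded in garble like "UN (US Peon/ Stolle." deserves visual review even though the name itself is valid. The wordlist, threshold, and function names below are assumptions.

```python
import re

def garble_ratio(context, wordlist):
    """Fraction of alphabetic tokens in the OCR context not found in `wordlist`."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", context)]
    if not tokens:
        return 1.0
    return sum(t not in wordlist for t in tokens) / len(tokens)

def needs_visual_review(context, wordlist, threshold=0.5):
    """Flag an extracted place name whose surrounding text is mostly garble.

    In "14. UN (US Peon/ Stolle. Poland" the neighbors of "Poland" are not
    expected banking-form vocabulary, so the hit is queued for inspection
    instead of passing straight into the entity index.
    """
    return garble_ratio(context, wordlist) >= threshold
```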

Structural Lessons

These are not random errors. They are systematic artifacts produced by the interaction between scanning technology and document structure. Blank government forms with pre-printed labels are a recurring document type in financial compliance records. Every blank KYC form in the Deutsche Bank corpus has the potential to generate the same hallucination. Repeating headers across multi-page statements will produce the same mangled text on every page, inflating entity counts by the number of pages rather than the number of genuine mentions.
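
The page-count inflation is straightforward to correct once it is recognized: count unique document-and-context pairs rather than raw mentions, so a header repeated on every page of one statement contributes one mention instead of 190. A minimal sketch, assuming each mention record carries its document ID and surrounding context:

```python
import re

def normalize(context):
    """Collapse whitespace and case so repeated boilerplate compares equal."""
    return re.sub(r"\s+", " ", context).strip().lower()

def deduplicated_mention_count(mentions):
    """Count unique (document, normalized context) pairs instead of raw mentions.

    `mentions` is an iterable of dicts with 'doc_id' and 'context' keys
    (assumed layout); 190 identical "Resource Krakow &moult" contexts in one
    statement package collapse to a single countable mention.
    """
    return len({(m["doc_id"], normalize(m["context"])) for m in mentions})
```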

The mitigation is VLM (Vision Language Model) reprocessing — a more advanced type of document scanning that uses artificial intelligence to understand the visual layout of a page, allowing it to distinguish between filled and blank form fields rather than just extracting text character by character (PAPER TRAIL Project, 2026c). But the observations also validate a simpler rule: high-mention entities concentrated in contiguous document ranges are likely boilerplate artifacts and must be verified against source documents before being trusted.
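
That rule can be expressed as a simple screen over the entity index: an entity with many mentions packed into a handful of contiguous page runs looks like boilerplate, not evidence, and goes to manual verification first. The sketch below is illustrative; the thresholds and the mention record layout are assumptions.

```python
from collections import defaultdict

def contiguous_runs(pages):
    """Count contiguous runs in a collection of page numbers (e.g. [1, 2, 3, 7, 8] -> 2)."""
    pages = sorted(set(pages))
    return sum(1 for i, p in enumerate(pages) if i == 0 or p != pages[i - 1] + 1)

def flag_boilerplate_suspects(mentions, min_mentions=100, max_runs=3):
    """Return entities with many mentions concentrated in few contiguous page runs.

    `mentions`: iterable of dicts with 'entity', 'doc_id', and 'page' keys
    (assumed layout). "AA Resource Krakow", with 190 mentions in one unbroken
    run of a single document, would be flagged for source verification.
    """
    pages_by_entity = defaultdict(lambda: defaultdict(list))
    for m in mentions:
        pages_by_entity[m["entity"]][m["doc_id"]].append(m["page"])

    suspects = []
    for entity, docs in pages_by_entity.items():
        total = sum(len(p) for p in docs.values())
        if total < min_mentions:
            continue
        runs = sum(contiguous_runs(p) for p in docs.values())
        if runs <= max_runs:
            suspects.append(entity)
    return suspects
```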

Both observations are preserved in the project record — not as leads, but as methodological lessons. They demonstrate exactly the kind of error-correction discipline that forensic document analysis requires: the willingness to retract findings that dissolve under source verification, no matter how investigatively convenient they would have been.

References

PAPER TRAIL Project. (2026a). Stakeholder analysis: Poland (STK-1). [Data analysis: STAKEHOLDERS.md].

PAPER TRAIL Project. (2026b). Observation log: OBS-5 and OBS-6. [Data analysis: OBSERVATIONS.md].

PAPER TRAIL Project. (2026c). VLM reprocessing. [Script: app/scripts/08_vlm_reprocess.py].

PAPER TRAIL Project. (2026d). Tradecraft: Anti-apophenia controls. [Data analysis: research/TRADECRAFT.md, Section 5].

U.S. Department of Justice. (2025a). Epstein files: Data Set 10. EFTA01268833.pdf, page 17. justice.gov/epstein.

U.S. Department of Justice. (2025b). Epstein files: Data Set 10. EFTA01275697.pdf (190 pages). justice.gov/epstein.