OCR Hallucinations

TLDR

OCR engines (software that converts images of text into searchable characters) can produce phantom entities from blank form labels and repeated document headers — "Poland" from a KYC field label reading "U.S. Persons/ Non-U.S. Persons" and "Krakow" from a UBS page header reading "Resource Management Account." Both phantoms generated thousands of false entity references before being caught and retracted, validating the pipeline's controls against spurious pattern detection.

When Machines See What Is Not There

Optical character recognition, or OCR, turns images of text into searchable characters. It is the foundation of the entire document processing pipeline — without OCR, the 2.1 million files in the corpus would be unsearchable photographs. But OCR engines make mistakes, and at scale, those mistakes become systematic (PAPER TRAIL Project, 2026a).

The two most instructive failures in this project both involved the same mechanism: OCR engines interpreting pre-printed text on documents as meaningful content when the actual fields were blank. The resulting phantom entities propagated through the name detection system, the entity database, and the relationship graph before being caught (PAPER TRAIL Project, 2026a).

OBS-5: The Phantom Poland

Observation 5 began as a potentially significant finding: the word "Poland" appeared in connection with Deutsche Bank KYC forms (Know Your Customer — identity verification documents that banks are required to complete for each account holder) in Data Set 10. Given that Poland had opened a formal investigation into Epstein trafficking connections, a Polish reference in German banking documents seemed noteworthy (PAPER TRAIL Project, 2026a).

Investigation revealed the source. The KYC form at EFTA01268833.pdf, page 17, contains a pre-printed field label reading "SSN (U.S. Persons/ Non-U.S. Persons)." The OCR engine rendered this as "UN (US Peon/ Stolle. Poland." The field itself was completely blank — no Social Security number, no data of any kind (PAPER TRAIL Project, 2026b). The "Second Beneficial Owner," "Third Beneficial Owner," and "Fourth Beneficial Owner" sections were all empty. A "CLOSED" stamp dated 12/17/19 confirmed the form was a dead record.

But the damage was already done in the entity database. The OCR artifact generated 6,876 "Poland" entity mentions across 690 documents in 4 data sets. Most were the same blank form label repeated across hundreds of pages of KYC paperwork. The automated name detection software could not distinguish between "Poland" appearing in a sentence about international wire transfers and "Poland" hallucinated from a blank government form (PAPER TRAIL Project, 2026a).

OBS-6: The Phantom Krakow

Observation 6 was even more dramatic. The word "Krakow" appeared repeatedly in what turned out to be Ghislaine Maxwell's UBS account statements — 190 pages in a single PDF (EFTA01275697.pdf, 27.2 MB). A reference to the Polish city Krakow in Maxwell's bank records would have been a significant cross-domain finding (PAPER TRAIL Project, 2026a).

The source was the UBS header "Resource Management Account," printed at the top of every page. Across 190 pages of repetition, the OCR engine eventually rendered it as "Resource Krakow & moult." The mangled text was then extracted as a location entity and propagated into the database (PAPER TRAIL Project, 2026a).

A corpus-wide search confirmed the absence of any real Krakow reference. The only other matches were 4 House Oversight documents containing the surname "KRAKOWER, JUDITH R" — a person, not a city. Zero verified references to the Polish city Krakow exist in the 2.1 million document corpus (PAPER TRAIL Project, 2026c).

The Diagnostic Pattern

Both hallucinations share a diagnostic signature: high-mention entities concentrated in contiguous document ranges. When an entity appears thousands of times but only within documents from a narrow range of sequential identifiers, it is likely a boilerplate artifact (repeating text like headers or form labels that appears on many pages). The KYC forms share document numbers within the same volume. The UBS statements are 190 consecutive pages in a single file (PAPER TRAIL Project, 2026a).

This pattern is now a documented detection rule. When a new entity surfaces with high frequency but narrow document range, the first question is not "what does this mean?" but "is this real?" (PAPER TRAIL Project, 2026d).
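The detection rule translates naturally into a screening pass over the entity mentions. A minimal sketch, assuming an in-memory list of (entity, document ID) pairs rather than the project's actual PostgreSQL schema; the threshold values are illustrative, not the documented rule's parameters:

```python
from collections import defaultdict

def flag_boilerplate_suspects(mentions, min_mentions=1000, max_doc_spread=0.05):
    """Flag entities with many mentions confined to a narrow slice of
    sequential document IDs -- the OBS-5/OBS-6 signature.

    mentions: iterable of (entity_name, numeric_doc_id) pairs.
    Returns a list of (entity, mention_count, spread) tuples, where
    spread is the entity's document-ID range as a fraction of the
    whole corpus range.
    """
    by_entity = defaultdict(list)
    for entity, doc_id in mentions:
        by_entity[entity].append(doc_id)

    all_ids = [doc_id for _, doc_id in mentions]
    corpus_span = max(max(all_ids) - min(all_ids), 1)

    suspects = []
    for entity, docs in by_entity.items():
        if len(docs) < min_mentions:
            continue  # low-frequency entities are not this failure mode
        spread = (max(docs) - min(docs)) / corpus_span
        if spread <= max_doc_spread:  # high frequency, narrow range
            suspects.append((entity, len(docs), spread))
    return suspects
```

An entity like the phantom "Poland" — thousands of mentions, all within one contiguous run of KYC form pages — scores a tiny spread and is flagged, while a genuinely ubiquitous name appearing across the whole corpus is not.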

Why Retraction Matters

Both observations were formally retracted in the project's observation log. They remain in the file, not deleted but clearly marked as methodological lessons (PAPER TRAIL Project, 2026a). This is deliberate. In investigative analysis, the temptation is always to see patterns that confirm hypotheses. Poland is investigating Epstein connections — therefore "Poland" in bank records must be significant. Maxwell has international financial ties — therefore "Krakow" in her account statements must mean something.

The retraction process is the control against apophenia — the human tendency to perceive meaningful connections in random information. This tendency is the central risk in any large-scale document analysis. The OCR hallucination problem is apophenia automated: machines generating false patterns that humans then interpret as evidence (PAPER TRAIL Project, 2026a).

The VLM Solution

Vision-language models (AI systems that process images and text together, reading a page the way a human would) offer a partial solution. Unlike text-only OCR, a VLM can see that a form field is blank, that a header is boilerplate, that handwritten text differs from printed labels (PAPER TRAIL Project, 2026e). The VLM reprocessing script was designed specifically for documents that failed text-only OCR, and its ability to distinguish filled from empty fields directly addresses the OBS-5 class of error.

But VLM processing at scale introduces its own challenges. The model runs on a single RTX 4070 with 8 GB of video memory. Processing 2.1 million documents through the VLM would take months. The triage approach — using the VLM selectively on documents flagged by text-only OCR failures — is a practical compromise, but it means not every phantom entity will be caught by the better model (PAPER TRAIL Project, 2026e).
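The triage idea can be sketched as a simple filter over the text-layer output. This is an illustrative heuristic, not the actual logic of `08_vlm_reprocess.py`; the thresholds and the token pattern are assumptions:

```python
import re

def needs_vlm_pass(ocr_text: str, min_chars: int = 50,
                   max_junk_ratio: float = 0.3) -> bool:
    """Route a page to VLM reprocessing when its text-layer OCR looks
    like a failed extraction: near-empty, or dominated by tokens that
    are not plausible words, numbers, or ordinary punctuation."""
    text = ocr_text.strip()
    if len(text) < min_chars:
        return True  # near-empty text layer
    tokens = text.split()
    junk = sum(1 for t in tokens
               if not re.fullmatch(r"[A-Za-z0-9.,;:$%/()'&-]+", t))
    return junk / len(tokens) > max_junk_ratio
```

Pages that pass this filter stay with the cheap text-only pipeline; pages that fail it get the expensive VLM pass, which is where blank-field detection can catch the OBS-5 class of error.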

The OCR-degraded sender name ".EFFERY EOSTIEN" in Observation 1 belongs to the same class of error. The difference is that OBS-1 led to a false identification (Robert Crumb) that was caught by external corroboration, while OBS-5 and OBS-6 led to false geographic references caught by source document inspection (PAPER TRAIL Project, 2026f). Different detection mechanisms, same underlying problem: OCR errors propagating through automated extraction into findings that look real but are not.

The 2.38 million entities in the corpus include an unknown number of phantoms. Not all of them will be as obvious as "Poland" or "Krakow." The singleton problem — 197,945 entities appearing in only one document — likely includes thousands of OCR artifacts that will never be individually verified (PAPER TRAIL Project, 2026g). The Chao1 estimate (a statistical method for estimating total entities, borrowed from ecology) of 468,000 missing entities includes an unknown fraction that are not missing at all, because they never existed (PAPER TRAIL Project, 2026h).
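The Chao1 estimator itself is simple. In the sketch below, the singleton count comes from the project's singleton analysis, but the doubleton count in the usage lines is a placeholder for illustration, not a project figure:

```python
def chao1(observed: int, f1: int, f2: int) -> float:
    """Chao1 lower-bound estimate of total class richness.

    observed: distinct entities seen in the corpus
    f1: singletons (entities appearing in exactly one document)
    f2: doubletons (entities appearing in exactly two documents)
    """
    if f2 == 0:
        # Bias-corrected variant used when no doubletons are observed
        return observed + f1 * (f1 - 1) / 2.0
    return observed + (f1 * f1) / (2.0 * f2)

# Singleton count from the project data; the doubleton count here is
# a hypothetical placeholder, chosen only to show the calculation.
estimate = chao1(observed=2_380_000, f1=197_945, f2=42_000)
missing = estimate - 2_380_000  # implied never-observed entities
```

Because an unknown share of the singletons are OCR phantoms, the f1 term, and with it the missing-entity estimate, is inflated; that is the caveat the text describes.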

This is the cost of automation at scale. The pipeline processes what it can see. Sometimes what it sees is not there.

References

PAPER TRAIL Project. (2026a). Observations log (OBS-5 and OBS-6 retractions) [Observation log]. OBSERVATIONS.md.

PAPER TRAIL Project. (2026b). KYC form source document [Source document]. EFTA01268833.pdf, p. 17, Data Set 10.

PAPER TRAIL Project. (2026c). Entity database [Database table]. PostgreSQL entities, db=epstein_files.

PAPER TRAIL Project. (2026d). Known data quality issues [Project documentation]. CLAUDE.md.

PAPER TRAIL Project. (2026e). VLM reprocessing script [Computer software]. app/scripts/08_vlm_reprocess.py.

PAPER TRAIL Project. (2026f). OBS-1 degraded sender name and refutation [Observation log]. OBSERVATIONS.md.

PAPER TRAIL Project. (2026g). Singleton analysis [Data set]. _exports/validation/singleton_analysis.csv.

PAPER TRAIL Project. (2026h). Chao1 completeness summary [Data set]. _exports/validation/chao1_summary.json.