EP03: Ghosts in the Machine — When Machines Hallucinate

TL;DR

Episode 3 of PAPER TRAIL confronts the pipeline's most dangerous failure mode: phantom entities generated by OCR hallucination. Of the 25 highest-mention entities, 12 (48%) were phantoms: artifacts of blank forms, repeated boilerplate, and degraded handwriting that the pipeline mistook for real people, organizations, and locations (PAPER TRAIL Project, 2026a).

What the Episode Covers

EP03 opens with the 42.7% singleton rate and works backward to its cause. The pipeline's OCR stage — converting scanned PDFs and images to searchable text — introduces systematic errors that propagate through every downstream analysis. The episode categorizes four hallucination vectors: blank government forms with pre-printed field labels (e.g., "Persons/Non" extracted as an entity), repeated document headers and footers mangled across hundreds of pages, handwritten text on Deutsche Bank KYC forms producing unreadable output, and general image degradation from multi-generation photocopies (PAPER TRAIL Project, 2026a).

Two case studies anchor the episode. OBS-5 documented a finding that the word "Poland" appeared in Deutsche Bank compliance records, suggesting a geographic connection. Investigation revealed the source was EFTA01268833.pdf page 17 — a blank KYC form where the pre-printed field label "Persons/Non" was OCR-misread as "Poland." The form was stamped CLOSED 12/17/19 and contained no handwritten data. OBS-6 reported "Krakow" appearing in UBS account statements. The source was EFTA01275697.pdf — 190 pages of Ghislaine Maxwell's UBS records where the repeated header "Resource Management Account" was mangled into "Krakow" by the OCR engine (PAPER TRAIL Project, 2026b).
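The OBS-6 pattern, a mangled header repeated on page after page, can be screened for mechanically. The following is a minimal sketch, not the project's actual detection code: the function name `repeated_edge_lines`, its parameters, and the one-string-per-page input format are all assumptions made for illustration.

```python
from collections import Counter


def repeated_edge_lines(pages, edge_lines=2, min_page_share=0.5):
    """Return lines that recur at the top or bottom of many pages.

    pages: list of OCR page texts, one string per page (hypothetical input
    format). A line counts once per page, and only if it is among the first
    or last `edge_lines` non-empty lines of that page.
    """
    counts = Counter()
    for text in pages:
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        # A set avoids double counting when a short page's top and bottom overlap.
        edges = set(lines[:edge_lines] + lines[-edge_lines:])
        counts.update(edges)
    threshold = min_page_share * len(pages)
    return {line for line, n in counts.items() if n >= threshold}
```

Any entity extracted from a line in the returned set is a candidate phantom and worth checking against the page image. The sketch matches lines exactly, so a header that OCR mangles differently on each page would slip through; catching that case would need fuzzy matching.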

The Retractions

Both observations were retracted publicly, with full disclosure of how the error occurred and why it was not caught earlier. The episode presents the retraction as a feature of the methodology, not a failure. A pipeline that cannot detect its own hallucinations is unreliable. A pipeline that detects them, discloses them, and retracts them demonstrates the kind of self-correction that the Daubert standard, with its attention to testing and known error rates, is meant to reward (PAPER TRAIL Project, 2026b).

The morphing animation — showing "Persons/Non" transforming character by character into "Poland" — visualizes how a single OCR misread cascades through entity extraction, deduplication, and graph construction. By the time the phantom reaches the co-occurrence network, it appears as a real geographic entity with dozens of relationships to legitimate entities.

The VLM Solution

EP03 introduces the vision-language model (Qwen2.5-VL-7B running on a consumer RTX 4070 with 8 GB VRAM) as the mitigation layer. Unlike traditional OCR, a VLM processes the document as an image, understanding layout, distinguishing filled from blank fields, and reading handwritten text with higher fidelity. Script 08 reprocesses documents flagged as high-risk for OCR artifacts, providing a second opinion that can override the original transcription (PAPER TRAIL Project, 2026c).
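One way to picture the "second opinion" step is as a comparison of the entities extracted from the two transcriptions of the same page. The sketch below is an assumption for illustration only: the source does not describe how Script 08 decides an override, and `reconcile_entities` is a hypothetical helper that makes no call to any OCR or VLM library.

```python
def reconcile_entities(ocr_entities, vlm_entities):
    """Split entities by which transcription produced them.

    Both arguments are iterables of entity strings extracted from the same
    page (hypothetical inputs). Comparison ignores case; the original
    spelling is kept in the output.
    """
    ocr = {e.casefold(): e for e in ocr_entities}
    vlm = {e.casefold(): e for e in vlm_entities}
    return {
        # Seen by both readers: least likely to be an OCR artifact.
        "confirmed": sorted(ocr[k] for k in ocr.keys() & vlm.keys()),
        # Seen only by OCR: candidate phantoms for human review.
        "ocr_only": sorted(ocr[k] for k in ocr.keys() - vlm.keys()),
        # Seen only by the VLM: possible OCR misses, or VLM errors.
        "vlm_only": sorted(vlm[k] for k in vlm.keys() - ocr.keys()),
    }
```

Applied to the OBS-5 page, an OCR reading of "Poland" next to a VLM reading of "Persons/Non" would put "Poland" in `ocr_only`. Neither reader is treated as ground truth here; disagreement only routes the page to review.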

The episode is explicit about the VLM's limitations: it runs at roughly one-tenth the speed of batch OCR, requires GPU resources, and introduces its own error modes. It is a correction layer, not a replacement. The 29.5 million entity relationships computed in EP02 still rest on the original OCR foundation: the VLM catches errors but does not retroactively fix every downstream computation.

Why This Episode Matters

EP03 establishes the project's intellectual honesty standard. The retractions are presented before the findings, not after. The pipeline's failure modes are named, categorized, and demonstrated. The audience learns to distrust high-mention entities concentrated in contiguous document ranges — a heuristic that applies to any large-scale document analysis, not just this corpus. Every subsequent episode inherits this disclosure: the data is imperfect, the pipeline makes errors, and the methodology includes mechanisms to catch them.
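The contiguous-range heuristic can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the project's implementation: the `(entity, doc_id, page)` tuple schema, the function name `flag_concentrated_entities`, and all three thresholds are hypothetical and would need tuning on a real corpus.

```python
from collections import Counter, defaultdict


def flag_concentrated_entities(mentions, min_mentions=20,
                               min_share=0.9, min_density=0.8):
    """Flag high-mention entities clustered in one dense page run.

    mentions: iterable of (entity, doc_id, page_number) tuples
    (hypothetical schema). Returns a list of
    (entity, doc_id, first_page, last_page, total_mentions).
    """
    by_entity = defaultdict(list)
    for entity, doc_id, page in mentions:
        by_entity[entity].append((doc_id, page))

    flagged = []
    for entity, hits in by_entity.items():
        if len(hits) < min_mentions:
            continue  # low-mention entities are outside this heuristic
        doc_counts = Counter(doc_id for doc_id, _ in hits)
        top_doc, top_count = doc_counts.most_common(1)[0]
        # Share of all mentions that fall in the single most common document.
        share = top_count / len(hits)
        pages = sorted({page for doc_id, page in hits if doc_id == top_doc})
        # Fraction of pages in the first-to-last span that mention the entity.
        density = len(pages) / (pages[-1] - pages[0] + 1)
        if share >= min_share and density >= min_density:
            flagged.append((entity, top_doc, pages[0], pages[-1], len(hits)))
    return flagged
```

An entity that appears on nearly every page of one 190-page document and almost nowhere else would be flagged, while an entity spread thinly across many documents would not. A flag is a prompt to inspect the page images, not proof of a phantom: a genuine account holder can also dominate a single statement file.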

References

PAPER TRAIL Project. (2026a). EP03 slide content: OCR hallucination vectors, singleton analysis, phantom entity detection [Presentation]. communications/ep03_slides/

PAPER TRAIL Project. (2026b). Observations log: OBS-5 and OBS-6 retracted [Research document]. OBSERVATIONS.md

PAPER TRAIL Project. (2026c). VLM reprocessing pipeline [Computer software]. app/scripts/08_vlm_reprocess.py


This research is sponsored by Subthesis.