TLDR
A pipeline of 27+ Python scripts transforms 2.1 million raw government documents into a searchable PostgreSQL database with 2.38 million extracted entities, 29.5 million relationship pairs, 224 parsed wire transfers, and 2,894 FedEx shipments. The pipeline spans eight processing stages from metadata ingestion through cross-domain synthesis.
From Raw Files to Structured Intelligence
Two million documents in PDF and image format are useless for systematic analysis. You cannot search them, cross-reference them, or identify patterns across them without first converting them into structured data. That is what the processing pipeline does — it takes the raw output of government document releases and transforms it into a relational database where every entity, every wire transfer, every shipment, and every co-occurrence relationship can be queried in milliseconds.
The pipeline grew organically as analytical needs expanded. What started as a straightforward ingestion-and-OCR workflow (optical character recognition — the process of converting images of text into searchable text) eventually became 27+ scripts organized into eight processing stages: ingestion, OCR, automated name detection, graph construction, domain-specific parsing, advanced analysis, institutional forensics, and cross-domain synthesis (PAPER TRAIL Project, 2026a).
The Eight Stages
Stage 1: Ingestion. Script 01 reads every file path, identifier, and file type into the PostgreSQL documents table, creating the master record for all 2.1 million files (PAPER TRAIL Project, 2026a).
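The shape of that ingestion step can be sketched in a few lines. This is an illustrative sketch only: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL and a throwaway temporary directory as a stand-in for the corpus, and the column names are assumptions, not the project's actual schema.

```python
import os
import sqlite3
import tempfile

# Stand-in for the master documents table (sqlite3 in place of PostgreSQL;
# illustrative columns, not the project's actual schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, path TEXT, file_type TEXT)"
)

with tempfile.TemporaryDirectory() as corpus:
    # Fake a tiny corpus: two PDFs and one scanned image.
    for name in ("a.pdf", "b.pdf", "c.tif"):
        open(os.path.join(corpus, name), "w").close()

    # Walk every file and record its path and extension as the master record.
    for root, _dirs, files in os.walk(corpus):
        for name in sorted(files):
            path = os.path.join(root, name)
            ext = os.path.splitext(name)[1].lstrip(".").lower()
            conn.execute(
                "INSERT INTO documents (path, file_type) VALUES (?, ?)",
                (path, ext),
            )

count = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
print(count)  # prints 3
```

At real scale the same loop runs over 2.1 million files, but the structure is identical: one row per file, keyed by path and type.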
Stage 2: OCR. Scripts 02 and 03 convert PDFs and images into searchable markdown text. This is where the first quality challenges emerge — OCR engines produce phantom names from blank forms, mangle handwritten text, and generate garbled output from degraded documents (PAPER TRAIL Project, 2026b).
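One cheap way to catch the worst of that garbled output is a character-composition check: degraded scans tend to OCR into runs of punctuation and stray symbols rather than words. The heuristic below is illustrative, not the pipeline's actual quality filter, and the 0.6 threshold is an assumption.

```python
def looks_garbled(text: str, min_alpha_ratio: float = 0.6) -> bool:
    """Flag OCR output whose alphabetic-character ratio is suspiciously low.

    A crude quality heuristic (illustrative, not the project's actual check).
    """
    stripped = [c for c in text if not c.isspace()]
    if not stripped:
        return True  # empty output, e.g. from a blank or unreadable page
    alpha = sum(c.isalpha() for c in stripped)
    return alpha / len(stripped) < min_alpha_ratio

print(looks_garbled("Wire transfer confirmation for account ending 4471."))  # False
print(looks_garbled("|_ ~~ ##%% .. |||| ^^ {{ )) ** ::"))                    # True
```

A filter like this cannot detect phantom names, which read as perfectly clean text; those require the downstream deduplication and validation passes.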
Stage 3: Automated Name Detection. Script 04 runs spaCy, an open-source natural language processing library whose named-entity recognizer picks out the names of people, places, and organizations in text, across 2,046,260 documents using 8 parallel workers, extracting 2,383,898 entities: 57% organizations, 37% persons, 6% locations (PAPER TRAIL Project, 2026c). Script 04b then removes obvious duplicates.
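The extract-then-normalize shape of this stage can be shown with toy data. The real pass uses spaCy's trained statistical recognizer; the capitalized-word regex below is only a stand-in for it, and the normalization key is an assumed simplification of what a dedup pass like script 04b might use.

```python
import re
from collections import Counter

# Toy substitute for spaCy's NER: grab runs of two or more capitalized words.
# The real pipeline uses a trained statistical model, not this regex.
NAME_RUN = re.compile(r"\b(?:[A-Z][a-z]+)(?:\s+[A-Z][a-z]+)+\b")

def extract_candidates(text: str) -> list[str]:
    return NAME_RUN.findall(text)

def normalize(name: str) -> str:
    # Assumed dedup key: casefold and collapse internal whitespace.
    return " ".join(name.split()).casefold()

docs = [
    "Memo from Deutsche Bank to Southern Trust Company.",
    "Follow-up: Southern Trust Company invoice attached.",
]

counts = Counter()
for doc in docs:
    for cand in extract_candidates(doc):
        counts[normalize(cand)] += 1

print(counts)
```

The point of the normalization step is that "Southern Trust Company" and "SOUTHERN  TRUST COMPANY" should increment the same counter rather than become two entities.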
Stage 4: Graph Construction. Script 05 builds co-occurrence relationships — when two entities appear in the same document, they are linked. This produced 29.5 million unique relationship pairs, forming the raw material for network analysis (PAPER TRAIL Project, 2026d).
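The co-occurrence rule is simple enough to state exactly: every unordered pair of entities that shares a document becomes an edge, and repeated co-appearances increment that edge's weight. A minimal sketch with toy entity names:

```python
from collections import Counter
from itertools import combinations

# Toy document-to-entities mapping; real input is the entities table.
docs_to_entities = {
    "doc_001": {"E. Person", "Deutsche Bank", "Southern Trust"},
    "doc_002": {"E. Person", "Deutsche Bank"},
    "doc_003": {"Deutsche Bank", "Southern Trust"},
}

edges = Counter()
for entities in docs_to_entities.values():
    # sorted() makes pairs order-independent, so (A, B) and (B, A) merge.
    for a, b in combinations(sorted(entities), 2):
        edges[(a, b)] += 1

print(edges)
```

Across 2 million documents this produces the 29.5 million weighted pairs, where each weight answers "how many documents do these two entities share?"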
Stage 5: Domain-Specific Parsing. Script 11 parses 2,894 FedEx shipments from seized invoices (PAPER TRAIL Project, 2026e). Script 16 classifies 229,000 bank documents and extracts 224 wire transfers totaling $24.1 million (PAPER TRAIL Project, 2026f). Sub-scripts 16b through 16g use vision-language models (AI systems that can read images the way a human would, interpreting layout, handwriting, and degraded text) to recover missing fields from hard-to-read documents.
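A wire-transfer parse of this kind typically runs field-level patterns over the OCR text and hands anything incomplete to the vision pass. The patterns and field names below are assumptions for the sketch, not script 16's actual extraction rules.

```python
import re

# Illustrative field patterns; real bank documents need far more variants.
AMOUNT = re.compile(r"\$\s?([\d,]+(?:\.\d{2})?)")
DATE = re.compile(r"\b(\d{2}/\d{2}/\d{4})\b")

def parse_wire(text: str) -> dict:
    amount = AMOUNT.search(text)
    date = DATE.search(text)
    return {
        "amount": float(amount.group(1).replace(",", "")) if amount else None,
        "date": date.group(1) if date else None,
        # Incomplete records get escalated to a vision-language pass,
        # as the 16b-16g sub-scripts do for hard-to-read documents.
        "needs_vision_pass": amount is None or date is None,
    }

ocr_text = "OUTGOING WIRE 03/14/2019 AMOUNT: $1,250,000.00 ORIG: DB TRUST"
print(parse_wire(ocr_text))
```

The escalation flag is the interesting design choice: cheap regex handles the clean majority, and the expensive model only sees documents the regex could not finish.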
Stage 6: Advanced Analysis. Script 19 performs probabilistic entity resolution — a statistical matching method that determines whether two slightly different name records refer to the same real-world entity — using Splink software, reducing 2.38 million entities to 519,000 clusters (PAPER TRAIL Project, 2026g). Script 20 runs a temporal change-point detection algorithm (a method that identifies moments when document activity patterns suddenly shifted), finding 889 breakpoints (PAPER TRAIL Project, 2026h). Script 21 identifies 125,620 communities using the Leiden algorithm (a method for finding groups of entities that appear together more often than expected) and 535,318 structural hole brokers (entities that bridge otherwise disconnected groups) (PAPER TRAIL Project, 2026i). Script 22 scores 434,000 documents by how unusual their entity combinations are — documents with rare combinations of names score higher, flagging them for closer review (PAPER TRAIL Project, 2026j).
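The surprisal idea behind the document scoring can be illustrated with a toy corpus. The formula here, summed negative log probability of each entity the document mentions, is an assumed simplification, not script 22's actual model.

```python
from collections import Counter
from math import log2

# Toy corpus: each document maps to the entity mentions found in it.
docs = {
    "doc_A": ["FBI", "FBI", "Deutsche Bank"],        # common entities
    "doc_B": ["Obscure Shell LLC", "Rare Pilot"],    # rare combination
    "doc_C": ["FBI", "Deutsche Bank", "Deutsche Bank"],
}

# Corpus-wide entity frequencies give each entity a probability.
freq = Counter(e for ents in docs.values() for e in ents)
total = sum(freq.values())

def surprisal(entities: list[str]) -> float:
    # Rare entities carry more bits: -log2 p(entity), summed per document.
    return sum(-log2(freq[e] / total) for e in entities)

scores = {d: surprisal(ents) for d, ents in docs.items()}
print(max(scores, key=scores.get))  # prints doc_B
```

The document pairing two corpus-rare names outscores the ones full of frequent entities, which is exactly the review-prioritization behavior described above.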
Stage 7: Institutional Forensics. Script 18 runs six analytical modules: willful blindness detection (searching for compliance language that signals alerts being dismissed), ownership graph construction, compliance timeline analysis, spoliation estimation (detecting whether documents may have been destroyed), Granger causality testing (checking whether corporate formations preceded suspicious wire activity), and accountability matrix generation (PAPER TRAIL Project, 2026k).
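The willful-blindness search, at its simplest, looks for passages where an alert is raised and then dismissed in the same breath. The phrase lists below are illustrative assumptions, not the module's actual lexicon.

```python
import re

# Toy lexicons: an alert being raised, and an alert being waved off.
# Illustrative phrases only, not the forensics module's actual patterns.
ALERT = re.compile(r"(suspicious activity|red flag|unusual transaction)", re.I)
DISMISSAL = re.compile(r"(no further action|closed without|deemed acceptable)", re.I)

def flags_willful_blindness(passage: str) -> bool:
    # Both signals in one passage: an alert was seen, then set aside.
    return bool(ALERT.search(passage)) and bool(DISMISSAL.search(passage))

memo = ("Compliance noted a red flag on the account; "
        "after review the alert was closed without escalation.")
print(flags_willful_blindness(memo))  # True
```

A hit does not prove misconduct; it flags a passage where a human reviewer should look at how an alert was handled.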
Stage 8: Synthesis. Script 25b integrates outputs from all upstream scripts into a unified event registry of 232,083 events with 143,791 bridge rows linking entities to events across domains (PAPER TRAIL Project, 2026l). It produces 9 investigative leads, 3 Analysis of Competing Hypotheses matrices (structured frameworks for evaluating which explanation best fits the evidence), and 38 formal evidence chain nodes.
What Each Number Means
The numbers are not the point. What matters is what the pipeline makes possible that was not possible before.
The 29.5 million entity relationships let you ask: "Who appears in documents alongside this person, and how often?" (PAPER TRAIL Project, 2026d). The 889 temporal change-points reveal when document activity patterns shifted — and whether those shifts align with known events like arrests, bank account closures, or legislative actions (PAPER TRAIL Project, 2026h). The 519,000 entity clusters mean that eight different OCR-mangled spellings of the same pilot's name are recognized as one person (PAPER TRAIL Project, 2026g).
The 224 wire transfers, individually, are just rows in a table. Connected through the synthesis engine, they reveal a migration pattern from Deutsche Bank to TD Bank in early 2019, a $14.66 million flow through a single lawyer's trust account, and $2.65 million disbursed through the Butterfly Trust to women with Eastern European surnames for "tuition" (PAPER TRAIL Project, 2026f).
The Pipeline Is the Methodology
Every script in the pipeline uses peer-reviewed algorithms with known error rates, selected for Daubert admissibility (the legal standard that federal courts use to determine whether expert analysis is scientifically reliable enough to be presented as evidence) (PAPER TRAIL Project, 2026m). The entity resolution software implements a statistical matching method originally developed by Fellegi and Sunter for linking records across databases (Fellegi & Sunter, 1969). The change-point detection uses the PELT algorithm, with a cost function calibrated against 50 verified dates (PAPER TRAIL Project, 2026n). The community detection uses a tuned resolution parameter that controls how finely the network is divided into groups. These are not arbitrary choices — they are defensible parameters documented in calibration reports.
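The Fellegi-Sunter method is concrete enough to work through by hand. For each field, m is the probability the field agrees when two records are the same entity, and u is the probability it agrees by coincidence between different entities; agreement adds log2(m/u) to the match weight, disagreement adds log2((1-m)/(1-u)). The parameters below are toy values, not the project's Splink model.

```python
from math import log2

# Toy m/u parameters per field (illustrative, not the calibrated model).
params = {
    "surname":    {"m": 0.95, "u": 0.01},
    "first_name": {"m": 0.90, "u": 0.05},
    "dob":        {"m": 0.85, "u": 0.001},
}

def match_weight(agreements: dict[str, bool]) -> float:
    w = 0.0
    for field, p in params.items():
        if agreements[field]:
            w += log2(p["m"] / p["u"])          # agreement: evidence for a match
        else:
            w += log2((1 - p["m"]) / (1 - p["u"]))  # disagreement: evidence against
    return w

# Same surname and birth date, OCR-mangled first name: still a strong match.
w = match_weight({"surname": True, "first_name": False, "dob": True})
print(round(w, 2))
```

This is what lets eight OCR-mangled spellings collapse into one cluster: a single disagreeing field subtracts some weight, but agreement on distinctive fields like date of birth dominates the sum.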
The pipeline is not a black box. It is 27+ scripts, each with defined inputs and outputs, each producing exports that can be independently verified. That transparency is the point.
References
Fellegi, I. P., & Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association, 64(328), 1183–1210.
PAPER TRAIL Project. (2026a). Full script listing with descriptions [Project documentation]. CLAUDE.md
PAPER TRAIL Project. (2026b). OCR hallucination analysis (OBS-5, OBS-6) [Observation log]. OBSERVATIONS.md
PAPER TRAIL Project. (2026c). Entity extraction results [Database table]. PostgreSQL entities table, 2,383,898 rows, db=epstein_files.
PAPER TRAIL Project. (2026d). Entity relationship graph [Database table]. PostgreSQL entity_relationships table, 29.5M unique pairs, db=epstein_files.
PAPER TRAIL Project. (2026e). FedEx shipment records [Database table]. PostgreSQL fedex_shipments table, 2,894 rows, db=epstein_files.
PAPER TRAIL Project. (2026f). Wire transfer records [Database table]. PostgreSQL wire_transfers table, 224 rows, db=epstein_files.
PAPER TRAIL Project. (2026g). Splink entity resolution results [Data set]. _exports/entity_resolution/
PAPER TRAIL Project. (2026h). PELT temporal change-point detection results [Data set]. _exports/temporal/
PAPER TRAIL Project. (2026i). Leiden community detection and structural hole analysis [Data set]. _exports/network/
PAPER TRAIL Project. (2026j). Document surprisal scoring [Data set]. _exports/surprisal/
PAPER TRAIL Project. (2026k). Institutional forensics analysis [Computer software]. app/scripts/18_institutional_analysis.py
PAPER TRAIL Project. (2026l). Cross-domain synthesis results [Data set]. _exports/synthesis/
PAPER TRAIL Project. (2026m). Validation and Daubert admissibility framework [Research document]. research/VALIDATION.md
PAPER TRAIL Project. (2026n). Calibration methodology and parameter documentation [Research document]. research/CALIBRATION.md
PAPER TRAIL Project. (2026o). Pipeline DAG visualization [Visualization]. visualizations/ppt_ep02_slide04_pipeline_dag.svg