Single PC Pipeline

TLDR

The entire 2.1-million-document Epstein corpus was processed on a single Windows PC: an Intel i9-13950HX with 24 cores, an NVIDIA RTX 4070 with 8 GB of video memory, and 64 GB of RAM, running PostgreSQL 16. No cloud services, no distributed computing, no paid API calls. Every entity extracted, every wire parsed, every flight log read by machine happened on one desk (PAPER TRAIL Project, 2026a).

The Machine

The specification reads like a gaming laptop, not a forensic analysis platform. An Intel i9-13950HX provides 24 cores and 32 threads for CPU-bound work. An NVIDIA GeForce RTX 4070 with 8 GB of video memory handles vision-language model inference (a type of AI that can read both text and images). Sixty-four gigabytes of RAM supports the database. The operating system is Windows 11 Pro. The storage holds approximately 331 GB of source documents across 17 database sources (PAPER TRAIL Project, 2026a; PAPER TRAIL Project, 2026b).

This is a consumer-grade machine. It costs less than a single month of cloud computing would at the scale required to process 2.1 million documents. The deliberate choice to run everything locally was not about cost, though. It was about control.

Why Not the Cloud

No data leaves the local system. The Epstein corpus contains government-released documents that include victim names, home addresses, financial account numbers, and Deutsche Bank portal credentials (PAPER TRAIL Project, 2026c). Uploading this material to any cloud provider would create copies outside the researcher's control, subject to the provider's data retention policies, law enforcement requests, and employee access controls.

The single-machine constraint eliminates this entire category of risk. The documents exist in one place. The database exists in one place. The analytical outputs exist in one place. There is no cloud storage bucket to misconfigure, no API key to leak, no cloud provider subpoena to comply with.

What Runs on CPU

The CPU handles the heavy parallel workloads. Automated name extraction (using the spaCy natural language processing library with its large English model) runs across 8 parallel workers, processing 2,046,260 documents and producing 2,383,898 entities (PAPER TRAIL Project, 2026d). The 24-core processor keeps all 8 workers fed without bottlenecking on context switching.
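The 8-worker layout described above can be sketched as a chunked parallel map. This is a minimal, self-contained illustration: the real pipeline runs the spaCy en_core_web_lg pipeline inside each worker, which is replaced here by a crude capitalized-bigram matcher so the sketch runs without spaCy installed. All function names are hypothetical.

```python
import re
from concurrent.futures import ProcessPoolExecutor

# Stand-in for the spaCy en_core_web_lg model the project actually uses:
# a crude capitalized-bigram matcher, so the sketch stays self-contained.
NAME_RE = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")

def extract_names(doc_text: str) -> list[str]:
    """Return candidate person-name strings from one document."""
    return NAME_RE.findall(doc_text)

def extract_batch(docs: list[str]) -> list[list[str]]:
    """One worker processes its chunk of documents sequentially."""
    return [extract_names(d) for d in docs]

def parallel_extract(docs: list[str], workers: int = 8) -> list[list[str]]:
    """Split the corpus into `workers` interleaved chunks and process
    them in parallel, mirroring the 8-worker layout on a 24-core CPU."""
    chunks = [docs[i::workers] for i in range(workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(extract_batch, chunks)
    return [names for chunk in results for names in chunk]
```

With 8 worker processes on a 24-core CPU, each worker gets dedicated cores with headroom left for the database and the operating system, which is why the workers stay fed.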

Date extraction, entity deduplication, co-occurrence graph construction, and temporal analysis all run on CPU. The entity resolution tool (Splink, a software tool that uses statistics to decide which database records refer to the same person) uses a specialized analytical database for its scoring, reducing 2.38 million raw entities to 519,000 unified groups (PAPER TRAIL Project, 2026e). Change-point detection (which finds sudden shifts in document activity over time), community detection (which groups related entities), and broker analysis (which identifies entities bridging gaps between groups) are all CPU-bound operations that benefit from the high thread count (PAPER TRAIL Project, 2026f; PAPER TRAIL Project, 2026g).
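The change-point detection cited above uses the PELT algorithm (per the project's Script 20 reference). As a much simpler illustration of the idea, the sketch below flags points in a document-count time series where the mean shifts sharply between adjacent windows; it is a pedagogical stand-in, not the PELT implementation the pipeline runs.

```python
def mean(xs: list[float]) -> float:
    return sum(xs) / len(xs)

def change_points(series: list[float], window: int = 5,
                  threshold: float = 10.0) -> list[int]:
    """Flag index i as a change-point when the mean of the `window`
    values after i differs from the mean of the `window` values before
    i by more than `threshold`. Neighboring indices around one real
    shift all get flagged; PELT instead finds an optimal segmentation."""
    points = []
    for i in range(window, len(series) - window):
        before = mean(series[i - window:i])
        after = mean(series[i:i + window])
        if abs(after - before) > threshold:
            points.append(i)
    return points
```

Applied to a per-day document count, a flagged index marks a day where filing or activity volume jumps, which is the kind of temporal breakpoint the pipeline's 889 detected change-points represent.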

The observation extraction pipeline (Script 28) runs a 7-billion-parameter language model through the local inference server on the GPU, but its scoring, deduplication, and database operations are CPU-bound. At approximately 0.5 documents per minute (roughly 17 hours for a 500-document batch), it processes 500 scored documents overnight (PAPER TRAIL Project, 2026a).

What Runs on GPU

The RTX 4070's 8 GB of video memory is the tightest constraint in the system. It runs Qwen2.5-VL-7B (a vision-language model) for tasks that require understanding both text and page layout: extracting text from handwritten flight logs, reading degraded bank documents, and reprocessing scanning failures (PAPER TRAIL Project, 2026h).

The results justify the constraint. Vision-language model processing extracted 4,286 flights from unredacted handwritten logs at zero errors, recovered 47 wire transfer dates and 44 originator names from degraded Deutsche Bank documents, and reprocessed 70 regulatory forms to recover 36 wire recipients worth $1.47 million (PAPER TRAIL Project, 2026h; PAPER TRAIL Project, 2026i). All of this ran on a consumer GPU that retails for approximately $550.

The video memory limitation creates stability issues under sustained load. The local inference server crashes approximately every 10 to 20 calls due to memory pressure. The pipeline handles this with automatic restart logic and a resume flag that skips already-processed documents. It is not elegant. It works (PAPER TRAIL Project, 2026a).
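The restart-and-resume pattern described above can be sketched as a small driver loop. This is a hedged illustration, not the project's code: `process_one` and `is_done` are hypothetical callables standing in for the real VLM call and the database completion check, and the exception type standing in for a server crash is assumed.

```python
def process_with_resume(doc_ids, process_one, is_done, max_retries: int = 3):
    """Crash-tolerant driver: skip documents already marked done (the
    resume-flag behavior), and retry the rest after a failure, mirroring
    the pipeline's restart-on-crash loop around the inference server."""
    for doc_id in doc_ids:
        if is_done(doc_id):           # resume: skip already-processed work
            continue
        for _ in range(max_retries):
            try:
                process_one(doc_id)
                break
            except RuntimeError:      # stand-in for an inference-server crash
                # the real pipeline restarts the inference server here
                continue
```

Because completion is checked against durable state rather than in-memory progress, the whole script can also be killed and relaunched and will pick up where it left off, which is exactly what makes the inelegant crash-every-10-to-20-calls behavior survivable.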

PostgreSQL as the Backbone

PostgreSQL 16 serves as the single master database with tuned parameters: 16 GB reserved for the database's own cache and 48 GB signaled to the query planner as available through the operating system's file cache. This leaves 16 GB of RAM for the operating system, scripts, and GPU operations — a tight but functional allocation (PAPER TRAIL Project, 2026a).
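In postgresql.conf terms, the two figures above correspond to PostgreSQL's standard `shared_buffers` and `effective_cache_size` parameters. Only those two values come from the text; any other tuning the project applies is not documented here.

```ini
# postgresql.conf fragment matching the allocation described above.
shared_buffers = 16GB          # PostgreSQL's own buffer cache
effective_cache_size = 48GB    # planner hint: OS file cache it may assume
```

`effective_cache_size` allocates nothing; it only tells the query planner how much cached data it can expect, which biases it toward index scans on a machine where most hot pages live in the OS file cache.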

A fast full-text search system built into the database (GIN-indexed tsvector columns) enables sub-second queries across the entire corpus. The willful blindness linguistic analysis in the institutional forensics module runs seven category searches against the full document text using these indexes. The corpus search agent provides eight subcommands that all resolve to database queries (PAPER TRAIL Project, 2026j; PAPER TRAIL Project, 2026k).
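The GIN-indexed tsvector setup described above looks roughly like the following. The table and column names are hypothetical; the mechanics (a stored generated `tsvector` column, a GIN index over it, and `@@` match queries) are standard PostgreSQL full-text search.

```sql
-- Hypothetical table/column names; the tsvector + GIN mechanics are standard.
ALTER TABLE documents
    ADD COLUMN tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('english', coalesce(body, ''))) STORED;

CREATE INDEX documents_tsv_idx ON documents USING GIN (tsv);

-- Sub-second phrase search across the corpus:
SELECT id
FROM documents
WHERE tsv @@ websearch_to_tsquery('english', '"wire transfer"');
```

Because the tsvector is precomputed and indexed, each of the seven willful-blindness category searches is an index lookup rather than a scan over 2.1 million document bodies.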

The database holds 2,100,266 document records, 2,383,751 entities, 29.5 million entity relationship pairs, 224 wire transfers, 2,894 FedEx shipments, 229,000 classified bank documents, 889 temporal change-points, 232,083 synthesis events, and 143,791 entity-event bridge rows. All on localhost. All queryable in milliseconds (PAPER TRAIL Project, 2026a).

27 Scripts, One Machine

The pipeline encompasses 27-plus Python scripts covering metadata ingestion, text scanning, image transcription, name extraction, entity deduplication, graph construction, temporal density estimation, media transcription, vision-language model reprocessing, stratified export, date extraction, FedEx parsing, FedEx analysis, congressional vetting, letter generation, enclosure generation, bank record classification, institutional forensics, entity resolution, change-point detection, network topology, document prioritization scoring, completeness validation, cross-domain synthesis, corpus search, claim verification, and corpus audit (PAPER TRAIL Project, 2026a).

Each script reads from and writes to the same PostgreSQL instance. Each runs on the same machine. The entire analytical chain from raw PDF to cross-domain synthesis is reproducible by anyone with this hardware and the public corpus. No proprietary models. No cloud dependencies. No black boxes.

The point is not that this is the optimal architecture. A cluster would be faster. Cloud GPUs would handle larger models. The point is that it is sufficient. Forensic-scale document analysis — 2.1 million files, 2.38 million entities, 29.5 million relationships — does not require institutional infrastructure. It requires one machine, open-source tools, and the willingness to let it run overnight.

References

PAPER TRAIL Project. (2026a). Project configuration and database specification [Data]. CLAUDE.md

PAPER TRAIL Project. (2026b). Corpus audit — disk size across 17 sources (Script 27) [Data]. _exports/audit/corpus_inventory.csv

PAPER TRAIL Project. (2026c). Observations — OBS-2: Deutsche Bank portal credentials [Data]. OBSERVATIONS.md

PAPER TRAIL Project. (2026d). Named entity recognition — 2,046,260 documents, 2,383,898 entities (Script 04) [Data]. Database: epstein_files

PAPER TRAIL Project. (2026e). Entity resolution — 519,000 clusters (Script 19) [Software]. app/scripts/19_entity_resolution.py

PAPER TRAIL Project. (2026f). PELT change-point detection — 889 breakpoints (Script 20) [Data]. _exports/temporal/

PAPER TRAIL Project. (2026g). Network topology — Leiden communities and broker analysis (Script 21) [Data]. _exports/network/

PAPER TRAIL Project. (2026h). VLM flight log extraction — 4,286 flights, zero errors (Script 16f) [Data]. Database: epstein_files

PAPER TRAIL Project. (2026i). VLM wire recovery — 47 dates, 44 originators (Script 16g) [Data]. Database: epstein_files

PAPER TRAIL Project. (2026j). Institutional forensics — willful blindness module (Script 18) [Software]. app/scripts/18_institutional_analysis.py

PAPER TRAIL Project. (2026k). Corpus search agent (Script 25) [Software]. app/scripts/25_corpus_search.py