Download and Verify

TLDR

Script 27, the corpus audit script, checks the entire corpus in about 17 seconds, cross-referencing 2.1 million database records against files on disk across 17 sources. Four-tier overlap detection (identifier, filename, size+type, and cryptographic hash) catches duplicates before re-ingestion. Gap analysis identified the 3rd Batch phone logs (8,544 documents, of which only a 6-page sampler has been obtained) as the largest acquisition gap (PAPER TRAIL Project, 2026a).

The Problem of Trust

When you download 331 GB of government documents from multiple sources over multiple months, three questions arise immediately. First: did everything arrive? Second: did anything arrive twice? Third: what is still missing? (PAPER TRAIL Project, 2026a).

Script 27 answers all three in a single pass. It comprises five modules — corpus inventory, new download scanning, overlap detection, gap analysis, and report generation — that together produce a complete audit of the corpus state. The full run takes approximately 17 seconds (PAPER TRAIL Project, 2026a).

Corpus Inventory

The inventory module counts every file by source, comparing what the database says should exist against what actually sits on disk. This sounds trivial until you encounter Data Set 10 (PAPER TRAIL Project, 2026a).

DS10 has 950,921 page-level records in the database but only 504,000 distinct files on disk. The discrepancy is not data loss. Deutsche Bank documents are multi-page PDFs: the database creates a record for each page (for page-level analysis), while the disk holds one file per PDF. Without accounting for this distinction, the inventory would report roughly 447,000 missing files that never existed as separate documents (PAPER TRAIL Project, 2026a).

The fix was a deduplication clause in the inventory query that counts each file path only once. Without it, the disk size sum for DS10 was inflated from 81.1 GB to 14.3 TB — a factor-of-180 error caused by counting the same file once per page. This is the kind of bug that looks catastrophic in a report and trivial in a query, and it is exactly why automated auditing matters (PAPER TRAIL Project, 2026a).
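The effect of the deduplication clause can be reproduced with a small in-memory example. This is a sketch only: the table name `documents` and the columns `source`, `file_path`, `page_no`, and `file_size` are assumptions made for illustration, not the project's actual schema, and the query in Script 27 may be written differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "source TEXT, file_path TEXT, page_no INTEGER, file_size INTEGER)"
)

# Hypothetical data: one 3-page PDF and one 1-page PDF, stored as
# page-level records that each carry the size of the whole file.
rows = [
    ("DS10", "ds10/a.pdf", 1, 3000),
    ("DS10", "ds10/a.pdf", 2, 3000),
    ("DS10", "ds10/a.pdf", 3, 3000),
    ("DS10", "ds10/b.pdf", 1, 500),
]
conn.executemany("INSERT INTO documents VALUES (?, ?, ?, ?)", rows)

# Naive sum: adds each file's size once per page record.
naive_bytes = conn.execute(
    "SELECT SUM(file_size) FROM documents WHERE source = 'DS10'"
).fetchone()[0]

# Deduplicated: collapse to one row per file path before counting and summing.
file_count, dedup_bytes = conn.execute(
    """
    SELECT COUNT(*), SUM(file_size)
    FROM (
        SELECT file_path, MAX(file_size) AS file_size
        FROM documents
        WHERE source = 'DS10'
        GROUP BY file_path
    )
    """
).fetchone()

print(naive_bytes, file_count, dedup_bytes)
```

In this toy data the naive sum is 9,500 bytes across four records, while the deduplicated query reports two files totalling 3,500 bytes. The same mechanism, applied to files with many pages, produces the kind of inflation described above.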

Data Set 9, by contrast, shows a perfect disk-to-database match: 531,256 files on both sides, no discrepancy. DS15 is the largest source by disk size at 115.4 GB, followed by DS9 at 94.5 GB and DS10 at 81.1 GB (after deduplication) (PAPER TRAIL Project, 2026b).

Four-Tier Overlap Detection

When new documents arrive — downloaded from House Oversight, obtained from court records, or extracted from FOIA responses — they must be checked against the existing corpus before ingestion. Duplicating a document corrupts entity counts, inflates relationship networks, and creates phantom co-occurrences (PAPER TRAIL Project, 2026a).

Script 27 detects overlaps at four tiers, running from cheap metadata checks to a definitive content hash:

T0 (Identifier match): If the new file has an EFTA identifier, DOJ-OGR prefix, or other standardized document ID that already exists in the database, it is an exact match. This catches formal duplicates between institutional releases (PAPER TRAIL Project, 2026a).

T1 (Filename match): If the filename matches an existing record, it is likely the same document from a different source. This catches cross-source duplicates where the same file was released by both DOJ and House Oversight (PAPER TRAIL Project, 2026a).

T2 (Size + type match): If a file has the same byte size and file type as an existing record, it may be the same document with a different name. This catches renamed duplicates (PAPER TRAIL Project, 2026a).

T3 (SHA-256 hash match): The definitive tier. SHA-256 is a cryptographic fingerprint: if two files produce the same hash, they are for all practical purposes byte-for-byte identical, regardless of name, location, or source (PAPER TRAIL Project, 2026a).

The tiered approach is necessary because computing SHA-256 for 2.1 million files is expensive. Tiers 0 through 2 eliminate obvious duplicates quickly. Tier 3 is reserved for ambiguous cases and is run with the --hash flag (PAPER TRAIL Project, 2026a).
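The tier logic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in Script 27: the `Record` fields, the `classify` function, and the `use_hash` parameter (standing in for the --hash flag) are hypothetical names, and the choice to hash only files that already match on size and type is one plausible reading of "reserved for ambiguous cases".

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    doc_id: Optional[str]         # standardized document ID, if the file has one
    filename: str
    size: int                     # size in bytes
    ext: str                      # file type, e.g. "pdf"
    sha256: Optional[str] = None  # only populated when hashing is enabled


def sha256_bytes(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def classify(new: Record, corpus: List[Record], use_hash: bool = False) -> Optional[str]:
    """Return the tier at which `new` overlaps the corpus, or None if it looks new."""
    # T0: identifier already present in the corpus.
    if new.doc_id is not None and any(r.doc_id == new.doc_id for r in corpus):
        return "T0"
    # T1: same filename as an existing record.
    if any(r.filename == new.filename for r in corpus):
        return "T1"
    # T2: same byte size and file type (a candidate, not a confirmation).
    candidates = [r for r in corpus if r.size == new.size and r.ext == new.ext]
    if not candidates:
        return None
    # T3: identical files must have identical sizes, so only the T2
    # candidates need to be compared by hash.
    if use_hash and new.sha256 is not None:
        if any(r.sha256 == new.sha256 for r in candidates):
            return "T3"
    return "T2"


# Hypothetical two-record corpus; "DOC-0001" is a placeholder identifier.
corpus = [
    Record("DOC-0001", "will.pdf", 1000, "pdf", sha256_bytes(b"will")),
    Record(None, "book.pdf", 2000, "pdf", sha256_bytes(b"book")),
]
renamed = Record(None, "renamed.pdf", 2000, "pdf", sha256_bytes(b"book"))

print(classify(renamed, corpus))                 # size + type candidate
print(classify(renamed, corpus, use_hash=True))  # confirmed by hash
```

Because the cheap checks run first, the hash comparison only ever touches the small set of files that survive them, which is the point of the tiering.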

When House Oversight estate records were downloaded, the overlap detector confirmed that Request No. 1 (the 238-page birthday book) and Request No. 2 (the 10-page last will) already existed in the database at T0 level. Request No. 4 (the Non-Prosecution Agreement) and Request No. 8 (the contact book) were genuinely new — flagged for ingestion (PAPER TRAIL Project, 2026c).

File Size Backfill

The database originally lacked file size data for many records. The backfill operation populated 1,503,016 rows — 100,000 in an initial pass and 1,403,016 in a follow-up sweep — leaving only 16 rows with missing file sizes. Those 16 are failed DS1 documents where the source files are missing from disk (PAPER TRAIL Project, 2026a).
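A backfill of this kind can be sketched as follows. The table and column names (`documents`, `id`, `file_path`, `file_size`) and the function `backfill_file_sizes` are assumptions for illustration; the project's actual backfill ran in two passes over a much larger table and may differ in detail.

```python
import os
import sqlite3
import tempfile
from typing import Tuple


def backfill_file_sizes(conn: sqlite3.Connection) -> Tuple[int, int]:
    """Fill NULL file_size values from disk; return (updated, still_missing)."""
    updated = 0
    missing = 0
    rows = conn.execute(
        "SELECT id, file_path FROM documents WHERE file_size IS NULL"
    ).fetchall()
    for doc_id, path in rows:
        try:
            size = os.path.getsize(path)
        except OSError:
            missing += 1  # source file absent from disk; leave the row NULL
            continue
        conn.execute(
            "UPDATE documents SET file_size = ? WHERE id = ?", (size, doc_id)
        )
        updated += 1
    conn.commit()
    return updated, missing


# Toy run: one file that exists on disk and one that does not.
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "present.pdf")
    with open(present, "wb") as fh:
        fh.write(b"12345")
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        "id INTEGER PRIMARY KEY, file_path TEXT, file_size INTEGER)"
    )
    conn.executemany(
        "INSERT INTO documents (file_path) VALUES (?)",
        [(present,), (os.path.join(tmp, "gone.pdf"),)],
    )
    result = backfill_file_sizes(conn)
    sizes = conn.execute("SELECT file_size FROM documents ORDER BY id").fetchall()

print(result, sizes)
```

Rows whose files cannot be found stay NULL rather than being set to zero, so they remain visible as a residue, much like the 16 DS1 rows mentioned above.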

File sizes matter for overlap detection (T2 tier), storage planning, and anomaly detection. A PDF that should be 50 pages but weighs 200 bytes is either corrupted or redacted to near-emptiness. The backfill made these checks possible across the full corpus.
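One simple form of the anomaly check is a bytes-per-page floor. The sketch below is hypothetical: the function name and the 1,024-byte threshold are arbitrary choices for illustration, not values taken from Script 27.

```python
def suspiciously_small(file_size: int, page_count: int,
                       min_bytes_per_page: int = 1024) -> bool:
    """Flag a file whose size is implausibly low for its page count."""
    return file_size < page_count * min_bytes_per_page


print(suspiciously_small(200, 50))        # the 200-byte, 50-page case
print(suspiciously_small(2_000_000, 50))  # an ordinary 2 MB, 50-page PDF
```

A flagged file is only a candidate for review; whether it is corrupted or heavily redacted still has to be established by opening it.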

Gap Analysis

The gap analysis module cross-references 10 House Oversight Committee sources against the database and disk (PAPER TRAIL Project, 2026d). The findings are sobering.

DOJ-OGR identifiers in source_id=15 total 33,655 documents, slightly exceeding the expected 33,295. This suggests that House Oversight Source 3 (the DOJ 33K page release) is likely complete in the corpus (PAPER TRAIL Project, 2026d).
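The comparison behind this finding is simple arithmetic on the two counts quoted above. The helper below is a hypothetical sketch (the function name and the returned fields are assumptions), included only to make the size of the surplus explicit.

```python
def coverage(expected: int, found: int) -> dict:
    """Compare an observed document count against the expected count."""
    return {
        "surplus": found - expected,
        "ratio": found / expected,
        "status": "at or above expected" if found >= expected else "below expected",
    }


doj_ogr = coverage(expected=33_295, found=33_655)
print(doj_ogr)
```

The surplus is 360 documents, about 1.1 percent above the expected total. A count-level match of this kind cannot by itself rule out that some expected identifiers are absent while others appear more than once, which is why the conclusion is stated as likely rather than certain.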

The largest identified gap is the 3rd Batch phone logs: 8,544 documents spanning Epstein's 2002-2005 phone records. Only a 6-page sampler has been obtained. The full set — who Epstein called, who called him, when, and how often — remains unacquired (PAPER TRAIL Project, 2026d).

Other gaps include approximately 20,000 pages of further estate documents, the Clinton deposition transcripts, and Democratic committee email correspondence. Three Google Drive folders from House Oversight Sources 2, 3, and 4 have been identified but not yet downloaded (PAPER TRAIL Project, 2026d).

Why This Matters

Corpus integrity is not merely a technical concern. It is an evidentiary one. Every claim in the PAPER TRAIL series traces back to database queries. If the database contains duplicates, entity counts are inflated. If it contains gaps, evidence is missing. If new documents are ingested without overlap checking, the same wire transfer could be counted twice.

Script 27 runs in 17 seconds and produces a full markdown report plus four CSV exports (PAPER TRAIL Project, 2026a). It is the difference between trusting the corpus and knowing the corpus. In a project where every number gets published, that distinction matters.

References

PAPER TRAIL Project. (2026a). Corpus audit and gap analysis (Script 27) [Software]. app/scripts/27_corpus_audit.py

PAPER TRAIL Project. (2026b). Corpus inventory — file counts and disk sizes by source [Data]. _exports/audit/corpus_inventory.csv

PAPER TRAIL Project. (2026c). Overlap detection results — estate records [Data]. _exports/audit/overlap_results.csv

PAPER TRAIL Project. (2026d). Gap analysis — House Oversight sources [Data]. _exports/audit/gap_analysis.csv

PAPER TRAIL Project. (2026e). Corpus audit report [Data]. _exports/audit/corpus_audit_*.md