TLDR
Seven production Scalable Vector Graphics (SVG) images — a type of image that stays sharp at any size — were created at 1920x1080 resolution with a consistent dark theme for EP02 "The Pipeline," visualizing the processing pipeline, entity type distribution, Named Entity Recognition (NER) coverage gaps, Chao1 completeness metrics, and dataset saturation. These vector graphics serve as the visual backbone of the series, encoding analytical outputs into presentation-ready formats (PAPER TRAIL Project, 2026).
Why Scalable Vector Graphics
When a data visualization has to work in a PowerPoint slide, a web page, a PDF export, and a screen recording, the format matters. Raster formats (PNG, JPEG) store a fixed grid of pixels, so they blur or pixelate when scaled beyond their native size. Scalable Vector Graphics (SVG) describes shapes and text geometrically and is re-rendered at whatever resolution the display needs, keeping text sharp and lines clean whether projected on a conference room screen or viewed on a phone.
The seven EP02 production SVGs were designed at 1920x1080 pixels (standard HD presentation resolution) with a dark theme: dark background, light text, accent colors for data categories. The dark theme serves two purposes: it reduces visual fatigue during long presentations, and it creates high contrast for the data elements that carry meaning (PAPER TRAIL Project, 2026).
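As a rough illustration of that setup, the sketch below builds an empty 1920x1080 canvas with a dark background and light title text using Python's standard library. The hex colors, the make_canvas helper, and the text position are assumptions for illustration; the source specifies only the resolution and the general dark theme, not the actual palette or tooling.

```python
from xml.etree import ElementTree as ET

# Hypothetical palette: the source says only "dark background, light text,
# accent colors", so these hex values are placeholders.
THEME = {"background": "#111827", "text": "#F9FAFB", "accent": "#38BDF8"}


def make_canvas(title: str, width: int = 1920, height: int = 1080) -> ET.Element:
    """Return an <svg> root with a full-size background rect and a title."""
    svg = ET.Element("svg", {
        "xmlns": "http://www.w3.org/2000/svg",
        "width": str(width),
        "height": str(height),
        # viewBox lets the drawing scale to any display size.
        "viewBox": f"0 0 {width} {height}",
    })
    ET.SubElement(svg, "rect", {
        "width": "100%", "height": "100%", "fill": THEME["background"],
    })
    label = ET.SubElement(svg, "text", {
        "x": "96", "y": "120", "fill": THEME["text"],
        "font-size": "56", "font-family": "sans-serif",
    })
    label.text = title
    return svg


canvas = make_canvas("EP02: The Pipeline")
svg_text = ET.tostring(canvas, encoding="unicode")
```

Keeping the palette in one dictionary is one way to get the shared look across files: every chart reads its colors from the same place.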
The Seven Visualizations
Pipeline DAG (ppt_ep02_slide04_pipeline_dag.svg). The 16-script processing pipeline rendered as a directed acyclic graph (a flowchart where data moves in one direction without looping back) across eight processing stages. Scripts are nodes, data flows are edges, and the entire graph appears in a "dimmed" state — representing the pipeline before execution. This is the "before" image; subsequent slides show the pipeline illuminating as each stage completes (PAPER TRAIL Project, 2026).
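To make the nodes-and-edges idea concrete, here is a minimal sketch that groups scripts into stages by dependency order using Python's graphlib module (Python 3.9+). The six script names and their dependencies are invented for illustration; the real pipeline has 16 scripts across eight stages, and its actual dependency graph is not given in this text.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: script -> scripts it must wait for.
# These names are placeholders, not the project's real scripts.
deps = {
    "01_ingest": set(),
    "02_ocr": {"01_ingest"},
    "03_clean": {"02_ocr"},
    "04_ner": {"03_clean"},
    "05_dedupe": {"03_clean"},
    "06_export": {"04_ner", "05_dedupe"},
}


def stages(graph):
    """Group nodes into layers; each layer depends only on earlier layers."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a loop
    layers = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        layers.append(ready)
        sorter.done(*ready)
    return layers


layers = stages(deps)
```

Each inner list is one column of the diagram: scripts in the same layer have no dependency on each other and could sit side by side.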
Entity Donut (ppt_ep02_slide10_entity_donut.svg). A three-segment donut chart showing entity type distribution: Organization 57%, Person 37%, Location 6%. The center annotation displays the 1.55:1 organization-to-person ratio — a number that tells its own story. In a corpus about a criminal network built around a person, organizations outnumber people by more than three to two. The shell companies, the trusts, the LLCs — they are the majority (PAPER TRAIL Project, 2026).
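The geometry behind a donut like this can be sketched as follows: each segment takes a share of the circle's circumference, and each starts where the previous one ended. The radius and the donut_segments helper are assumptions for illustration. Note that the rounded percentages give a ratio of about 1.54, so the published 1.55:1 presumably comes from unrounded counts.

```python
import math

# Rounded shares quoted in the text (percent of entities).
shares = {"Organization": 57, "Person": 37, "Location": 6}


def donut_segments(shares, radius=300.0):
    """Return arc length and starting offset for each donut segment."""
    total = sum(shares.values())
    circumference = 2 * math.pi * radius
    segments = []
    start = 0.0
    for label, value in shares.items():
        length = circumference * value / total
        # In SVG, a segment can be drawn as a circle stroke with
        # stroke-dasharray="length gap" and stroke-dashoffset=-offset.
        segments.append({
            "label": label,
            "length": length,
            "gap": circumference - length,
            "offset": start,
        })
        start += length
    return segments


segments = donut_segments(shares)
ratio_from_rounded = round(shares["Organization"] / shares["Person"], 2)
```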
NER Coverage (ppt_ep02_slide15_ner_coverage.svg). Fifteen horizontal bars representing NER processing coverage by dataset. Most bars show high coverage. Two bars — DS9 and DS11 — are highlighted in red at 0%, representing the 863,000 email documents that had not yet been processed at the time EP02 was produced. This visualization captured a moment in time; NER for DS9 and DS11 has since been completed (PAPER TRAIL Project, 2026).
Progress Ring (ppt_ep02_slide17_progress_ring.svg). A circular arc filled to 40.8% with "857K documents" at center. This was the Chao1 completion estimate at EP02 production time, when the pipeline had observed 857,000 unique documents contributing to entity extraction. The figure has since been updated to 63.7% completeness (821,633 observed entities out of an estimated 1,290,141) (PAPER TRAIL Project, 2026).
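A partially filled ring can be drawn as a single SVG arc. The sketch below turns a completion fraction into a path string that starts at twelve o'clock and runs clockwise. The center, the radius, and the ring_arc_path helper are assumptions for illustration, not the project's actual drawing code.

```python
import math


def ring_arc_path(fraction, cx=960.0, cy=540.0, r=300.0):
    """SVG path for an arc from 12 o'clock, clockwise, covering `fraction`."""
    # A single arc cannot describe a full circle, so cap just below 1.
    fraction = min(max(fraction, 0.0), 0.9999)
    angle = 2 * math.pi * fraction
    # SVG's y axis points down, so "up" is cy - r.
    x = cx + r * math.sin(angle)
    y = cy - r * math.cos(angle)
    large_arc = 1 if fraction > 0.5 else 0  # arcs over 180 degrees need this flag
    return (f"M {cx:.1f} {cy - r:.1f} "
            f"A {r:.1f} {r:.1f} 0 {large_arc} 1 {x:.1f} {y:.1f}")


ep02_path = ring_arc_path(0.408)             # value shown in the EP02 slide
current = round(100 * 821_633 / 1_290_141, 1)  # later figure from the text
current_path = ring_arc_path(current / 100)
```

The large-arc flag is the detail that changes between the two snapshots: 40.8% stays under half the circle, while 63.7% crosses it.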
Chao1 Formula (ppt_ep02_slide19_chao1_formula.svg). The Chao1 species richness estimator — a statistical method that estimates the total number of entities likely to exist based on how many appear only once or twice — presented as a three-step visual formula. Step one: f1 (singletons, entities appearing only once) = 197,945. Step two: f2 (doubletons, entities appearing exactly twice) = 41,816. Step three: the Chao1 equation with actual numbers substituted, showing how singleton and doubleton counts project the total entity population. This visualization transforms an abstract statistical concept into a concrete calculation the audience can follow (Chao, 1984; PAPER TRAIL Project, 2026).
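The substitution step can be checked with a few lines of code. The sketch below assumes the classic Chao1 form, observed count plus f1 squared over twice f2, with the observed count of 821,633 taken from the progress figures elsewhere in this piece. The text does not say which Chao1 variant the slide uses; the classic form is chosen here because it reproduces the reported estimate of 1,290,141.

```python
def chao1(s_obs, f1, f2):
    """Classic Chao1 richness estimate: S_obs + f1^2 / (2 * f2)."""
    if f2 == 0:
        # Common bias-corrected fallback when there are no doubletons.
        return s_obs + f1 * (f1 - 1) / 2
    return s_obs + (f1 * f1) / (2 * f2)


s_obs = 821_633  # observed entities (from the text)
f1 = 197_945     # singletons
f2 = 41_816      # doubletons

estimate = chao1(s_obs, f1, f2)
unseen = estimate - s_obs                   # entities projected but not yet observed
completeness = round(100 * s_obs / estimate, 1)
```

With these inputs the unseen term is roughly 468,500 entities, which is what separates the observed count from the projected total.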
Frequency Spectrum (ppt_ep02_slide20_frequency_spectrum.svg). Four vertical bars representing the frequency distribution: 197,000 entities appearing once (singletons), 42,000 appearing twice (doubletons), 59,000 appearing three times (tripletons), and 522,000 appearing four or more times. The singleton bar is the tallest of the exact-count bars, exceeded only by the aggregated four-or-more bar: nearly one in four entities in the corpus appears only once. This is either a signal of genuine rare entities or a symptom of OCR fragmentation (where the text-recognition software splits one name into multiple variants), and the visualization makes the scale of the ambiguity visible (PAPER TRAIL Project, 2026).
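The "nearly one in four" figure can be reproduced from the rounded bar values. The sketch below computes each class's share of the total and a proportional bar height. The shares and bar_heights helpers and the 800-pixel maximum height are assumptions for illustration.

```python
# Rounded bar values quoted in the text (entities per frequency class).
spectrum = {"1": 197_000, "2": 42_000, "3": 59_000, "4+": 522_000}


def shares(counts):
    """Percentage of all entities falling in each frequency class."""
    total = sum(counts.values())
    return {k: round(100 * v / total, 1) for k, v in counts.items()}


def bar_heights(counts, max_height=800.0):
    """Scale bars so the largest class fills max_height pixels."""
    tallest = max(counts.values())
    return {k: round(max_height * v / tallest, 1) for k, v in counts.items()}


pct = shares(spectrum)
heights = bar_heights(spectrum)
```

With the rounded values, singletons come to about 24% of the total, which matches the one-in-four description.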
Saturation Heatmap (ppt_ep02_slide21_saturation_heatmap.svg). A grid of 15 datasets colored by their Chao1 completeness percentage, using a green-to-red-to-black gradient. Green datasets are well-saturated; red datasets have significant entity populations remaining; black datasets are deeply incomplete. DS10 at 57.4% appears in the transitional zone. This visualization answers the question: where should the next analytical effort be directed? (PAPER TRAIL Project, 2026).
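One way to build a green-to-red-to-black scale is piecewise linear interpolation between color stops. In the sketch below, the stop positions (black at 0%, red at 50%, green at 100%) and the RGB values are assumptions; the source names the gradient but not its breakpoints or exact colors.

```python
# Hypothetical color stops: (completeness %, (R, G, B)).
STOPS = [
    (0.0, (0, 0, 0)),         # black: deeply incomplete
    (50.0, (220, 38, 38)),    # red: large entity population remaining
    (100.0, (34, 197, 94)),   # green: well-saturated
]


def heat_color(pct):
    """Map a completeness percentage to a hex color along the gradient."""
    pct = min(max(pct, 0.0), 100.0)
    for (lo, c_lo), (hi, c_hi) in zip(STOPS, STOPS[1:]):
        if pct <= hi:
            t = (pct - lo) / (hi - lo)  # position within this segment, 0..1
            rgb = [round(a + (b - a) * t) for a, b in zip(c_lo, c_hi)]
            return "#{:02x}{:02x}{:02x}".format(*rgb)


ds10_color = heat_color(57.4)  # DS10 lands just past the red stop
```

Under these assumed stops, a dataset at 57.4% sits slightly on the green side of red, which is consistent with describing it as transitional.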
Design Consistency
All seven SVGs share a design language: identical background color, consistent font choices, matching accent palettes. This consistency is not aesthetic preference — it is functional. When a viewer sees the same visual language across slides, the cognitive load of parsing each new visualization drops. The brain recognizes the format and focuses on the data.
EP01 also produced 11 SVGs in the same format, covering the release timeline, dataset breakdown, corpus dimensions, compliance gap, and privacy inversion. Together, the 18 SVGs across EP01 and EP02 establish the visual vocabulary that subsequent episodes inherit (PAPER TRAIL Project, 2026).
Snapshots in Time
Several of the EP02 visualizations captured interim data that has since changed. The 40.8% progress ring is now 63.7%. The DS9 and DS11 NER bars at 0% are now complete. These are not errors — they are honest representations of the pipeline's state at each episode's production date. The series documents a process, and processes change. The visualizations record where the analysis stood, not where it ended up.
References
Chao, A. (1984). Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics, 11(4), 265-270.
PAPER TRAIL Project. (2026). Seven EP02 SVG visualizations [Visualization]. visualizations/ppt_ep02_slide04_pipeline_dag.svg through ppt_ep02_slide21_saturation_heatmap.svg.
PAPER TRAIL Project. (2026). EP01 SVG visualizations (11 files) [Visualization]. visualizations/ppt_ep01_slide07_release_timeline.svg and others.
PAPER TRAIL Project. (2026). Entity distribution: Organization 57%, Person 37%, Location 6% [Data]. PostgreSQL db=epstein_files, entities table.
PAPER TRAIL Project. (2026). Chao1 current estimate: 63.7% (821,633 / 1,290,141) [Data]. _exports/validation/chao1_summary.json.
PAPER TRAIL Project. (2026). NER coverage: DS9 + DS11 now complete [Data]. Script 04 status.
PAPER TRAIL Project. (2026). Frequency spectrum: 197K/42K/59K/522K [Data]. _exports/validation/frequency_spectrum.csv.