EP02: The Pipeline — 1.18 Million Entities and the Missing Third
TLDR Episode 2 of PAPER TRAIL introduces the 16-script processing pipeline that converted 2.1 million raw government documents into structured, searchable data...
8 investigations
TLDR Episode 2 of PAPER TRAIL introduces the 16-script processing pipeline that converted 2.1 million raw government documents into structured, searchable data...
TLDR Episode 10 of PAPER TRAIL covers the 863,000-email sub-corpus spanning Data Sets 9 and 11 — scanned printouts processed through Relativity litigation...
TLDR Data Set 9 (531,256 files) and Data Set 11 (331,655 files) together hold 862,911 files, roughly 863,000 email documents, making this the largest sub-corpus. DS11 was publicly...
TLDR A pipeline of 27+ Python scripts transforms 2.1 million raw government documents into a searchable PostgreSQL database with 2.38 million extracted...
TLDR Data Set 9 is the largest single data set by file count: 531,256 email PDFs occupying 94.5 GB. It achieved a perfect disk-to-database match (zero delta),...
TLDR Seven production Scalable Vector Graphics (SVG) images — a type of image that stays sharp at any size — were created at 1920x1080 resolution with a...
TLDR Automated name detection software extracted 2,383,898 entities from 2.05 million documents — 57% organizations, 37% persons, 6% locations. Nearly one in...
TLDR Probabilistic entity resolution using a statistical matching method reduced 2.38 million raw Named Entity Recognition entries to 519,000 unified clusters...