DS9 vs. DS11: 863,000 Emails the DOJ Mislabeled
TLDR Data Set 9 (531,256 files) and Data Set 11 (331,655 files) together hold 862,911 files, roughly 863,000 email documents, making them the largest sub-corpus. DS11 was publicly...
6 investigations
TLDR A pipeline of 27+ Python scripts transforms 2.1 million raw government documents into a searchable PostgreSQL database with 2.38 million extracted...
TLDR Data Set 9 is the largest single data set by file count: 531,256 email PDFs occupying 94.5 GB. It achieved a perfect disk-to-database match (zero delta),...
TLDR Seven production Scalable Vector Graphics (SVG) images, a format that stays sharp at any size, were created at 1920x1080 with a...
TLDR Automated name detection software extracted 2,383,898 entities from 2.05 million documents — 57% organizations, 37% persons, 6% locations. Nearly one in...
TLDR Probabilistic entity resolution reduced 2.38 million raw Named Entity Recognition (NER) entries to 519,000 unified clusters...