The Many Rodgers: How Entity Resolution Handles Scanning Chaos
5 investigations
TLDR Splink, a software tool that uses statistics to decide which database records refer to the same person, merged eight or more scanning variants of...
TLDR Nearly one in four entities in the corpus (197,945 out of 821,633) appears in only one document. An entity that appears in only one document — called a...
TLDR A cleanup script (Script 19b) dissolved 8 entity groups containing 3,576 records that had been incorrectly merged because unreadable text that the...
TLDR Two observations were retracted after visual inspection revealed that scanning software (OCR, or optical character recognition — software that reads text...
TLDR OCR engines (software that converts images of text into searchable characters) produce phantom entities from blank form labels and repeated document...