889 Breakpoints: Detecting Temporal Shifts in Document Activity

TLDR

PELT, an algorithm that detects sudden shifts in time-series behavior, found 889 verified breakpoints in document activity time series: moments when the volume of documents associated with specific entities or topics changed significantly (PAPER TRAIL Project, 2026a). These breakpoints are anchored against 50 verified calibration dates from primary government and court sources, five of which were themselves corrected through external verification (PAPER TRAIL Project, 2026b).


What Is a Change-Point?

A change-point is a moment in a time series when the underlying statistical properties shift. In plain terms: the pattern changes. Applied to a document corpus spanning more than 20 years, change-point detection identifies when the rate of document creation associated with a particular entity, topic, or dataset suddenly increases or decreases. These shifts often correspond to real-world events -- arrests, lawsuits, regulatory actions, media coverage -- that generate or suppress document production.

PELT (Pruned Exact Linear Time) is one of the fastest exact methods for detecting multiple change-points in a time series (Killick et al., 2012). Unlike approximate methods, PELT guarantees it will find the optimal segmentation under a given cost function and penalty. For a corpus this large, computational efficiency matters: we needed an algorithm that could process entity-level time series for millions of entities without running for days.
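The project's source code is not reproduced here, but the core of PELT under an L2 cost is compact enough to sketch in plain Python (production work would more likely use a library such as ruptures). Function and variable names below are illustrative, not taken from the project:

```python
from math import inf

def pelt_l2(x, penalty, min_size=2):
    """Exact PELT segmentation of a 1-D series under an L2 (mean-shift) cost.

    Returns sorted change-point positions: index i means the statistical
    level of the series shifts between x[i-1] and x[i].
    """
    n = len(x)
    # Prefix sums give O(1) segment cost: sum of squared deviations from the mean.
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(x):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def seg_cost(a, b):  # cost of segment x[a:b]
        m = b - a
        return s2[b] - s2[a] - (s1[b] - s1[a]) ** 2 / m

    F = [inf] * (n + 1)        # F[t]: optimal penalized cost of x[:t]
    F[0] = -penalty
    prev = [0] * (n + 1)       # last change-point before t in the optimal split
    cands = [0]                # candidate split points surviving pruning
    for t in range(min_size, n + 1):
        for s in cands:
            if t - s < min_size:
                continue
            v = F[s] + seg_cost(s, t) + penalty
            if v < F[t]:
                F[t], prev[t] = v, s
        # PELT pruning (valid with K = 0 for the L2 cost): drop candidates
        # that can never again appear in an optimal segmentation.
        cands = [s for s in cands
                 if t - s < min_size or F[s] + seg_cost(s, t) <= F[t]]
        cands.append(t)

    # Backtrack from n to recover the change-points.
    cps, t = [], n
    while t > 0:
        if prev[t] > 0:
            cps.append(prev[t])
        t = prev[t]
    return sorted(cps)
```

The pruning step is what makes PELT near-linear in practice: candidates that are provably dominated are discarded, so the inner loop stays short even on long series.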

Configuration

Three parameters govern the algorithm's behavior, and each required careful selection (PAPER TRAIL Project, 2026c).

The cost function is L2, which detects shifts in the mean document creation rate. L2 is appropriate for count data (documents per time period): it targets level shifts rather than reacting to the variance fluctuations that appear in a corpus with uneven document density.

The penalty is the Modified Bayesian Information Criterion (MBIC), a criterion that guards against reporting spurious pattern breaks. MBIC was chosen over standard BIC because it performs better on long time series with uneven segment lengths (PAPER TRAIL Project, 2026c). A corpus spanning 20-plus years contains periods of intense activity (arrests, trials) and periods of near-silence. Standard BIC tends to over-split dense periods and under-split sparse ones; MBIC corrects for this.

The minimum segment size is 14 days, preventing the algorithm from fitting noise in sparse periods while remaining sensitive to genuine activity shifts. Without this floor, the algorithm would detect false breakpoints in weeks where a single document happened to appear in an otherwise empty stretch.

From 966 to 889

The initial run detected 966 breakpoints. Database verification reduced this to 889 (PAPER TRAIL Project, 2026a). The 77 removed breakpoints were artifacts of incomplete date coverage -- time periods where missing dates created apparent activity shifts that did not reflect real changes in document production. This correction step is essential. An algorithm running on imperfect data will find patterns in the imperfections unless those imperfections are identified and excluded.
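The verification procedure itself is not published in this section. As a minimal sketch of the idea only, with a hypothetical filter_breakpoints helper and an assumed +/- 14-day coverage window:

```python
from datetime import date, timedelta

def filter_breakpoints(breakpoints, covered_dates, window_days=14):
    """Drop breakpoints whose surrounding window has gaps in date coverage.

    A breakpoint is kept only if every day within +/- window_days is present
    in covered_dates; otherwise the apparent shift may be an artifact of
    missing dates rather than a real change in document production.
    """
    covered = set(covered_dates)
    kept = []
    for bp in breakpoints:
        window = (bp + timedelta(days=d)
                  for d in range(-window_days, window_days + 1))
        if all(day in covered for day in window):
            kept.append(bp)
    return kept
```

A breakpoint sitting next to a coverage gap is exactly the kind of "pattern in the imperfections" this step exists to exclude.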

A diagnostic tool called CROPS (Changepoints for a Range of Penalties) was used to validate that the penalty setting produces stable results (Haynes et al., 2017). CROPS sweeps the penalty parameter across a range and counts the resulting change-points at each value. A stable algorithm produces a smooth curve; an unstable one shows sharp jumps. The CROPS diagnostics for this corpus confirmed stability (PAPER TRAIL Project, 2026a).
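CROPS proper recovers the exact solution path across a whole penalty interval with far fewer detector runs than a grid search. Purely to illustrate the diagnostic idea, a brute-force sweep can be sketched as follows, where detect is any function mapping a penalty value to a list of change-points:

```python
def penalty_sweep(detect, penalties):
    """CROPS-style diagnostic: count change-points at each penalty value.

    Returns the count curve and the largest jump between adjacent penalties.
    A smooth curve (small max jump) suggests a stable configuration; sharp
    jumps suggest the segmentation is sensitive to the penalty choice.
    """
    counts = [len(detect(pen)) for pen in penalties]
    jumps = [abs(a - b) for a, b in zip(counts, counts[1:])]
    return counts, max(jumps, default=0)
```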

Calibration Anchors

Raw change-point detection tells you when something changed. It does not tell you why. To connect algorithmic output to reality, we anchored the 889 breakpoints against 50 verified calibration dates drawn from primary government and court sources spanning 2005 to 2026 (PAPER TRAIL Project, 2026d). These anchors include arrest dates, sentencing dates, civil filing dates, regulatory actions, and legislative events -- all verified against original documents.
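The project's exact matching rule is not specified here; as one plausible sketch, with a hypothetical match_anchors helper and an assumed 7-day tolerance, anchoring could pair each breakpoint with its nearest calibration date:

```python
from datetime import date

def match_anchors(breakpoints, anchors, tolerance_days=7):
    """Pair each detected breakpoint with the nearest calibration anchor,
    keeping only pairs that fall within the tolerance window."""
    matches = []
    for bp in breakpoints:
        nearest = min(anchors, key=lambda a: abs((a - bp).days))
        if abs((nearest - bp).days) <= tolerance_days:
            matches.append((bp, nearest))
    return matches
```

Breakpoints that match no anchor are not discarded; they simply remain unexplained until a corresponding real-world event is identified.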

Five of those calibration dates were themselves corrected during the verification process (PAPER TRAIL Project, 2026b). The Non-Prosecution Agreement date was off by 86 days. The jail start date was off by one day. The Giuffre v. Maxwell filing date was off by 262 days. The USVI v. JPMorgan settlement was off by 252 days. And the NYDFS consent order date was off by eight months. These corrections illustrate why calibration cannot rely on secondary sources -- even widely cited dates can be wrong.

Corpus Normalization

A naive change-point analysis would over-detect in high-volume periods. If Data Set 10 contains 500,000 documents and Data Set 1 contains 10,000, any entity appearing primarily in Data Set 10 will show artificially elevated activity simply because there are more documents in which it could appear. Corpus normalization adjusts for this, expressing document activity relative to the baseline volume for each time period (PAPER TRAIL Project, 2026c). The result is that change-points reflect genuine shifts in relative activity, not artifacts of uneven corpus composition.
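The project's exact normalization formula is not reproduced here; as a sketch of the idea, expressing entity activity as a share of corpus-wide volume per period:

```python
def normalize_activity(entity_counts, corpus_counts):
    """Express an entity's per-period document count as a fraction of the
    corpus-wide count for the same period, so change-points reflect shifts
    in relative activity rather than overall corpus density."""
    return [e / c if c else 0.0 for e, c in zip(entity_counts, corpus_counts)]
```

Note how a raw tenfold jump in entity counts vanishes when the corpus itself grew tenfold in the same period: the normalized series stays flat, and no spurious change-point is produced.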

Cross-Domain Integration

The 889 breakpoints feed into the cross-domain synthesis engine, where temporal change-points are correlated with events from other analytical domains (PAPER TRAIL Project, 2026e). When a breakpoint in document activity aligns with a Deutsche Bank account closure, a TD Bank migration, or the October 2005 FedEx cutoff, the temporal coincidence strengthens both findings. Conversely, breakpoints that align with nothing in other domains may represent analytical noise -- or they may represent events not yet identified.

The change-point analysis also interacts with the community grouping algorithm. Communities that show synchronized breakpoints -- multiple entities within the same community experiencing activity shifts at the same time -- may reflect coordinated responses to external events. Communities with asynchronous breakpoints may represent independent activity streams (PAPER TRAIL Project, 2026a).

Why This Matters

889 breakpoints across a 20-year corpus works out to roughly one structural shift every eight days, on average. That density reflects the complexity of the underlying events. This was not a single crime with a single timeline. It was two decades of corporate formation, financial transactions, legal proceedings, regulatory actions, and media coverage, all generating documents at varying rates. The change-point analysis imposes structure on that complexity, identifying the moments when the tempo changed.


References

Haynes, K., Eckley, I. A., & Fearnhead, P. (2017). Computationally efficient changepoint detection for a range of penalties (CROPS). Journal of Computational and Graphical Statistics, 26(1), 134-143. https://doi.org/10.1080/10618600.2015.1116445

Killick, R., Fearnhead, P., & Eckley, I. A. (2012). Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association, 107(500), 1590-1598. https://doi.org/10.1080/01621459.2012.737745

PAPER TRAIL Project. (2026a). PELT change-point detection results [Data set]. _exports/temporal/changepoints_summary.csv, changepoints_by_entity.csv

PAPER TRAIL Project. (2026b). Corroboration report: Calibration timeline corrections [Data set]. research/CORROBORATION_REPORT.md, Section 4.

PAPER TRAIL Project. (2026c). Implementation methodology: PELT parameters [Data set]. research/IMPLEMENTATION.md, Section 4.

PAPER TRAIL Project. (2026d). Calibration timeline: 50 verified anchor dates [Data set]. research/CALIBRATION_TIMELINE.md

PAPER TRAIL Project. (2026e). Cross-domain synthesis engine [Script]. Script 25b, app/scripts/25_cross_domain_synthesis.py