125,620 Communities: Network Topology of 2.38 Million Entities

TLDR

Leiden community detection, a grouping algorithm that finds clusters of closely connected entities, was applied to 29.5 million entity relationships. It identified 125,620 distinct communities and 535,318 entities that bridge otherwise disconnected groups (PAPER TRAIL Project, 2026a). The network reveals how information and relationships were compartmentalized across the Epstein document corpus, and which entities occupied positions of control between communities.


Building the Network

Before you can analyze a network, you have to build one. Our pipeline starts with 2.05 million processed documents, from which Named Entity Recognition extracted 2.38 million entities -- persons, organizations, and locations (PAPER TRAIL Project, 2026b). A co-occurrence graph was then constructed: a network where two entities are connected if they appear in the same document. The result is 29.5 million unique entity relationship pairs, forming one of the largest forensic document-based networks ever built from a single corpus (PAPER TRAIL Project, 2026c).
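
As a rough illustration of the co-occurrence step, the sketch below counts how many documents each unordered entity pair shares. The function name, the input shape (document id mapped to a list of entity names), and the toy data are assumptions made for this example; they are not taken from the project's pipeline.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(doc_entities):
    """Count, for every unordered entity pair, how many documents contain both."""
    pair_counts = Counter()
    for entities in doc_entities.values():
        # Deduplicate within a document and sort so (a, b) and (b, a) collapse.
        for a, b in combinations(sorted(set(entities)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Hypothetical toy input: document id -> entities extracted from that document.
docs = {
    "doc1": ["Alice", "Acme Corp", "Miami"],
    "doc2": ["Alice", "Acme Corp"],
    "doc3": ["Bob", "Miami", "Miami"],
}
pairs = build_cooccurrence(docs)
```

Each key of `pairs` is one unique entity pair, and the count is the number of documents in which the pair co-occurs. Enumerating pairs is quadratic in the number of entities per document, so a corpus-scale run would likely need batching or a database-side join.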

A document-based network is not a relationship graph in the social sense. Two entities appearing in the same document does not prove they have a relationship. But at scale, co-occurrence patterns become meaningful. Entities that repeatedly appear together across many documents have a statistical association that warrants investigation. Entities that never co-occur, despite both being common, may be deliberately compartmentalized.

Why Leiden, Not Louvain

Grouping algorithms identify clusters of densely connected nodes within a larger network. The most widely known algorithm, Louvain, has a well-documented flaw: it can produce disconnected communities -- clusters that contain nodes with no path between them within the community (Traag et al., 2019). For forensic work, this is unacceptable. A "community" that includes entities with no connection to each other is analytically useless.

The Leiden algorithm, published in 2019, guarantees that every community it produces is well-connected -- there are no disconnected subgroups within any community (Traag et al., 2019). This guarantee is why we chose Leiden over Louvain despite Louvain's greater name recognition.
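
The connectivity property at stake here can be checked directly. The sketch below is an illustrative check, not part of either algorithm: it runs a breadth-first search that is only allowed to step through members of the community, and reports whether every member was reached. The function name and the toy adjacency map are assumptions for this example.

```python
from collections import deque


def is_connected_within(community, adjacency):
    """Return True if every member can reach every other member
    using only paths that stay inside the community."""
    members = set(community)
    if not members:
        return True
    start = next(iter(members))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            # Ignore edges that leave the community.
            if neighbor in members and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen == members


# Hypothetical path graph: A - B - C - D
adjacency = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
ok = is_connected_within({"A", "B", "C"}, adjacency)
broken = is_connected_within({"A", "B", "D"}, adjacency)
```

In the second call, D can only be reached through C, which lies outside the community, so the group is internally disconnected. This is the kind of assignment described above as a flaw of Louvain.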

Resolution Parameter

Leiden optimizes a quality function. We use the Constant Potts Model (CPM), whose resolution parameter (gamma) controls community granularity. A low gamma produces fewer, larger communities; a high gamma produces many small ones. We selected gamma=0.05 through a parameter sweep, balancing granularity against fragmentation (PAPER TRAIL Project, 2026d).

At gamma=0.05, the algorithm identified 125,620 distinct communities across the 2.38 million entity network (PAPER TRAIL Project, 2026a). The community size distribution follows a power law, as expected in social and document networks: a few very large communities contain thousands of entities, while the vast majority of communities contain only a handful.
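
To make the role of gamma concrete, the sketch below evaluates the Constant Potts Model quality for an unweighted graph: the number of edges inside communities, minus gamma times the number of possible node pairs inside communities. It only scores a given partition and does not perform the Leiden optimization itself. The toy graph (two triangles joined by one edge) and all names are assumptions for this example.

```python
from math import comb


def cpm_quality(edges, membership, gamma):
    """CPM quality: sum over communities of (internal edges - gamma * C(size, 2))."""
    sizes = {}
    for community in membership.values():
        sizes[community] = sizes.get(community, 0) + 1
    internal_edges = sum(1 for u, v in edges if membership[u] == membership[v])
    penalty = sum(comb(n, 2) for n in sizes.values())
    return internal_edges - gamma * penalty


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
merged = {n: 0 for n in range(6)}                 # one community of six nodes
split = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}      # one community per triangle

low = (cpm_quality(edges, merged, 0.05), cpm_quality(edges, split, 0.05))
high = (cpm_quality(edges, merged, 0.5), cpm_quality(edges, split, 0.5))
```

On this toy graph the merged partition scores higher at gamma=0.05, while the split partition scores higher at gamma=0.5, matching the rule that low gamma favors fewer, larger communities.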

535,318 Brokers

The second analytical layer examines gaps between groups that only certain entities bridge -- what network scientists call structural holes (Burt, 1992). The entities that occupy these bridging positions -- called brokers -- have disproportionate control over information flow. They connect groups that would otherwise be isolated from each other.

Our analysis identified 535,318 broker entities (PAPER TRAIL Project, 2026a). For each entity we compute a constraint score, a measure of how much the entity depends on a single group versus bridging multiple groups. Brokers are the nodes with low constraint, meaning they have diverse, non-redundant connections across multiple communities. In practical terms, a broker might be an attorney who appears in both corporate filings and court documents, or a financial entity that connects shipping records to wire transfer records.
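
For readers who want to see what a constraint score looks like in code, here is a minimal sketch of Burt's constraint for an unweighted, undirected graph. It assumes each tie carries equal weight and sums only over a node's direct contacts, a common simplification. The project's actual weighting and filtering are not reproduced here, and the function name and toy graphs are assumptions for this example.

```python
def burt_constraint(adjacency, node):
    """Burt's constraint for `node`; lower values suggest a brokerage position."""
    neighbors = adjacency[node]
    if not neighbors:
        return float("nan")  # undefined for isolated nodes

    def p(i, j):
        # Share of i's ties invested in j: 1/degree(i) if adjacent, else 0.
        return 1.0 / len(adjacency[i]) if j in adjacency[i] else 0.0

    total = 0.0
    for j in neighbors:
        # Indirect investment in j via mutual contacts q.
        indirect = sum(p(node, q) * p(q, j) for q in neighbors if q != j)
        total += (p(node, j) + indirect) ** 2
    return total


# Hypothetical toy graphs: a star (hub bridges three leaves) and a closed triangle.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
triangle = {"x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}

hub_score = burt_constraint(star, "hub")
leaf_score = burt_constraint(star, "a")
clique_score = burt_constraint(triangle, "x")
```

The hub, whose contacts are not connected to each other, has a much lower constraint than a leaf or a member of the closed triangle, which is the pattern the broker analysis looks for.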

A filtering technique was applied to prevent highly mentioned entities from artificially inflating broker scores. In networks that follow a power law, a small number of nodes have extremely high connection counts, on the order of thousands. Without filtering, these nodes would dominate the broker rankings simply because they appear in so many documents, regardless of whether they actually bridge distinct communities. The filtering removes the confounding effect of sheer mention frequency, isolating genuine structural positioning (PAPER TRAIL Project, 2026d).

Statistical Significance

A natural question is whether the community structure is real or an artifact of the algorithm. We address this through null hypothesis testing: the observed network structure is compared against 1,000 randomly rewired networks, synthetic networks with the same distribution of connection counts (the degree distribution) but randomized connections (PAPER TRAIL Project, 2026d). The quality gate requires a network organization score (modularity Q) above 0.30. If the real network's score also exceeds what the random networks produce, the community structure is statistically significant and not simply a product of uneven connection counts.
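
The building blocks of such a test can be sketched as follows: a modularity score for a given partition, a rewiring routine that swaps edge endpoints while keeping every node's connection count fixed, and an empirical p-value. This is an illustrative sketch under stated assumptions, not the project's implementation. A full test would also re-run community detection on each rewired network, which is omitted here, and all names and the toy graph are hypothetical.

```python
import random


def modularity(edges, membership):
    """Modularity Q = sum over communities of (L_c / m - (d_c / 2m)^2)."""
    m = len(edges)
    internal, degree_sum = {}, {}
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


def degree_preserving_rewire(edges, n_swaps, rng):
    """Double edge swaps: (a,b),(c,d) -> (a,d),(c,b); node degrees are unchanged."""
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if len({a, b, c, d}) < 4:
            continue  # a shared endpoint would create a self-loop or no-op
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # the swap would duplicate an existing edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list


def empirical_p_value(observed, null_scores):
    """Share of null scores at least as large as the observed one (add-one corrected)."""
    hits = sum(1 for s in null_scores if s >= observed)
    return (1 + hits) / (1 + len(null_scores))


# Hypothetical toy graph: two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
membership = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
observed_q = modularity(edges, membership)
rewired = degree_preserving_rewire(edges, 10, random.Random(0))
```

In a full run, the modularity obtained on each of the rewired networks would be collected into `null_scores` and passed to `empirical_p_value` together with the observed score.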

What Communities Reveal

The grouping algorithm transforms an overwhelming 29.5-million-edge graph into interpretable clusters. Each community represents a group of entities that co-occur more frequently with each other than with entities outside the group. In a document corpus like this, communities often correspond to:

  • Legal proceedings: entities that appear together in court filings
  • Corporate structures: entities clustered around shared officers and addresses
  • Geographic clusters: entities associated with specific properties or jurisdictions
  • Temporal cohorts: entities that appear together during specific time periods

The inter-community connections, documented in exported data files, reveal which clusters are linked and through which bridge entities (PAPER TRAIL Project, 2026a). These bridges are often the most analytically interesting elements -- they show where otherwise separate domains of activity connect.

Downstream Integration

The community assignments feed into the cross-domain synthesis engine, where community groupings are cross-referenced with financial flows, shipping patterns, and temporal breakpoints (PAPER TRAIL Project, 2026e). A wire transfer between entities in different communities is more interesting than one between entities in the same community, because it represents a cross-cluster financial connection. Similarly, documents scored as containing unusually rare entity combinations often contain entity pairs that span community boundaries (PAPER TRAIL Project, 2026f).
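
The cross-referencing idea can be illustrated with a small sketch that flags transfers whose sender and receiver fall in different communities. The record layout (sender, receiver, amount), the function name, and the toy data are assumptions for this example and do not describe the synthesis engine's actual schema.

```python
def flag_cross_community(transfers, membership):
    """Return transfers whose endpoints belong to different communities."""
    flagged = []
    for sender, receiver, amount in transfers:
        src = membership.get(sender)
        dst = membership.get(receiver)
        # Skip entities without a community assignment rather than guess.
        if src is None or dst is None:
            continue
        if src != dst:
            flagged.append((sender, receiver, amount, src, dst))
    return flagged


# Hypothetical toy data: entity -> community id, and (sender, receiver, amount).
membership = {"Entity A": 17, "Entity B": 17, "Entity C": 42}
transfers = [
    ("Entity A", "Entity B", 1000.0),   # same community
    ("Entity A", "Entity C", 2500.0),   # crosses communities 17 -> 42
    ("Entity C", "Entity D", 300.0),    # Entity D has no assignment
]
flagged = flag_cross_community(transfers, membership)
```

Only the second transfer is flagged, since it is the only one that links two different communities with known assignments.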

The 125,620 communities are not an end product. They are a lens -- a way of seeing structure in a network too large for any human to comprehend as a whole.


References

Burt, R. S. (1992). Structural holes: The social structure of competition. Harvard University Press.

PAPER TRAIL Project. (2026a). Network topology exports [Data set]. _exports/network/communities_summary.csv, brokers.csv, community_edges.csv

PAPER TRAIL Project. (2026b). Named entity recognition pipeline [Script]. Script 04, app/scripts/04_extract_entities.py

PAPER TRAIL Project. (2026c). Entity relationships table [Data set]. entity_relationships, 29.5M unique pairs.

PAPER TRAIL Project. (2026d). Implementation methodology: Leiden tuning and Burt's constraint [Data set]. research/IMPLEMENTATION.md

PAPER TRAIL Project. (2026e). Cross-domain synthesis engine [Script]. Script 25b, app/scripts/25_cross_domain_synthesis.py

PAPER TRAIL Project. (2026f). Surprisal scoring [Data set]. Script 22, _exports/surprisal/

Traag, V. A., Waltman, L., & van Eck, N. J. (2019). From Louvain to Leiden: Guaranteeing well-connected communities. Scientific Reports, 9, 5233. https://doi.org/10.1038/s41598-019-41695-z