Live network data across epochs 161-191 revealed a routing blind spot: some GPU nodes were missing 28% of their assigned inferences while the SPRT (sequential probability ratio test) mechanism was still collecting enough samples to act. During that collection window, typically 10 to 50 or more inferences, degraded nodes received the same share of traffic as healthy ones.

The Problem

The existing SPRT-based deactivation system is statistically rigorous but deliberately slow. It needs enough data points to reach confidence before removing a node. Meanwhile, GetRandomExecutor routes requests with equal probability to all nodes, regardless of recent performance.

Across the 2.5 million inferences analyzed, the overall miss rate was 3.25% (roughly 81,000 misses). The standard deviation of completion rates across nodes, however, reached 7.4%, which means some nodes performed far worse than the average while SPRT had not yet flagged them.

Three-State Circuit Breaker

The fix introduces a fast circuit breaker that runs inside each epoch, composing with existing filters in the executor selection pipeline:

HEALTHY → EXCLUDED → PROBE → HEALTHY (or back to EXCLUDED on a failed probe)

  • A node moves from HEALTHY to EXCLUDED when its miss rate exceeds 25% with at least 4 samples
  • After a cooldown period (starting at 50 blocks, roughly 5 minutes), it enters PROBE state
  • In PROBE, the node receives exactly one test inference. Success returns it to HEALTHY; failure doubles the cooldown and sends it back to EXCLUDED
  • Repeated failures grow the cooldown by exponential backoff, capped at 500 blocks (approximately 50 minutes)
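
The transitions above can be sketched in Go. This is a minimal illustration rather than the code in keeper/circuit_breaker.go: the names Breaker, NewBreaker, Record and Tick are hypothetical. It also assumes two things the description does not specify: that a successful probe resets both the sample counters and the cooldown, and that results arriving while a node is excluded are ignored.

```go
package main

// State is the circuit breaker state of a single node (hypothetical type).
type State int

const (
	Healthy State = iota
	Excluded
	Probe
)

// Thresholds quoted in the text; the constant names are assumptions.
const (
	minSamples      = 4   // samples required before the breaker may trip
	missRatePercent = 25  // trip when the miss rate exceeds this percentage
	baseCooldown    = 50  // initial cooldown, in blocks
	maxCooldown     = 500 // cap on the cooldown, in blocks
)

// Breaker tracks one node within an epoch.
type Breaker struct {
	State      State
	Samples    int64
	Misses     int64
	Cooldown   int64 // current cooldown length, in blocks
	ExcludedAt int64 // block height at which the node was last excluded
}

// NewBreaker returns a breaker in the Healthy state with the base cooldown.
func NewBreaker() *Breaker {
	return &Breaker{State: Healthy, Cooldown: baseCooldown}
}

// Record registers the outcome of one inference at the given block height.
func (b *Breaker) Record(success bool, height int64) {
	switch b.State {
	case Healthy:
		b.Samples++
		if !success {
			b.Misses++
		}
		// misses/samples > 25% rewritten as integer math to avoid floats.
		if b.Samples >= minSamples && b.Misses*100 > b.Samples*missRatePercent {
			b.State = Excluded
			b.ExcludedAt = height
		}
	case Probe:
		if success {
			// Assumption: a successful probe gives the node a clean slate.
			b.State = Healthy
			b.Samples, b.Misses = 0, 0
			b.Cooldown = baseCooldown
			return
		}
		// Failed probe: double the cooldown, up to the cap, and exclude again.
		b.Cooldown *= 2
		if b.Cooldown > maxCooldown {
			b.Cooldown = maxCooldown
		}
		b.State = Excluded
		b.ExcludedAt = height
	case Excluded:
		// Assumption: outcomes reported while excluded are ignored.
	}
}

// Tick moves an excluded node to Probe once its cooldown has elapsed.
func (b *Breaker) Tick(height int64) {
	if b.State == Excluded && height-b.ExcludedAt >= b.Cooldown {
		b.State = Probe
	}
}
```

With these constants, repeated failed probes take the cooldown through 50, 100, 200 and 400 blocks before it settles at the 500-block cap. The sketch does not enforce the "exactly one test inference" rule for PROBE. That would belong in the selection path, which would need to track whether a probe is already in flight.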

Two safety mechanisms prevent cascading failures. First, if every node ends up excluded, the filter passes the full, unfiltered list through, so the network keeps operating rather than halting. Second, all circuit breaker state resets at epoch boundaries, so this layer can never exclude a node permanently.
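
Both safeguards are small enough to sketch. The example below is illustrative only: FilterExcluded and ResetAtEpochBoundary are hypothetical names, and representing the excluded set as a map keyed by node address is an assumption, not the keeper's actual storage layout.

```go
package main

// FilterExcluded removes excluded nodes from the candidate list.
// If that would leave no candidates, it returns the original list unchanged,
// so selection always has at least the nodes it started with.
func FilterExcluded(candidates []string, excluded map[string]bool) []string {
	kept := make([]string, 0, len(candidates))
	for _, addr := range candidates {
		if !excluded[addr] {
			kept = append(kept, addr)
		}
	}
	if len(kept) == 0 {
		// Safety fallback: every node is excluded, so pass the full list.
		return candidates
	}
	return kept
}

// ResetAtEpochBoundary clears all exclusion state, so no node stays
// excluded across epochs at this layer.
func ResetAtEpochBoundary(excluded map[string]bool) {
	for addr := range excluded {
		delete(excluded, addr)
	}
}
```

Returning the input list on an empty result keeps the filter composable: a later stage in the pipeline cannot tell whether the fallback fired, and it never receives an empty candidate set because of the breaker alone.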

Reputation-Adjusted Selection

Alongside the circuit breaker, executor selection now factors in historical reputation scores. Previously, selection weight was based purely on stake. Now the weight calculation multiplies stake by reputation (scaled 0-100), with a floor of 1%.

A node with reputation score 50 receives roughly half the traffic of an equally-staked node with a clean record. This creates a gradient rather than a binary on/off, giving recovering nodes proportionally less traffic while still allowing them to rebuild their score.
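
As a rough illustration of the weighting, the sketch below multiplies stake by a reputation percentage. ReputationAdjustedWeight is a hypothetical name, not the helper in epochgroup/epoch_group.go. The sketch also assumes that the 1% floor applies to the reputation multiplier and that integer arithmetic is used.

```go
package main

// ReputationAdjustedWeight scales a stake-based weight by a reputation
// score on a 0-100 scale. Scores below 1 are raised to 1 (the 1% floor)
// and scores above 100 are clamped to 100.
func ReputationAdjustedWeight(stake, reputation int64) int64 {
	if reputation < 1 {
		reputation = 1
	}
	if reputation > 100 {
		reputation = 100
	}
	// Integer division: a stake below 100 can still round down to zero
	// at very low reputation.
	return stake * reputation / 100
}
```

For a stake of 1000, a reputation of 100 yields a weight of 1000, a reputation of 50 yields 500, and a reputation of 0 yields 10 rather than 0. A low-reputation node therefore keeps a small share of traffic through which it can rebuild its score.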

Implementation Details

The changes span 7 files with 671 additions across the keeper, module, and epoch group packages. The circuit breaker state lives in keeper/circuit_breaker.go with 12 unit test cases covering state transitions, edge cases, and the safety fallback. The reputation weight calculation sits in epochgroup/epoch_group.go as a composable helper.

The feature addresses three tracked issues: slow node investigation (#818), missed inference root cause analysis (#820), and the circuit breaker design specification (#942). SPRT remains unchanged as the long-term statistical backstop — the circuit breaker handles the fast response while SPRT handles permanent decisions.