Inference Shards: How Gonka v0.2.11 Moves AI Off-Chain

Gonka v0.2.11 introduces inference shards, a fundamental shift in how the network handles AI requests. Instead of recording every inference as an on-chain transaction, the new architecture processes them off-chain inside lightweight subnets, touching the blockchain only twice per session.

The Bottleneck

The current mainnet design records each AI inference as a separate blockchain transaction, which caps throughput at roughly 277 inferences per second. Batching requests helps, but per-request computation and state growth still prevent this approach from scaling to hundreds of thousands of concurrent inferences.
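A quick worked comparison makes the gap concrete. The sketch below counts on-chain transactions for the same workload under both designs. Only the 277 tx/s figure and the two-transactions-per-session pattern come from the text. The workload (1,000 sessions of 100 inferences each) and the helper names are hypothetical.

```python
# Illustrative arithmetic only: the 277 tx/s ceiling comes from the text;
# the workload numbers (sessions, inferences per session) are hypothetical.
ONCHAIN_TPS = 277  # approximate on-chain throughput ceiling cited above


def onchain_txs(sessions: int, inferences_per_session: int, sharded: bool) -> int:
    """Count the on-chain transactions a workload needs under each design."""
    if sharded:
        # Escrow creation + settlement: two transactions per session.
        return 2 * sessions
    # Per-inference design: one transaction per inference.
    return sessions * inferences_per_session


def seconds_of_chain_capacity(txs: int) -> float:
    """Time the chain needs to absorb `txs` transactions at the ceiling."""
    return txs / ONCHAIN_TPS


sessions, per_session = 1_000, 100
for sharded in (False, True):
    txs = onchain_txs(sessions, per_session, sharded)
    print(f"sharded={sharded}: {txs} txs, {seconds_of_chain_capacity(txs):.1f} s")
```

Under these assumptions, the per-inference design needs 100,000 transactions (about six minutes of chain capacity at 277 tx/s). The sharded design needs 2,000 transactions, or about seven seconds.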

The Solution: Subnets as Shards

PR #813 restructures the inference pipeline into three phases:

  1. Escrow creation -- the user locks 5-10 GNK on-chain. The chain samples 16 validator slots via weighted random selection (the same algorithm used in PoC validation) and assigns them as the session subgroup.
  2. Off-chain inference -- all AI requests flow directly to the subgroup over HTTP, with no blockchain transactions. The user acts as the sequencer, ordering requests and propagating state updates to the hosts.
  3. Settlement -- the user submits the final usage state, signed by a slot-weighted supermajority (2/3+) of the hosts. The chain verifies the proof, pays validators proportionally, and refunds unused escrow.
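The settlement check in phase 3 can be sketched as follows. This is a minimal Python illustration, not the PR's code. It assumes "2/3+" means strictly more than two-thirds of slot weight and treats signature verification as already done (`signers` is the set of hosts whose signatures checked out). It also reads "pays validators proportionally" as paying each host the amount recorded for it in the final usage state. `Settlement`, `has_supermajority`, and `settle` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Settlement:
    payouts: dict[str, int]  # host id -> amount paid out of escrow
    refund: int              # unused escrow returned to the user


def has_supermajority(slots: dict[str, int], signers: set[str]) -> bool:
    """Slot-weighted check, assuming '2/3+' means strictly more than two-thirds."""
    total = sum(slots.values())
    signed = sum(w for host, w in slots.items() if host in signers)
    # Integer comparison avoids float rounding: signed / total > 2 / 3.
    return 3 * signed > 2 * total


def settle(escrow: int, slots: dict[str, int], signers: set[str],
           usage: dict[str, int]) -> Settlement:
    """Settle an escrow from a final usage state (signatures assumed pre-verified)."""
    if not has_supermajority(slots, signers):
        raise ValueError("final state lacks a slot-weighted supermajority")
    if any(host not in slots for host in usage):
        raise ValueError("usage references a host outside the session subgroup")
    spent = sum(usage.values())
    if spent > escrow:
        raise ValueError("usage exceeds the escrowed amount")
    # Each host is paid what the signed usage state records; the rest is refunded.
    return Settlement(payouts=dict(usage), refund=escrow - spent)
```

With 16 slots split 6/5/5 across three hosts, any two hosts holding 11 slots clear the threshold (33 > 32), while two hosts holding 10 slots do not.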

Each session effectively becomes its own mini-blockchain, a shard with extremely lightweight consensus.
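The slot sampling in phase 1 can be illustrated with a weighted draw. The text says it reuses the PoC validation algorithm, whose details are not described here, so the sketch below substitutes a generic technique. It makes 16 draws with replacement, each picking a validator with probability proportional to its weight, and seeds them from a hash so that every node computing the draw gets the same result. Drawing with replacement lets one validator win several slots, which is consistent with the slot-weighted supermajority used at settlement. The function name and the seed input are hypothetical.

```python
import hashlib
import random
from collections import Counter


def sample_slots(weights: dict[str, int], seed: bytes, n_slots: int = 16) -> Counter:
    """Draw n_slots validator slots with replacement, proportional to weight.

    Returns a Counter mapping validator id -> number of slots won.
    """
    validators = sorted(weights)  # fixed order so the result depends only on the seed
    rng = random.Random(int.from_bytes(hashlib.sha256(seed).digest(), "big"))
    picks = rng.choices(validators, weights=[weights[v] for v in validators], k=n_slots)
    return Counter(picks)
```

Python's `random` module is used here only for readability. An on-chain implementation would need a fully specified, implementation-independent sampler so that every validator derives identical slots.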

Inside the Subnet Module

The team built a standalone subnet module with zero Cosmos SDK dependencies. This boundary ensures the subnet logic can be tested and reasoned about independently of the main chain.

The module implements a state machine that tracks each inference through 7 states: pending, started, finished, challenged, validated, invalidated, and timed_out. It defines 8 off-chain message types covering the full inference lifecycle. Storage uses SQLite in WAL mode for append-only diff logs, and a gossip protocol with K-fanout nonce propagation keeps state synchronized across hosts.
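The 7 inference states can be modeled as an enum plus a transition table. The source names the states but not the edges between them, so the transitions below are assumptions chosen to be plausible. For example, they assume that only pending or started inferences can time out and that a challenge resolves to validated or invalidated. The names `InferenceState`, `TRANSITIONS`, and `advance` are hypothetical.

```python
from enum import Enum


class InferenceState(Enum):
    PENDING = "pending"
    STARTED = "started"
    FINISHED = "finished"
    CHALLENGED = "challenged"
    VALIDATED = "validated"
    INVALIDATED = "invalidated"
    TIMED_OUT = "timed_out"


S = InferenceState
# Hypothetical transition table; the real module's edges may differ.
TRANSITIONS: dict[InferenceState, set[InferenceState]] = {
    S.PENDING: {S.STARTED, S.TIMED_OUT},
    S.STARTED: {S.FINISHED, S.TIMED_OUT},
    S.FINISHED: {S.VALIDATED, S.CHALLENGED},
    S.CHALLENGED: {S.VALIDATED, S.INVALIDATED},
    S.VALIDATED: set(),    # terminal
    S.INVALIDATED: set(),  # terminal
    S.TIMED_OUT: set(),    # terminal
}


def advance(current: InferenceState, nxt: InferenceState) -> InferenceState:
    """Return nxt if the transition is allowed, otherwise raise ValueError."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {nxt.value}")
    return nxt
```

Keeping the table as data rather than scattered `if` checks makes it easy to assert properties such as "terminal states have no outgoing edges" in tests.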

Security Model

Several mechanisms protect against adversarial behavior. Per-inference escrow accounting reserves tokens at start and releases surplus on completion. Probabilistic validation uses deterministic seeds derived from escrow ID signatures. Timeout verification requires signed votes from other hosts as evidence. Equivocation detection catches conflicting diffs via gossip, terminating the session. Warm key support allows operators to sign with authorized keys via on-chain authz grants.
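Two of these mechanisms lend themselves to a short sketch: deterministic validation sampling and equivocation detection. The Python below is a minimal illustration under stated assumptions, not the module's implementation. It assumes the seed is SHA-256 over the escrow-ID signature concatenated with an inference ID, treats the validation rate as a free parameter, and treats a "diff" as opaque bytes keyed by its nonce. `should_validate` and `EquivocationDetector` are hypothetical names.

```python
import hashlib


def should_validate(escrow_sig: bytes, inference_id: str, rate: float) -> bool:
    """Deterministically decide whether an inference is sampled for validation.

    The seed is derived from a signature over the escrow ID, so every host
    holding that signature reaches the same decision for a given inference.
    """
    digest = hashlib.sha256(escrow_sig + inference_id.encode()).digest()
    # Keep the top 53 bits so the float is exactly uniform on [0, 1).
    draw = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    return draw < rate


class EquivocationDetector:
    """Tracks one diff hash per nonce; a second, different diff is equivocation."""

    def __init__(self) -> None:
        self._seen: dict[int, bytes] = {}

    def observe(self, nonce: int, diff: bytes) -> bool:
        """Record a gossiped diff; return True if it conflicts with an earlier one."""
        h = hashlib.sha256(diff).digest()
        prior = self._seen.setdefault(nonce, h)
        return prior != h
```

Because the draw depends only on data every host already holds, any host can recompute whether a given inference should have been validated, which makes the sampling auditable. In this sketch, a `True` from `observe` is the point at which a session would be terminated.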

Conservative Rollout

This release treats subnets as experimental. Each escrow gets a 16-slot group, escrow amounts are limited to 5-10 GNK, at most 100 escrows can be created per epoch, and unsettled escrows are pruned after 2 epochs. Access is restricted to whitelisted addresses. The plan is to test on mainnet and relax these restrictions over v0.2.12-v0.2.13.
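The rollout limits map naturally onto a small parameter object plus admission checks. In the sketch below, the numeric values come from the text. The field names, the use of whole-GNK integer amounts, and the reading of "pruned after 2 epochs" as "two or more epochs old" are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubnetRolloutParams:
    """Experimental v0.2.11 limits; field names are hypothetical."""
    group_slots: int = 16
    min_escrow_gnk: int = 5
    max_escrow_gnk: int = 10
    max_escrows_per_epoch: int = 100
    prune_after_epochs: int = 2
    whitelist: frozenset[str] = frozenset()


def check_escrow_request(p: SubnetRolloutParams, sender: str, amount_gnk: int,
                         escrows_this_epoch: int) -> None:
    """Raise ValueError if a new escrow would violate the rollout limits."""
    if sender not in p.whitelist:
        raise ValueError("sender is not whitelisted")
    if not p.min_escrow_gnk <= amount_gnk <= p.max_escrow_gnk:
        raise ValueError("escrow amount outside the allowed range")
    if escrows_this_epoch >= p.max_escrows_per_epoch:
        raise ValueError("per-epoch escrow limit reached")


def is_prunable(p: SubnetRolloutParams, created_epoch: int, current_epoch: int,
                settled: bool) -> bool:
    """Assumed rule: unsettled escrows at least prune_after_epochs old are pruned."""
    return not settled and current_epoch - created_epoch >= p.prune_after_epochs
```

Grouping the limits in one frozen object keeps them in a single place, which suits a rollout plan that expects to relax them in later releases.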

What Comes Next

Several areas remain open: aggregating subnet host stats into block rewards and PoC reputation, formal adversarial analysis, end-to-end timeout testing, dynamic pricing adaptation, and BLS signature aggregation to reduce settlement transaction size.

Why It Matters

Inference shards transform Gonka from a chain where every AI request is a transaction into a two-layer architecture: the blockchain handles settlement and security, while lightweight subnets handle the actual AI workload. Because each session touches the chain only at escrow creation and settlement, the 277 inferences/sec ceiling no longer applies to individual inferences, and capacity scales horizontally as more sessions run in parallel.