The Gonka team shipped a semantic cache system that could save node operators serious money on GPU costs — without requiring a single GPU to run.

The Problem: Repeated Work

Every AI inference request on the network costs GPU time. But many requests are similar or even identical. Without caching, each one burns the same compute resources as if it were brand new. Across hundreds of nodes running 24/7, that adds up fast.

Two Levels of Caching

The new system uses a two-tier approach:

L1 (Exact Match) works like a dictionary lookup. If someone asks the exact same question twice, the cached answer comes back instantly. Simple, fast, zero ambiguity.
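
In code, L1 is essentially a hash map keyed on the request text. A minimal Python sketch (the names and normalization here are illustrative, not the actual Gonka implementation):

    import hashlib

    class ExactMatchCache:
        """L1: serve a stored response only when the request text is identical."""

        def __init__(self):
            self._store = {}

        def _key(self, request: str) -> str:
            # Hash the request (trimmed of surrounding whitespace) so identical
            # queries map to the same entry.
            return hashlib.sha256(request.strip().encode()).hexdigest()

        def get(self, request: str):
            return self._store.get(self._key(request))

        def put(self, request: str, response: str) -> None:
            self._store[self._key(request)] = response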

L2 (Similarity Match) is where things get interesting. It uses a lightweight language model (all-MiniLM-L6-v2) running entirely on CPU to compare incoming requests against cached ones. If a new request is close enough to something already answered, the system can serve the cached result instead of firing up the GPU again.
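
A minimal sketch of an L2 lookup using the sentence-transformers package on CPU. The in-memory list and the 0.425 threshold (the calibrated value discussed below) stand in for whatever index and configuration the production system actually uses:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # all-MiniLM-L6-v2 is a small encoder that runs comfortably on CPU.
    model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

    cached_prompts = ["What is proof of compute?", "How long is an epoch?"]
    cached_responses = ["...answer one...", "...answer two..."]
    cached_vecs = model.encode(cached_prompts, normalize_embeddings=True)

    def l2_lookup(request: str, threshold: float = 0.425):
        """Return (response, similarity) for the closest cached entry, or None."""
        vec = model.encode([request], normalize_embeddings=True)[0]
        sims = cached_vecs @ vec          # cosine similarity (vectors are unit-length)
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            return cached_responses[best], float(sims[best])
        return None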

The key innovation is the quality gate pipeline, a set of four checks that every cache hit must pass before being served (sketched in code after the list):

  1. Similarity check — is this request actually close enough?
  2. Verifier — does the cached response still make sense?
  3. Coherence floor — adaptive threshold that tightens for higher-similarity matches
  4. Loop closure — catches circular references where cache entries validate each other
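
The sketch below shows the overall shape of such a gate pipeline. The gate internals, the coherence-floor schedule, and the loop-closure traversal are placeholders for illustration, not the production logic:

    from dataclasses import dataclass

    @dataclass
    class Candidate:
        entry_id: str
        similarity: float   # similarity of the new request to the cached one
        coherence: float    # verifier score for the cached response, in [0, 1]

    def has_validation_cycle(entry_id, vouches):
        # Follow "validated by" links breadth-first; a path that leads back to
        # the starting entry means cache entries are vouching for each other.
        seen, frontier = set(), {entry_id}
        while frontier:
            seen |= frontier
            nxt = set()
            for e in frontier:
                for v in vouches.get(e, ()):
                    if v == entry_id:
                        return True
                    nxt.add(v)
            frontier = nxt - seen
        return False

    def passes_quality_gates(c, vouches, threshold=0.425):
        # 1. Similarity check: is the request actually close enough?
        if c.similarity < threshold:
            return False
        # 2. Verifier: does the cached response still make sense at all?
        if c.coherence <= 0.0:
            return False
        # 3. Coherence floor: demand more coherence the higher the similarity
        #    (placeholder linear schedule).
        if c.coherence < 0.5 + 0.4 * c.similarity:
            return False
        # 4. Loop closure: reject entries whose validation chain loops back to them.
        if has_validation_cycle(c.entry_id, vouches):
            return False
        return True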

No GPU Required

The entire L2 pipeline runs on CPU with int8-quantized models. Peak RAM usage in testing: just 23 MB. Node operators can add caching to their existing setup without buying extra hardware.
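
The article doesn't say which quantization toolchain was used, but dynamic int8 quantization of the encoder's linear layers is one standard way to get a model like this into a small CPU footprint. A generic PyTorch sketch, offered as an assumption rather than the team's actual recipe:

    import torch
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

    # Swap fp32 Linear layers for int8 dynamically quantized ones: weights are
    # stored as int8 and activations are quantized on the fly at inference time.
    model[0].auto_model = torch.quantization.quantize_dynamic(
        model[0].auto_model, {torch.nn.Linear}, dtype=torch.qint8
    )

    vec = model.encode(["hello"], normalize_embeddings=True)
    print(vec.shape)   # (1, 384): all-MiniLM-L6-v2 produces 384-dimensional embeddings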

Real Numbers

The team ran four rounds of testing on Debian Bookworm (CPU-only):

  • 9,216 runs with 4 pattern slots: quality score 0.988
  • 15,360 runs in a K3s mesh with 6 slots: quality score 1.001 — meaning cache quality matched or exceeded live GPU inference
  • 11,520 runs with raw binary data and 197 slots: quality score 1.020

The optimal L2 similarity threshold landed at 4,250 basis points, i.e. a similarity score of 0.425 (F1 score: 0.986). The previous setting of 7,500 basis points was too strict and rejected 64% of valid cache hits.
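
That calibration is a standard sweep: score a labeled set of request pairs, try candidate thresholds, and keep the one that maximizes F1. A generic sketch, with synthetic data standing in for the recorded test runs:

    import numpy as np
    from sklearn.metrics import f1_score

    def best_threshold(similarities, labels):
        """Sweep thresholds in basis points; return (threshold_bps, f1)."""
        best = (0, -1.0)
        for bps in range(0, 10_001, 50):
            preds = similarities >= bps / 10_000
            score = f1_score(labels, preds, zero_division=0)
            if score > best[1]:
                best = (bps, score)
        return best

    # Synthetic stand-in data; the real calibration used the logged cache decisions.
    rng = np.random.default_rng(0)
    sims = rng.uniform(0, 1, 1_000)
    labels = (sims + rng.normal(0, 0.1, 1_000)) > 0.45
    print(best_threshold(sims, labels))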

Economics

For specialized nodes (handling a narrow range of request types), hit rates improve dramatically — up to 571x compared to a generalist node. At current H100 GPU rental prices ($2.50/hour), the team estimates potential savings of roughly $155,800 per year across the full protocol.
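
As a back-of-the-envelope check using only the numbers above, that savings figure corresponds to roughly 62,000 GPU-hours per year, or about seven H100s running around the clock:

    h100_rate = 2.50           # USD per GPU-hour (rental)
    annual_savings = 155_800   # USD per year, team estimate

    gpu_hours_saved = annual_savings / h100_rate
    print(gpu_hours_saved)                  # 62320.0 GPU-hours per year
    print(gpu_hours_saved / (24 * 365))     # ~7.1 GPUs running continuously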

Built Into the Chain

Cache quality feeds directly into the Proof of Compute consensus. A new CacheQualityWeight parameter (controlled by governance) gives validators a bonus at epoch settlement for maintaining high-quality caches. To prevent gaming, the maximum bonus is capped at 30% of a node's weight regardless of cache size.
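
The article doesn't give the settlement formula, but the cap behaves like a simple clamp on whatever bonus the parameter produces. A hypothetical sketch (the multiplicative form and argument names are assumptions; only the 30% cap and the CacheQualityWeight parameter come from the source):

    def cache_bonus(node_weight, cache_quality, cache_quality_weight):
        """Hypothetical epoch-settlement bonus with the 30% cap.

        cache_quality        -- measured cache quality for the epoch, in [0, 1]
        cache_quality_weight -- the governance-controlled CacheQualityWeight parameter
        """
        raw_bonus = node_weight * cache_quality * cache_quality_weight
        return min(raw_bonus, 0.30 * node_weight)   # never more than 30% of weight

    # weight 1000, quality 0.99, CacheQualityWeight 0.5 -> 495 raw, capped to 300
    print(cache_bonus(1000, 0.99, 0.5))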

Node operators can cross-check their local cache stats against on-chain CacheQualityEpochSummary data to verify everything adds up.
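
A reconciliation script can be as simple as diffing local counters against the on-chain summary. The field names below are illustrative, since the CacheQualityEpochSummary schema isn't spelled out here:

    def reconcile(local, on_chain, tolerance=0.001):
        """Return the fields where local stats and the on-chain summary disagree."""
        mismatches = []
        for field in ("hits", "misses", "quality_score"):
            a, b = local.get(field), on_chain.get(field)
            if a is None or b is None or abs(a - b) > tolerance * max(abs(b), 1):
                mismatches.append((field, a, b))
        return mismatches

    local_stats = {"hits": 1_200, "misses": 80, "quality_score": 0.99}
    chain_summary = {"hits": 1_200, "misses": 80, "quality_score": 0.99}
    print(reconcile(local_stats, chain_summary))   # [] means everything adds up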

What's Next

The semantic cache shipped alongside the v0.2.11 upgrade preparation. With database pruning (#867) and validation optimizations (#874) also landing this week, the protocol is getting noticeably leaner heading into Q2 2026.