Tag
#paged-attention
Articles tagged "paged-attention" — 1 entry.
Looking Beyond Spark — KV-Cache Arithmetic at Inference
The serving memory bill is not weights. It's KV cache, and KV scales with concurrent users × context length, not parameters. Same four bills as training; different weights. A 70B at 32 users × 16k context wants 168 GB just for KV — and the Spark teaches you the per-token math.
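The per-token arithmetic behind that headline number can be sketched in a few lines. This is a hypothetical sizing helper, not from the post itself; the config values (80 layers, 8 KV heads via GQA, head dim 128, fp16) are assumed Llama-70B-like parameters, and the exact total depends on the model config and GB vs GiB rounding, so it lands near rather than exactly on 168 GB:

```python
# Hedged sketch: KV-cache sizing for an assumed Llama-70B-like config.
# None of these helper names come from the post; they are illustrative only.

def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V each store layers * kv_heads * head_dim values per token,
    # hence the factor of 2 out front. dtype_bytes=2 assumes fp16/bf16.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def kv_cache_total(users, context_len, **cfg):
    # KV cache scales with concurrent users * context length,
    # not with parameter count -- the post's central point.
    return kv_bytes_per_token(**cfg) * users * context_len

per_tok = kv_bytes_per_token()       # 327,680 bytes = 320 KiB per token
total = kv_cache_total(32, 16_384)   # ~1.7e11 bytes, in the ~168 GB ballpark
print(f"{per_tok / 2**10:.0f} KiB/token, {total / 1e9:.1f} GB total")
```

Swapping in a full-MHA config (64 KV heads instead of 8) multiplies the total by 8, which is why GQA matters so much for serving.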