Tag

#b200

Articles tagged "b200" — 1 entry.

Article №27 · foundations · TensorRT-LLM · ~22 minute read
Looking Beyond Spark

Looking Beyond Spark — KV-Cache Arithmetic at Inference

The serving memory bill is not dominated by weights. It's dominated by KV cache, and KV cache scales with concurrent users × context length, not with parameter count. The same four bills as training apply, but in different proportions. A 70B model at 32 users × 16k context wants 168 GB for KV cache alone, and the Spark teaches you the per-token math.
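As a rough check on the 168 GB figure, the per-token arithmetic can be sketched in a few lines. The summary does not say which 70B model is meant, so the shape below is an assumption: a Llama-style 70B with 80 layers, 8 KV heads (grouped-query attention), head dimension 128, an FP16 cache at 2 bytes per element, and "16k" read as 16,000 tokens. The function names are illustrative and are not part of fieldkit or TensorRT-LLM.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies across all layers."""
    # Leading 2 = one key vector plus one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(users, context_tokens, **model_shape):
    """Total KV cache when every user holds a full context."""
    return users * context_tokens * kv_bytes_per_token(**model_shape)


# ASSUMPTION: Llama-style 70B shape with an FP16 KV cache (not stated in the source).
assumed_70b = dict(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_elem=2)

per_token = kv_bytes_per_token(**assumed_70b)
total = kv_cache_bytes(users=32, context_tokens=16_000, **assumed_70b)

print(f"{per_token} bytes per token")          # 327680 bytes per token
print(f"{total / 1e9:.0f} GB of KV cache")     # 168 GB of KV cache
```

Parameter count never appears in the formula. Only the layer count, the KV head count, the head dimension, and the cache precision matter, which is why the bill tracks users × context rather than model size. Under the same assumptions, reading "16k" as 16,384 tokens gives about 172 GB instead.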

uses fieldkit.capabilities