Stage

Deployment

From experiment to something that runs reliably. Containers, services, updates, graceful degradation on one machine.

Article №27 foundations TensorRT-LLM ~22 minute read
Looking Beyond Spark

Looking Beyond Spark — KV-Cache Arithmetic at Inference

The serving memory bill is not weights. It's KV cache, and KV scales with concurrent users × context length, not parameters. Same four bills as training; different weights. A 70B at 32 users × 16k context wants 168 GB just for KV — and the Spark teaches you the per-token math.
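The per-token math behind that figure can be sketched in a few lines. The model geometry below is an assumption on my part (Llama-70B-style GQA: 80 layers, 8 KV heads, head dim 128, FP16 cache); the article's 168 GB comes out if "16k context" is read as 16,000 tokens.

```python
# Rough KV-cache sizing sketch. Geometry is assumed (Llama-70B-style
# GQA: 80 layers, 8 KV heads, head_dim 128, 2-byte FP16 cache);
# it is not stated in the article above.

def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2x for the separate K and V tensors at every layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def kv_total_gb(users, context_tokens, **geom):
    # Total cache grows with concurrent users x context length,
    # independent of the parameter count.
    return users * context_tokens * kv_bytes_per_token(**geom) / 1e9

per_tok = kv_bytes_per_token()      # 327,680 bytes, ~320 KiB per token
total = kv_total_gb(32, 16_000)     # 32 users x 16k context
print(f"{per_tok} B/token, {total:.0f} GB total")  # prints: 327680 B/token, 168 GB total
```

Swap in your own layer count, KV-head count, and cache dtype; with MHA instead of GQA (64 KV heads rather than 8) the same workload would need roughly 8x the memory.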
Article №13 deployment TensorRT-LLM + Triton Inference Server ~4 hours including two container pulls and three engine builds
Second Brain

TensorRT-LLM on the Spark — FP8 Isn't the Reason to Drop NIM. NVFP4 Is.

Dropping below NIM to raw TensorRT-LLM on a GB10 Spark. FP8 beats NIM's vLLM by 10-15% — barely worth the rebuild. NVFP4 beats it by 76% on decode, 43% on TTFT, and ships a 34%-smaller engine. The reason to drop NIM is the Blackwell-native 4-bit kernel, not FP8.