Tag

#fp8

Articles tagged "fp8" — 4 entries.

Article №27 foundations TensorRT-LLM ~22-minute read
Looking Beyond Spark

Looking Beyond Spark — KV-Cache Arithmetic at Inference

The serving memory bill is not weights. It's KV cache, and KV scales with concurrent users × context length, not parameters. Same four bills as training, just weighted differently. A 70B at 32 users × 16k context wants 168 GB just for KV — and the Spark teaches you the per-token math.

uses fieldkit.capabilities
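
A minimal sketch of the per-token arithmetic behind that 168 GB figure. The model shape (80 layers, 8 KV heads under GQA, head dim 128), the 2-byte cache dtype, and the 16k = 16,000-token reading are assumptions about a Llama-3-70B-class config, not values taken from the article; swap in your own model's numbers.

```python
# Hypothetical KV-cache sizing sketch (assumed Llama-3-70B-class shape).
layers, kv_heads, head_dim = 80, 8, 128   # assumed model config
bytes_per_elem = 2                         # fp16 / bf16 cache entries

# K and V each hold (kv_heads * head_dim) values per layer, per token.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
print(f"{kv_bytes_per_token / 1024:.0f} KiB per token")   # ~320 KiB

users, context = 32, 16_000                # 32 concurrent users × 16k context
total_tokens = users * context
total_gb = kv_bytes_per_token * total_tokens / 1e9
print(f"{total_gb:.0f} GB of KV cache")    # ~168 GB, before any weights
```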

Article №19 training NeMo ~30 min once the NeMo container is on disk — 7.4 min wall for the 16-config sweep, the rest is reading the numbers
Machine that Builds Machines

The GB10 Pretrain Envelope — Sweeping Batch, Sequence, and Precision on One Spark

Same 354M GPT, same training loop, swept across micro-batch (2,4,8,16), sequence length (1024,2048), and precision (bf16,fp8). 16 configurations, 30 steps each. Peak: 14,266 tokens/sec at batch=16, seq=1024, fp8 — 18% above the hand-rolled PyTorch baseline.
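
A quick sketch of how the 16-configuration grid falls out of that sweep; the variable names and the loop below are illustrative, not the article's actual NeMo launcher.

```python
# 4 micro-batch sizes × 2 sequence lengths × 2 precisions = 16 configs.
from itertools import product

micro_batches = (2, 4, 8, 16)
seq_lens = (1024, 2048)
precisions = ("bf16", "fp8")

configs = list(product(micro_batches, seq_lens, precisions))
assert len(configs) == 16

for mbs, seq, prec in configs:
    # each configuration runs 30 training steps in the sweep
    print(f"mbs={mbs:>2}  seq={seq}  precision={prec}")
```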

Article №13 deployment TensorRT-LLM + Triton Inference Server ~4 hours including two container pulls and three engine builds
Second Brain

TensorRT-LLM on the Spark — FP8 Isn't the Reason to Drop NIM. NVFP4 Is.

Dropping below NIM to raw TensorRT-LLM on a GB10 Spark. FP8 beats NIM's vLLM by 10-15% — barely worth the rebuild. NVFP4 beats it by 76% on decode, 43% on TTFT, and ships a 34%-smaller engine. The reason to drop NIM is the Blackwell-native 4-bit kernel, not FP8.

Article №05 inference NIM ~2 hours first install, ~2 minutes every restart after
Foundations

Your First NIM on a DGX Spark — What 24.8 Tokens Per Second Doesn't Tell You

First-contact notes on NVIDIA's DGX-Spark-specific Llama 3.1 8B NIM. 9.4 GB image, ~108 s warm-cache cold-start, 24.8 tok/s steady, OpenAI-compatible on :8000 — and a confidently wrong Python one-liner that clarifies what small-model FP8 buys and what it costs.

uses fieldkit.nim