Live telemetry (tile values not captured)
  • GPU Util: % utilisation
  • GPU Temp: °C, die
  • Unified memory: GB of 128, with an 8 GB guard
  • Throughput: tok/s
  • TTFT: ms to first token
  • Active Lane: idle, no warm brain
  • OpenRouter: $0.00 spend since start
  • Unified memory, 60 s chart: 8 GB guard band shown at top

What it's for
  • Pick a local serving lane for a Hermes agent on the Spark
  • Size a MoE vs dense model against the 128 GB unified-memory envelope
  • Reproduce the tool-call-reliability + tok/s + sustained-load numbers
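
As a rough illustration of the sizing step in the list above, the sketch below estimates weight-only memory as parameters × bits per weight and checks it against the 128 GB envelope minus the 8 GB guard. The parameter counts and bits-per-weight values are approximate, assumed inputs rather than measurements from this page, and the estimate ignores KV cache, activations, and runtime overhead, so it is best read as a lower bound.

```python
# Rough weight-only sizing against the unified-memory envelope.
# All model figures below are illustrative assumptions, not measured values.

ENVELOPE_GB = 128.0   # unified memory on the Spark (stated on this page)
GUARD_GB = 8.0        # guard band kept free (stated on this page)


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def fits(params_billion: float, bits_per_weight: float) -> bool:
    """True if the weights alone fit under the envelope minus the guard band."""
    return weight_gb(params_billion, bits_per_weight) <= ENVELOPE_GB - GUARD_GB


# Hypothetical inputs: approximate total parameter counts and bit widths.
candidates = {
    "Qwen3-30B-A3B (MoE), ~4.8 bpw": (30.5, 4.8),
    "Qwen3-32B (dense), ~4.8 bpw": (32.8, 4.8),
    "Qwen3-32B (dense), 8 bpw": (32.8, 8.0),
}

for name, (params_b, bpw) in candidates.items():
    size = weight_gb(params_b, bpw)
    print(f"{name}: ~{size:.1f} GB weights, fits={fits(params_b, bpw)}")
```

A MoE model is sized by its total parameter count, since every expert stays resident in memory, even though only a fraction of the parameters is active per token.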

Audience — DGX Spark power users running a local, no-API-key agent harness.

Quant economics: quality × speed per build
Variant                                    tok/s   Note
NIM · Nemotron-Nano-9B-v2                   27.7
llama.cpp · Qwen3-30B-A3B (MoE, Q4_K_M)     88.0   sweet spot
llama.cpp · Qwen3-32B (dense, Q4_K_M)       10.2
vLLM · Qwen3-30B-A3B (MoE, FP8)             55.9
vLLM · Qwen3-32B (dense, FP8)                6.6
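
To make the MoE versus dense gap concrete, the short sketch below computes the throughput ratio per runtime from the tok/s figures listed above. The dictionary layout and the helper name `moe_speedup` are illustrative choices. The two models differ in size and the runtimes use different quantisation formats, so the ratios are indicative rather than a controlled comparison.

```python
# Decode throughput (tok/s) copied from the figures listed above.
tok_s = {
    ("llama.cpp", "moe"): 88.0,    # Qwen3-30B-A3B, Q4_K_M
    ("llama.cpp", "dense"): 10.2,  # Qwen3-32B, Q4_K_M
    ("vLLM", "moe"): 55.9,         # Qwen3-30B-A3B, FP8
    ("vLLM", "dense"): 6.6,        # Qwen3-32B, FP8
}


def moe_speedup(runtime: str) -> float:
    """Ratio of MoE to dense throughput for one runtime."""
    return tok_s[(runtime, "moe")] / tok_s[(runtime, "dense")]


for runtime in ("llama.cpp", "vLLM"):
    print(f"{runtime}: MoE runs at {moe_speedup(runtime):.1f}x the dense tok/s")
```

On both runtimes the MoE build comes out at roughly eight to nine times the dense throughput.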
Known drift (bounded, honest)
  • Tool-call reliability sample size: the format-error rate is measured over 8 agentic tasks per lane; it is not a large-N guarantee.
  • Qwen3 context vs Hermes minimum: Qwen3 lanes serve at their native 40,960-token context; Hermes's 64K floor is bypassed via the model.context_length and auxiliary.compression.context_length overrides.
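
To show why 8 tasks per lane is not a large-N guarantee, the sketch below computes a Wilson score interval for a format-error rate. The Wilson interval is a choice made here for illustration; the page does not say which statistic, if any, was used. The 0-of-8 outcome is a hypothetical input, not a reported result.

```python
import math


def wilson_interval(errors: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = errors / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical outcome: zero format errors observed in the 8 tasks of one lane.
low, high = wilson_interval(errors=0, trials=8)
print(f"0/8 errors -> 95% interval {low:.3f} to {high:.3f}")
```

Even a clean 0-of-8 run leaves an upper bound near 0.32 on the true error rate, so lane-to-lane differences of a task or two should be read with caution.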