spark-hermes-profile

Which local lane should drive your always-on Spark agent?

free · MIT · 5 lanes · 88 tok/s peak · 3.4 min sustained

Positioning

Hermes runs the same agent loop over any OpenAI-compatible backend, but the lanes are not interchangeable: a fast lane that can't emit well-formed tool calls is useless to an agent. This profile measures that trade-off between throughput and tool-call reliability.

Audience. DGX Spark power users running a local, no-API-key agent harness.

Lane variants

The harness profile records every serving lane that was driven through the same Hermes agent on the same DGX Spark. Throughput is recorded as measured tokens-per-second on the box.

  • NIM · Nemotron-Nano-9B-v2: 27.7 tok/s
  • llama.cpp · Qwen3-30B-A3B (MoE, Q4_K_M): 88.0 tok/s (recommended)
  • llama.cpp · Qwen3-32B (dense, Q4_K_M): 10.2 tok/s
  • vLLM · Qwen3-30B-A3B (MoE, FP8): 55.9 tok/s
  • vLLM · Qwen3-32B (dense, FP8): 6.6 tok/s
5 lanes · peak 88 tok/s

Recommended lane. llama.cpp · Qwen3-30B-A3B (MoE, Q4_K_M) — the lane the harness profile points at for an always-on Spark agent.
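
The selection rule implied here (fastest lane among those that pass the tool-call gate) can be sketched in a few lines. The throughput figures are the ones listed in this profile; the `usable` set is a hypothetical input, because per-lane gate results are covered in the field note rather than on this page.

```python
# Measured throughput (tok/s) per lane, copied from the lane list in this profile.
LANES = {
    "NIM · Nemotron-Nano-9B-v2": 27.7,
    "llama.cpp · Qwen3-30B-A3B (MoE, Q4_K_M)": 88.0,
    "llama.cpp · Qwen3-32B (dense, Q4_K_M)": 10.2,
    "vLLM · Qwen3-30B-A3B (MoE, FP8)": 55.9,
    "vLLM · Qwen3-32B (dense, FP8)": 6.6,
}


def pick_lane(lanes, usable):
    """Return the fastest lane among those that passed the tool-call gate.

    `usable` is a hypothetical set of lane names that passed the gate.
    """
    candidates = {name: tps for name, tps in lanes.items() if name in usable}
    if not candidates:
        raise ValueError("no lane passed the tool-call gate")
    return max(candidates, key=candidates.get)


def seconds_for(tokens, tok_per_s):
    """Rough decode time for `tokens` output tokens at a steady rate."""
    return tokens / tok_per_s


if __name__ == "__main__":
    # Assumption for illustration only: treat every lane as having passed the gate.
    best = pick_lane(LANES, usable=set(LANES))
    print(best)
    for name, tps in LANES.items():
        print(f"{name}: {seconds_for(1000, tps):.1f} s per 1,000 tokens")
```

At a steady rate, a 1,000-token reply takes roughly 11 seconds on the 88.0 tok/s lane and roughly two and a half minutes on the 6.6 tok/s lane, which is why throughput matters for an always-on agent once the gate is passed.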

How to load

License: free · MIT. Published to HuggingFace as a harness profile bundle (config files, lane recipe, doctor checklist).

from huggingface_hub import snapshot_download

local = snapshot_download("Orionfold/spark-hermes-profile")
print(local)  # local path to the harness bundle
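
Once the snapshot is on disk, a quick inventory helps confirm that the bundle arrived intact. This profile does not list the file names inside the bundle, so the sketch below makes no assumptions about them and simply walks whatever is under the returned path. `list_bundle` is a hypothetical helper, not part of `huggingface_hub`.

```python
from pathlib import Path


def list_bundle(root):
    """Return (relative path, size in bytes) for every file under `root`, sorted by path."""
    root = Path(root)
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )
```

Calling `list_bundle(local)` on the path returned by `snapshot_download` yields one tuple per file, which can be printed or compared against whatever manifest the bundle itself provides.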

Known drift

Every measurement has a measurement window. These are the bounds the harness profile is honest about.

  • Tool-call reliability sample size: format-error rate measured over 8 agentic tasks per lane; not a large-N guarantee.
  • Qwen3 context vs Hermes minimum: Qwen3 lanes serve at native 40,960 tokens; Hermes's 64K floor is bypassed via model.context_length + auxiliary.compression.context_length overrides.
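
To put the 8-task sample size in perspective, a confidence interval on the format-error rate shows how little a small sample can rule out. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not necessarily how the harness reports reliability. The 0-of-8 outcome is a hypothetical input, not a result from this profile.

```python
import math


def wilson_interval(errors, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = errors / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Hypothetical outcome: a lane with 0 format errors across its 8 tasks.
    low, high = wilson_interval(errors=0, trials=8)
    print(f"0/8 errors: interval {low:.3f} to {high:.3f}")
```

Under these assumptions, even a clean 0-of-8 run is statistically compatible with a true format-error rate of up to about 32%, so the gate is best read as a screening check.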

Companion field note

The harness profile pairs with the field note hermes-serving-lane-on-spark — read the article for the lane-bakeoff narrative, the unified-memory math, and the tool-call reliability gate that decided the recommended lane.
