spark-hermes-profile
Which local lane should drive your always-on Spark agent?
Positioning
Hermes runs the same agent loop over any OpenAI-compatible backend, but the lanes are not interchangeable: a fast lane that can't emit well-formed tool calls is useless to an agent. This profile measures the trade-off.
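To make "well-formed tool call" concrete, here is a minimal sketch of a check for the OpenAI-style tool-call shape, where `function.arguments` is a JSON string. The helper name `is_well_formed_tool_call` and the `known_tools` mapping (tool name to required argument names) are illustrative assumptions, not part of Hermes or the profile bundle.

```python
import json


def is_well_formed_tool_call(call, known_tools):
    """Return True if `call` looks like a usable OpenAI-style tool call.

    `known_tools` is a hypothetical mapping of tool name -> list of
    required argument names; it is not taken from the harness bundle.
    """
    if not isinstance(call, dict):
        return False
    fn = call.get("function")
    if not isinstance(fn, dict):
        return False
    name = fn.get("name")
    if not isinstance(name, str) or name not in known_tools:
        return False
    raw = fn.get("arguments")
    if not isinstance(raw, str):
        return False
    try:
        args = json.loads(raw)  # arguments arrive as a JSON-encoded string
    except json.JSONDecodeError:
        return False
    if not isinstance(args, dict):
        return False
    # Every required argument must be present.
    return set(known_tools[name]) <= set(args)
```

A lane that often fails a check like this forces the agent loop to retry or abort, which is why raw throughput alone does not decide the lane.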
- Pick a local serving lane for a Hermes agent on the Spark
- Size a MoE vs dense model against the 128 GB unified-memory envelope
- Reproduce the tool-call-reliability + tok/s + sustained-load numbers
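For the sizing question, a rough back-of-the-envelope sketch can help. It assumes weight memory is approximately total parameters times bits per weight, and treats KV cache and runtime overhead as a single number you supply. The function names and the example figures (30e9 parameters, 4.5 bits per weight, 10 GB overhead) are illustrative assumptions, not measurements from the profile.

```python
def weight_gb(total_params, bits_per_weight):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


def fits_envelope(total_params, bits_per_weight, overhead_gb, envelope_gb=128.0):
    """True if weights plus a caller-supplied overhead fit the envelope.

    `overhead_gb` stands in for KV cache, activations, and everything
    else sharing unified memory; it is an input, not something estimated here.
    """
    return weight_gb(total_params, bits_per_weight) + overhead_gb <= envelope_gb


# Illustrative numbers only: a ~30B-parameter model at ~4.5 bits per weight.
print(round(weight_gb(30e9, 4.5), 3))
print(fits_envelope(30e9, 4.5, overhead_gb=10.0))
```

Note that for a MoE model the memory footprint follows the total parameter count, since all experts must be resident, while per-token compute follows the much smaller active parameter count. A dense model of the same total size costs the same memory but more compute per token.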
Audience. DGX Spark power users running a local, no-API-key agent harness.
Lane variants
The harness profile records every serving lane that was driven through the same Hermes agent on the same DGX Spark. Throughput is recorded as measured tokens-per-second on the box.
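The profile does not spell out how tokens-per-second is computed, so the following is only one plausible sketch: count tokens from a streaming iterator and divide by wall-clock time. `measure_tokens_per_second` and its `clock` parameter are hypothetical names; because the timer starts before the first token arrives, this variant includes prompt-processing time.

```python
import time


def measure_tokens_per_second(token_stream, clock=time.perf_counter):
    """Consume an iterable of tokens and return tokens per second.

    The clock is injectable so the function can be tested without
    a real backend. Timing starts before the first token, so
    time-to-first-token is included in the result.
    """
    start = clock()
    count = 0
    for _ in token_stream:
        count += 1
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return count / elapsed
```

In practice `token_stream` would wrap the streamed response of whichever OpenAI-compatible backend the lane uses. Whether one streamed chunk equals one token depends on the server, so treat chunk counts as an approximation.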
Recommended lane. llama.cpp · Qwen3-30B-A3B (MoE, Q4_K_M) — the lane the harness profile points at for an always-on Spark agent.
How to load
License: MIT (free). Published to Hugging Face as a harness profile bundle (config files, lane recipe, doctor checklist).
from huggingface_hub import snapshot_download
local = snapshot_download("Orionfold/spark-hermes-profile")
print(local)  # local path to the harness bundle
Known drift
Every measurement holds only within the window it was taken in. These are the bounds the harness profile states explicitly.
- Tool-call reliability sample size: the format-error rate is measured over 8 agentic tasks per lane, so it is not a large-N guarantee.
- Qwen3 context vs Hermes minimum: Qwen3 lanes serve at their native 40,960 tokens; Hermes's 64K floor is bypassed via the model.context_length and auxiliary.compression.context_length overrides.
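To see why 8 tasks per lane is not a large-N guarantee, a Wilson score interval on the error rate is a useful sketch. It is a standard statistical formula, not something the harness is known to compute, and the 0-errors-in-8 case below is a hypothetical input rather than a reported result.

```python
import math


def wilson_interval(errors, trials, z=1.96):
    """Wilson score interval for an error rate (z=1.96 ~ 95% confidence)."""
    if trials <= 0 or not 0 <= errors <= trials:
        raise ValueError("need trials > 0 and 0 <= errors <= trials")
    p = errors / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical case: no format errors observed across 8 tasks.
low, high = wilson_interval(0, 8)
print(f"{low:.3f} to {high:.3f}")
```

Even with zero observed errors, the upper bound at this sample size is roughly 32%, so a clean run of 8 tasks is compatible with a lane that still fails a meaningful fraction of tool calls.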
Companion field note
The harness profile pairs with the field note hermes-serving-lane-on-spark — read the article for the lane-bakeoff narrative, the unified-memory math, and the tool-call reliability gate that decided the recommended lane.