harness
Which local lane should drive your always-on Spark agent?
Hermes runs the same agent loop over any OpenAI-compatible backend, but the lanes are not interchangeable: a fast lane that can't emit well-formed tool calls is useless to an agent. This profile measures the trade-off.
- Pick a local serving lane for a Hermes agent on the Spark
- Size a MoE vs dense model against the 128 GB unified-memory envelope
- Reproduce the tool-call reliability, tok/s, and sustained-load numbers
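For the sizing question, a back-of-the-envelope weight estimate is a reasonable starting point. The sketch below is a rough illustration, not the profile's own method. It assumes nominal parameter counts taken from the model names (30B and 32B), about 4.8 bits per weight for Q4_K_M, and 8 bits per weight for FP8. It ignores KV cache, activations, and runtime overhead, so real usage will be higher.

```python
# Rough weight-only footprint against the 128 GB unified-memory envelope.
# ASSUMPTIONS (not from the profile): nominal parameter counts read off the
# model names, ~4.8 bits/weight for Q4_K_M, 8 bits/weight for FP8.
UNIFIED_MEMORY_GB = 128  # envelope named in the text


def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1e9 bytes); excludes KV cache and overhead."""
    return params_billions * bits_per_weight / 8


# (nominal parameters in billions, assumed bits per weight)
candidates = {
    "Qwen3-30B-A3B (MoE, Q4_K_M)": (30.0, 4.8),
    "Qwen3-32B (dense, Q4_K_M)": (32.0, 4.8),
    "Qwen3-30B-A3B (MoE, FP8)": (30.0, 8.0),
    "Qwen3-32B (dense, FP8)": (32.0, 8.0),
}

footprints = {
    name: weight_footprint_gb(params, bits)
    for name, (params, bits) in candidates.items()
}

for name, gb in footprints.items():
    share = gb / UNIFIED_MEMORY_GB
    print(f"{name}: ~{gb:.1f} GB of weights ({share:.0%} of the envelope)")
```

Under these assumptions the MoE and dense models have similar resident footprints, and both fit comfortably. The difference between them is that the MoE model activates only about 3B parameters per token (the "A3B" in its name), so it reads far less weight data per decoded token.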
Audience: DGX Spark power users running a local, no-API-key agent harness.
| Variant | tok/s |
|---|---|
| NIM · Nemotron-Nano-9B-v2 | 27.7 |
| llama.cpp · Qwen3-30B-A3B (MoE, Q4_K_M; sweet spot) | 88.0 |
| llama.cpp · Qwen3-32B (dense, Q4_K_M) | 10.2 |
| vLLM · Qwen3-30B-A3B (MoE, FP8) | 55.9 |
| vLLM · Qwen3-32B (dense, FP8) | 6.6 |
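To make the gap in the table concrete, the sketch below computes the MoE-to-dense throughput ratio within each engine and the wall-clock time for a single reply. The tok/s figures are copied from the table. The 500-token reply length is a hypothetical chosen only for illustration.

```python
# Throughput figures copied from the table above (tokens per second).
tok_per_s = {
    "NIM · Nemotron-Nano-9B-v2": 27.7,
    "llama.cpp · Qwen3-30B-A3B (MoE, Q4_K_M)": 88.0,
    "llama.cpp · Qwen3-32B (dense, Q4_K_M)": 10.2,
    "vLLM · Qwen3-30B-A3B (MoE, FP8)": 55.9,
    "vLLM · Qwen3-32B (dense, FP8)": 6.6,
}

# MoE-to-dense speed ratio within each engine.
moe_vs_dense = {
    "llama.cpp": tok_per_s["llama.cpp · Qwen3-30B-A3B (MoE, Q4_K_M)"]
    / tok_per_s["llama.cpp · Qwen3-32B (dense, Q4_K_M)"],
    "vLLM": tok_per_s["vLLM · Qwen3-30B-A3B (MoE, FP8)"]
    / tok_per_s["vLLM · Qwen3-32B (dense, FP8)"],
}

REPLY_TOKENS = 500  # hypothetical reply length, for illustration only

# Seconds needed to decode one reply of REPLY_TOKENS on each lane.
seconds_per_reply = {lane: REPLY_TOKENS / rate for lane, rate in tok_per_s.items()}

for engine, ratio in moe_vs_dense.items():
    print(f"{engine}: MoE is ~{ratio:.1f}x faster than dense")
for lane, secs in seconds_per_reply.items():
    print(f"{lane}: ~{secs:.0f} s per {REPLY_TOKENS}-token reply")
```

On these numbers the MoE lane is roughly 8.6x faster than the dense lane under llama.cpp and roughly 8.5x faster under vLLM. For an agent that chains many tool calls per task, that gap compounds across every step.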
- Tool-call reliability sample size: the format-error rate is measured over 8 agentic tasks per lane; it is not a large-N guarantee.
- Qwen3 context vs Hermes minimum: the Qwen3 lanes serve at their native 40,960-token context; Hermes's 64K floor is bypassed via the model.context_length and auxiliary.compression.context_length overrides.
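The sample-size caveat can be quantified with a confidence interval. The sketch below uses the standard Wilson score interval to show how wide the uncertainty is at 8 tasks. The error count of 0 is a hypothetical input, not a result reported here, and the function name is illustrative.

```python
import math


def wilson_interval(errors: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an error rate; z=1.96 gives ~95% coverage."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= errors <= trials:
        raise ValueError("errors must be between 0 and trials")
    p = errors / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# HYPOTHETICAL input: 0 format errors observed across the 8 tasks of one lane.
low, high = wilson_interval(errors=0, trials=8)
print(f"95% interval for the true format-error rate: {low:.1%} to {high:.1%}")
```

Even a clean 8-of-8 run leaves the upper bound near 32%, so small differences in format-error rate between lanes should not be read as meaningful at this sample size.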