spark-hermes-cost-router

When does local stop being enough? Measure first, then route.

Positioning

A Spark holds one strong model warm at a time and pays no per-token cost for it. A frontier model on OpenRouter is the ceiling, billed per token. The interesting decision is *when to escalate*, and the only honest answer is the measured leak rate, not the 60-80% cost-savings figure quoted in public docs. This router ships the predicates that make that decision, plus the snapshot prices that let you reproduce the dollar curve.

Audience. DGX Spark power users running a local-first agent harness who want to escalate to frontier only when local can't reliably answer — and to *know* what that fraction is.
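To make the idea of an escalation predicate concrete, here is a minimal sketch of a length-based rule. It is not the router's shipped code: `TierRule`, `COMPLEX`, and `pick_lane` are hypothetical names, the inclusive `>=` comparison is an assumption, and so is the mapping from the complex tier to the frontier lane. The only value taken from this profile is the complex-tier `min_input_tokens=3000`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierRule:
    """Hypothetical rule: escalate when the prompt is at least this long."""
    name: str
    min_input_tokens: int


# 3000 is the complex-tier threshold this profile reports; everything
# else about the rule is an illustrative assumption.
COMPLEX = TierRule(name="complex", min_input_tokens=3000)


def pick_lane(input_tokens: int, rule: TierRule = COMPLEX) -> str:
    """Return "frontier" when the rule fires, otherwise stay on "local"."""
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")
    return "frontier" if input_tokens >= rule.min_input_tokens else "local"


if __name__ == "__main__":
    for n in (500, 2999, 3000, 12000):
        print(n, pick_lane(n))
```

Keeping the threshold in a small immutable record means re-tuning for a different workload is a one-value change rather than an edit to the routing logic.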

Lane variants

The harness profile records every serving lane that was driven through the same Hermes agent on the same DGX Spark. Throughput is recorded as measured tokens-per-second on the box.

  • Local Spark — Qwen3-30B-A3B MoE Q4_K_M
  • OpenRouter cheap-tier — gpt-4o-mini
  • OpenRouter frontier — claude-opus-4.1

How to load

License: MIT (free). Published to HuggingFace as a harness profile bundle (config files, lane recipe, doctor checklist).

from huggingface_hub import snapshot_download

local = snapshot_download("Orionfold/spark-hermes-cost-router")
print(local)  # local path to the harness bundle

Known drift

Every measurement has a measurement window. These are the bounds the harness profile is honest about.

Suite size
12 prompts × N=3 attempts per strategy, or 36 calls per strategy; the 108 calls per full run imply three strategies. This is not a large-N guarantee; production workloads will exhibit their own leak rates.
OpenRouter snapshot prices
Captured 2026-05-28T14:32:06.836115+00:00. openai/gpt-4o-mini = $0.15 per 1M input + $0.60 per 1M output; anthropic/claude-opus-4.1 = $15.00 per 1M input + $75.00 per 1M output. Prices change; re-snapshot before reproduction.
Leak rate
33.3% measured leak rate. The figure is specific to this 12-prompt suite's synthetic-but-graded difficulty distribution.
Token threshold
The complex-tier threshold `min_input_tokens=3000` was tuned to this suite. A workload with a different ratio of long to short prompts should re-tune this single integer.
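The dollar curve can be reproduced from the snapshot prices with a few lines of arithmetic. The sketch below is illustrative only: `call_cost` and `blended_cost` are hypothetical helper names, the token counts in the usage lines are made-up inputs, and it assumes "leak rate" means the fraction of requests that escalate to a billed lane while local calls cost nothing per token.

```python
# Snapshot prices in USD per 1M tokens (input, output), as captured
# on 2026-05-28. Re-snapshot before relying on these numbers.
PRICES = {
    "openai/gpt-4o-mini": (0.15, 0.60),
    "anthropic/claude-opus-4.1": (15.00, 75.00),
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one billed call at the snapshot prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def blended_cost(leak_rate: float, model: str,
                 input_tokens: int, output_tokens: int) -> float:
    """Expected USD per request when only leaked requests are billed.

    Assumption: local calls are free per token, so the expected cost is
    the leak rate times the cost of one escalated call.
    """
    if not 0.0 <= leak_rate <= 1.0:
        raise ValueError("leak_rate must be between 0 and 1")
    return leak_rate * call_cost(model, input_tokens, output_tokens)


if __name__ == "__main__":
    # Hypothetical request shape: 3000 input tokens, 500 output tokens.
    for model in PRICES:
        always = call_cost(model, 3000, 500)
        routed = blended_cost(1 / 3, model, 3000, 500)
        print(f"{model}: always={always:.5f} USD, routed={routed:.5f} USD")
```

Under this model the routed cost scales linearly with the leak rate, which is why the measured rate matters more than any headline savings figure.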

Companion field note

The harness profile pairs with the field note hermes-cost-routing-local-and-openrouter — read the article for the lane-bakeoff narrative, the unified-memory math, and the tool-call reliability gate that decided the recommended lane.
