Reading

The Hermes Harness on a DGX Spark — A Local Cockpit That Holds Tools, With No API Key

Installing the Hermes agent harness on a DGX Spark and running the first local agent turn against the cached Nemotron-Nano-9B-v2 NIM — reliable tool calls, no API key, no cloud hop. The defensible angle is NIM-first; everyone else's Spark Hermes write-up leads with Ollama.

Series Harnesses

№45 agentic NIM 26 May 2026 ~1 hour, most of it the NIM's first cold-start intermediate NVIDIA DGX Spark Manav Sehgal

Terms in this piece3

Agent harnessThe software shell around a model that turns single completions into a loop: it parses the model's tool calls, executes them (shell, file read, web fetch), feeds results back, and repeats until the task is done. The model reasons; the harness acts. Hermes, Claude Code, Cursor, and Codex CLI are all harnesses. Swap the model behind one and the loop is unchanged.
Tool callingThe protocol by which a model asks the harness to run something. The model emits a structured tool_calls block — a function name plus JSON arguments — instead of (or alongside) prose; the harness runs the function and returns the result as a new message. It's the difference between a chatbot that describes reading a file and an agent that actually reads it. Reliability here is binary-critical: a malformed tool call stalls the whole loop.
MCP — Model Context ProtocolAn open standard for exposing tools and data to an agent as a uniform server interface. Hermes speaks it, which means later in this series I can expose fieldkit itself — quantize, measure, publish, retrieve — as MCP tools and let the harness operate the Spark, not just read its files. That's the keystone the series builds toward; today's read_file is the trailhead.

Every model I’ve published on this machine has been a thing you download and run — a quantized GGUF, a card, a notebook, a one-line pip of fieldkit. Useful, finished, inert. What’s been missing is the other half of the loop: the cockpit. Not another model to run, but the thing you actually drive the box from — the harness that turns a published model and an API into a daily-use personal agent that can read your files, run a command, and hand the result back to the model, all on one desk, with nothing leaving it.

This is the first article in a new series about exactly that. The harness is Hermes Agent (Nous Research, MIT-licensed), and the question this piece answers is the one that decides whether the whole series is worth writing: can a frontier open-source agent harness drive a model that runs entirely on the Spark — with reliable tool calls, and no API key? The answer is yes, and the load-bearing detail is which model. Every other DGX Spark Hermes write-up I’ve seen leads with Ollama. This one leads with the tuned NIM Nemotron lane — the same nemotron-nano-9b-v2-dgx-spark container I’ve measured at 325 tok/s — because that’s the lane nobody else documents and the one that makes the agent feel local instead of merely private.

Why a local cockpit is a different proposition

The three application arcs on this blog — a Second Brain that RAGs over my corpus, an LLM Wiki that compiles knowledge at ingest, a Machine that Builds Machines that runs experiments overnight — all answer what you run on the Spark. A harness answers a different question: what you drive it from, and how much of yourself you’re willing to hand it. The moment an agent can read your files and run commands, “private” stops being about where the weights live and starts being about where the tool calls resolve. A cloud-hosted agent that reaches into your home directory has to send the contents of that directory somewhere to reason about it. A local one doesn’t.

That’s the uber-theme tie for this series, and it’s sharper here than anywhere else on the blog: the Spark is always on at home, so a hardened local Hermes becomes a private always-on agent you can text from your phone — 100% local, no cloud, no API key, no per-token bill, escalating to a paid model only when a task genuinely needs one. Independence isn’t a nice-to-have for an agent that holds tools; it’s the whole point. Today is step one of that: install, wire it to the local NIM, and prove the tool-call loop closes without a key.

Where Hermes sits, and why NIM is the hero

Hermes is provider-agnostic. Out of the box it’ll talk to Anthropic, OpenAI, OpenRouter, or a dozen others if you hand it a key — but it also speaks the plain OpenAI /v1/chat/completions dialect, which is exactly what a local NIM serves. That’s the seam this series lives in: point Hermes’s provider at http://127.0.0.1:8000/v1 and the harness has no idea it’s no longer talking to the cloud. The agent loop, the 40-odd built-in tools, the skills system — all of it runs against a model that never leaves the Spark.

The whole turn happens between the two endpoints of your own LAN. The accent node is the only one that "thinks" — and it's local, so the decision to read your file never leaves the box.

NIM is the hero lane for a specific, measured reason: it ships the correct tokenizer, chat template, and engine config for Nemotron, where stock inference servers have historically mangled them. I trust this lane to emit well-formed tool calls in a way I don’t trust a hand-rolled server. The cost is cold-start and memory footprint, which the rest of this piece quantifies honestly.

Installing the harness

The install is a single piped script. I never pipe a remote script to a shell without reading it first, so I pulled install.sh down and walked its 2,071 lines: for a non-root user it git-clones into ~/.hermes/, builds a uv virtualenv, drops a hermes shim in ~/.local/bin, and only reaches for sudo to install optional niceties like ripgrep. No Docker pulls, nothing system-wide. Reversible with rm -rf ~/.hermes. With that confirmed:

curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
# ... clones, builds venv, bundles 90 skills to ~/.hermes/skills/
hermes --version
# Hermes Agent v0.14.0 (2026.5.16)

The install bundled 90 skills into ~/.hermes/skills/ on the way in — and they’re in the agentskills.io SKILL.md format, which is the same format Claude Code uses. That cross-compatibility is a thread I’ll pull hard in a later article; for now it’s a pleasant signal that the skills I’ve already written are portable. The thing I cared about today was the health check:

$ hermes doctor
◆ Python Environment        ✓ Python 3.11.15   ✓ Virtual environment active
◆ Required Packages         ✓ OpenAI SDK   ✓ HTTPX   ✓ PyYAML   ...
◆ Configuration Files       ✓ ~/.hermes/.env   ✓ ~/.hermes/config.yaml (v24)
◆ Directory Structure       ✓ skills/   ✓ memories/   ✓ SOUL.md   ...
◆ Tool Availability         ✗ discord (missing DISCORD_BOT_TOKEN)   ✗ spotify ...

Every core check is green — Python, packages, config, directory layout — which is the only part that gates a working agent on aarch64 / DGX OS. But hermes doctor also emits a wall of red ✗ marks below that, and the first time you run it the instinct is to panic. They’re integrations you haven’t configured: Discord, Spotify, web-search providers that want API keys you don’t have. None of them matter for a local agent. This honestly tripped my own tooling — when I codified the doctor parse into fieldkit, my first pass treated every ✗ as a failure and declared a clean install broken.

Wiring Hermes to the local NIM

Here’s the gotcha that would have cost me an hour if I hadn’t read the config comments carefully. Hermes ships a native nvidia provider — and it is not what you want. That provider points at build.nvidia.com, the cloud NIM endpoint, and demands an NVIDIA_API_KEY. For a model running on your own box you use the custom provider, the generic OpenAI-compatible path, with an explicit base_url. (Hermes aliases ollama, vllm, and llamacpp to custom too — they’re all the same code path.)

First the model. I started the cached NIM the way I always do — --network host, the cache mount, the NGC key from an env-file, and NIM_MAX_BATCH_SIZE=32, the batch knob I’d measured at 325 tok/s on this hybrid-Mamba model. It warmed in 145 seconds and settled at 91 GB used, 29 GB free — comfortably inside the 128 GB envelope, with no room to spare for a second large model, which is exactly the discipline the next article is about. Then I pointed Hermes at it:

hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8000/v1
hermes config set model.default  nvidia/nemotron-nano-9b-v2
# ~/.hermes/.env — a dummy key; the NIM accepts any non-empty bearer
OPENAI_BASE_URL=http://127.0.0.1:8000/v1
OPENAI_API_KEY=local

OPENAI_API_KEY=local is the quiet headline of the whole piece. It’s a placeholder the NIM doesn’t check — there is no real credential anywhere in this setup, no cloud account, no billing relationship. The harness is fully wired and there’s nothing to leak.

Does the lane actually do tool calls?

Before trusting the full agent loop, I tested the narrow thing the whole series hinges on: can the Nemotron NIM emit a well-formed tool call at all? A harness can’t paper over a model that can’t. One direct /v1/chat/completions with a tools array and tool_choice: auto settled it:

finish_reason: "tool_calls"
tool_calls: [{"type":"function",
  "function":{"name":"get_weather","arguments":"{\"city\": \"Paris\"}"}}]

Clean. Correct function, correct JSON arguments, the right finish_reason. The model’s reasoning showed up in the content field ahead of the call, which is the Nemotron <think>-prefix behavior I’ve documented elsewhere — but the structured call came through untouched. The lane can do the agent-critical thing. Now the harness.

The first local agent turn

Hermes has a headless one-shot mode — hermes -z "<prompt>" — that prints only the reply, and a --yolo flag that bypasses the interactive tool-approval prompt so it runs unattended. I planted a file with a known phrase and asked the agent to use a tool to read it back. This forces a real read_file call routed through the local model — the full loop, not a chat completion:

$ echo "The secret pass-phrase is ORIONFOLD-NIM-7741." > secret.txt
$ hermes -z "Read secret.txt and tell me the exact pass-phrase. Use your tools." --yolo

[reasoning] ... I used the read_file tool with a limit of 500 lines and an
offset of 1 ... the file has one line containing the pass-phrase ...
The secret pass-phrase in `secret.txt` is: **ORIONFOLD-NIM-7741**.

That’s the loop closing. The local model decided to call read_file, Hermes executed it against my filesystem, fed the contents back, and the model composed the answer — every step on the Spark, no key, no network. The reasoning trace even names the exact tool and arguments it chose. A chatbot would have told me it couldn’t read the file; the harness read it.

One turn is an anecdote, so I ran a small battery of four tasks that each force a different tool — a directory listing, a line count, a create-then-read round-trip, and a shell command — and recorded whether each produced a well-formed tool call and a correct final answer:

Task	Tool	Wall	Tool call	Answer
read a planted phrase	`read_file`	~40 s	✅	✅ exact
count lines in a file	`read_file`	42 s	✅	✅ “4 lines”
create + read back a file	write + read	72 s	✅	✅ verified
today’s date via shell	shell	44 s	✅	✅ `2026-05-26`
list a directory	list/glob	41 s	✅	❌ reported “empty”

Four of five tool calls were well-formed with zero format errors; three of five final answers were fully correct. The one miss is honest and worth keeping: on the directory listing, the model called the tool, got the results, and then summarized them wrong — reported the folder empty when it wasn’t. That’s a small-model reasoning slip, not a harness or tool-format failure, and it’s precisely why the next article measures tool-call reliability as a first-class number across serving lanes rather than asserting it.

Codifying the path in fieldkit

Everything above is reproducible by hand, but I don’t want to rebuild the NIM launch recipe and the config-rendering from memory every session, so it’s now a small deterministic surface in fieldkit — the same package that backs the rest of this blog. The lane launch, the warm-wait, the unified-memory guard, and the provider: custom config all collapse to a few lines:

from fieldkit.harness import LaneSpec, serve_lane, configure_hermes

# Brings the NIM up (guarded against OOM-stacking via fieldkit.capabilities),
# waits for warm, tears it down on exit — one model at a time.
with serve_lane(LaneSpec("nim", "nemotron-nano-9b-v2-dgx-spark", port=8000)) as lane:
    config, env = configure_hermes(lane=lane, model="nvidia/nemotron-nano-9b-v2")
    # config.render() -> the model: block of ~/.hermes/config.yaml
    # env.render()    -> ~/.hermes/.env  (base_url + a dummy key + slow-serving timeout)

The serve_lane guard reuses the same fieldkit.capabilities memory math the rest of the blog uses for envelope sizing — it refuses to start a lane that would tip the 128 GB budget, and the context manager’s teardown is what enforces the one-model-at-a-time rule the NIM’s 91 GB footprint demands. It’s deliberately thin. The harness module isn’t trying to be Hermes; it’s trying to make the Spark-specific parts — the NIM recipe, the memory guard, the config shape — repeatable.

What this unlocks

With the cockpit installed and proven local, three things are newly buildable this week. A private file agent that triages a directory, summarizes documents, and renames things on request — pointed at your actual home folder, because the reasoning never leaves it. A no-bill scripting assistant wired into a shell binding, where “ask the agent to write and run a one-off script” costs electricity and nothing else, so you stop rationing the calls the way a metered API trains you to. And a foundation for the always-on phone agent that closes the series: the same hermes -z turn, reached through Hermes’s messaging gateway, hardened, answering from your desk while you’re out.

The honest caveat is the one the battery surfaced: a 9B model is a capable actor but a fallible reasoner. It will occasionally execute a perfect tool call and then misread the result. For agent work where a wrong summary is cheap to catch, that’s fine; where it isn’t, the answer is a bigger lane or a verifier — both of which the next two articles are about.

Closing

The DGX Spark earns its keep here by collapsing a distance that the cloud keeps wide: the distance between the agent’s reasoning and your data. A local harness driving a local NIM means the decision to read a file, run a command, or call a tool happens on the same machine the file lives on — no key, no hop, no bill, 145 seconds from cold to a closed agent loop. That’s a different kind of “private” than a local model alone buys you, and it’s the foundation the rest of this series is built on.

Next up: the serving-lane bakeoff — Qwen3 35B-A3B MoE versus a 27B dense model on the 128 GB envelope, measured on tok/s, sustained load, and the number that actually decides a harness’s worth: tool-call reliability per lane. The cockpit is installed; now we make it fast without tipping the box over.