Stage
Inference
Serving models locally. NIM, Triton, TensorRT-LLM — what each replaces in your API bill, what each costs in cold-start and memory.
Cost-Routing the Hermes Harness — When Local Stops Being Enough on a DGX Spark
The local 30B-MoE on a Spark is at $0 marginal cost — until it isn't. H6 measures the failure-mode curve: where does local stop being enough, and what does the dollar curve look like when you escalate to OpenRouter only when you have to?
uses fieldkit.harness, fieldkit.eval
The Hermes Vertical Router on a DGX Spark — One Brain Always Warm, Five Specialists Summoned on Demand
Five published Orionfold verticals plus the pinned MoE brain become a router on one Spark — not by parallel inference (the unified-memory envelope forbids that), but by a deterministic keyword classifier that dispatches the prompt and serves the right specialist one-at-a-time.
uses fieldkit.harness
Picking the Hermes Brain on a DGX Spark — When Throughput Stops Being the Answer
The Hermes serving-lane bakeoff couldn't pick a winner: all five lanes cleared the tool-call format bar. A graded brain-quality rubric breaks the tie — and shows the fastest serving lane is also the better agent, by a margin throughput could never have measured.
uses fieldkit.eval, fieldkit.harness
The Hermes Serving Lane on a DGX Spark — MoE vs Dense, and the Number That Actually Picks the Lane
Five Hermes serving lanes on one DGX Spark: Qwen3-30B-A3B MoE vs Qwen3-32B dense across vLLM, llama.cpp, and NIM. The MoE runs ~8.5× faster for the same memory, but the lane is picked by tool-call reliability, and it took two config fights to get the tool-call failure rate to 0% on every lane.
uses fieldkit.capabilities, fieldkit.harness, fieldkit.nim
The Hermes Harness on a DGX Spark — A Local Cockpit That Holds Tools, With No API Key
Installing the Hermes agent harness on a DGX Spark and running the first local agent turn against the cached Nemotron-Nano-9B-v2 NIM — reliable tool calls, no API key, no cloud hop. The defensible angle is NIM-first; everyone else's Spark Hermes write-up leads with Ollama.
uses fieldkit.nim, fieldkit.capabilities, fieldkit.harness
Pass@k After the Seventh Patch — Three Shapes ESamp Takes on Spark
Patches were six. The Pass@k harness surfaced a seventh — a one-line slice in the residual tap that only fires once batches shrink mid-run. Once cleared, ESamp takes three shapes: flat on saturated cells, lifting both rates on instruct headroom, and +6.67pp pass@8 on the unsaturated reasoning cell.
uses fieldkit.eval, fieldkit.capabilities
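For readers new to the metric, pass@k is commonly computed with the unbiased estimator from the code-generation evaluation literature: given n samples per problem of which c pass, it estimates the chance that at least one of k drawn samples passes. The sketch below assumes that estimator; the article's own harness may compute it differently, and the function name `pass_at_k` is illustrative rather than part of fieldkit.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples generated, c: samples that passed, k: draw size.
    Assumes the standard estimator 1 - C(n-c, k) / C(n, k); this is an
    illustration, not the article's harness code.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 16 samples, 2 correct, evaluated at k = 8.
example = pass_at_k(16, 2, 8)
```

Averaging this value over every problem in a cell gives the cell-level pass@k; a percentage-point lift is then the difference between two such averages.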
Two Patches Were Six — ESamp Lands at 97.4% on a Patched Spark
Article #2 closed at two patches. Applying them surfaced six — including the silent return-shape adapter that broke the consumer's port. Once cleared, ESamp lands at 97.4% of baseline on patched Qwen 2.5 7B, within 1.4 pp of the paper's reference.
uses fieldkit.eval, fieldkit.capabilities
Test-Time Distilling on Spark — Same Compute Envelope, Wider Semantic Reach
ESamp adds a tiny test-time-trained probe to vLLM that converts decoding from lexical resampling into semantic exploration. The runtime is vLLM-native — and that is a Spark catalog-gap story before it is a benchmark.
uses fieldkit.eval, fieldkit.capabilities
AutoResearchBench on Spark — Two NIMs, One Bench, Two Failure Modes
Two Spark-tuned NIMs run AutoResearchBench's three Deep-Research example questions. Llama-3.1-8B crashes by turn 5-6 on its 8K context; Nemotron-Nano-9B-v2 finishes cleanly at 128K. Both score 0% Accuracy@1 — for completely different reasons.
uses fieldkit.nim, fieldkit.eval, fieldkit.capabilities
Looking Beyond Spark — KV-Cache Arithmetic at Inference
The serving memory bill is not weights. It's KV cache, and KV scales with concurrent users × context length, not parameters. Same four bills as training; different weights. A 70B at 32 users × 16k context wants 168 GB just for KV — and the Spark teaches you the per-token math.
uses fieldkit.capabilities
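The 168 GB figure can be reproduced with the usual per-token KV formula: two tensors (K and V) per layer, each of size KV-heads × head-dim, times bytes per element. The sketch below assumes a Llama-style 70B shape (80 layers, 8 KV heads under grouped-query attention, head dim 128), FP16 cache entries, and "16k" read as 16,000 tokens; none of those values are stated in the blurb, so treat them as illustrative.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   bytes_per_elem: int, users: int, context_tokens: int) -> int:
    """Total KV-cache bytes for `users` concurrent sequences of `context_tokens`."""
    # K and V each store kv_heads * head_dim elements per layer per token.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * users * context_tokens


# Assumed Llama-style 70B shape with an FP16 cache (not taken from the article).
total = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                       bytes_per_elem=2, users=32, context_tokens=16_000)
print(f"{total / 1e9:.1f} GB")  # 167.8 GB
```

Note that the parameter count never appears in the formula: doubling users or context doubles the bill, while the weights stay a fixed cost.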
Second Brain as a Tool — Wrapping the RAG Stack in MCP for Claude Code
Closing the Second Brain arc. Four MCP tools wrap the RAG chain — embed, retrieve, optionally rerank, generate — and any Claude Code session anywhere on the box becomes a grounded research client. 200 lines of Python, one launcher, one .mcp.json entry.
Ragas, Reranked — What 44 Held-Out Questions Say About the Second Brain Stack
A Ragas-style harness written in 200 lines of stdlib Python, run locally on the DGX Spark, against four variants of the Second Brain RAG chain. Naive RAG scores 3.30 / 5. Rerank RAG scores 4.27. LoRA+RAG is a surprise — it does not beat naive. Retrieval is where the points come from.
uses fieldkit.eval
One Rail, Three Policies — NeMo Guardrails on the Retrieval Path
NeMo Guardrails drops a policy gate between retrieval and generation. One install, three per-arc configs — PII for Second Brain, style for LLM Wiki, code-safety for Autoresearch — and a 15-query benchmark: 100% block recall, 100% clean pass. Rails are scaffolding; detectors are the content.
uses fieldkit.rag
Bigger Generator, Same Grounding — 8B vs 49B vs 70B on One Retrieval Chain
The rerank-and-fusion article bet that a bigger generator would heal the 8B Google-IPO refusal. Ran the A/B across three sizes on one retrieval chain. Bet lost: Nemotron-Super-49B refuses more often than the 8B baseline; Llama 3.3 70B narrows the gap but does not close it. The refusal was the scaffold working.
uses fieldkit.rag
Hybrid Retrieval on the Spark — BM25, Dense, Fusion, Rerank
Four retrieval modes on one corpus — naive dense, BM25, Reciprocal Rank Fusion, Nemotron rerank. Dense is already 92% recall@5; rerank adds a point at K=10 and reorders the top. The 8B generator still refuses where retrieval is perfect — grounding, not retrieval, is the new bottleneck.
uses fieldkit.rag
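Reciprocal Rank Fusion merges ranked lists without needing comparable scores: each document earns 1 / (k + rank) from every list it appears in, and the sums are sorted. The sketch below is a minimal version assuming the commonly used constant k = 60 and made-up document ids; it is not the article's implementation.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc ids into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; documents missing from a list contribute nothing.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 lists from a dense retriever and from BM25.
dense = ["d3", "d1", "d7"]
bm25 = ["d1", "d9", "d3"]
fused = reciprocal_rank_fusion([dense, bm25])
print(fused)  # ['d1', 'd3', 'd9', 'd7']
```

Documents that both retrievers rank highly rise to the top, which is why fusion tends to reorder the head of the list more than it changes recall.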
Three Endpoints, One Answer — Naive RAG on a DGX Spark
Three endpoints in one curl chain — a query embeds through Nemotron, pgvector returns top-5 chunks in under 80 ms, and a Llama 3.1 8B NIM stuffs them into a strict-context prompt. The chain works; the 8B generator still refuses on questions its own context answers.
uses fieldkit.rag, fieldkit.eval
Where Your Vectors Live — pgvector on a DGX Spark
The substrate between the embed call and the retrieve call — pgvector 0.8.2 running as a Postgres 16 container on GB10, with 1000 Nemotron vectors, HNSW and ivfflat both indexed, and a planner that prefers seq scan until you tell it otherwise.
uses fieldkit.rag
Your Own Semantic Space — a Nemotron Embedding NIM on a DGX Spark
The embedding endpoint that every downstream RAG, wiki, and agent piece will reuse — a 2048-dim Nemotron Retriever NIM running locally on GB10, ready 52 seconds after docker run and holding 28 docs/s under batched load.
uses fieldkit.rag
Your First NIM on a DGX Spark — What 24.8 Tokens Per Second Doesn't Tell You
First-contact notes on NVIDIA's DGX-Spark-specific Llama 3.1 8B NIM. 9.4 GB image, ~108 s warm-cache cold-start, 24.8 tok/s steady, OpenAI-compatible on :8000 — and a confidently wrong Python one-liner that clarifies what small-model FP8 buys and what it costs.
uses fieldkit.nim
Field-Fixing the Hermes Harness on a DGX Spark — When the NIM Won't Stream Tool Calls, and Other Rough Edges
Fifth in the Harnesses series: the field fixes that take a fresh Hermes agent on a local NIM from 'mostly works' to 'just works.' Leads with the one that bit hardest — the Spark NIM ships a non-streaming tool parser, fixed by bind-mounting NVIDIA's own streaming parser.
uses fieldkit.harness
Gates Before the Advisor — Recall Floors, Raw-Base Preflights, and the Bench That Ate Its Own Spec
Before the Advisor trained: a 182-source corpus pack with recall gates on two retrieval lanes (BM25 and live pgvector + NIM embedder), raw-base preflights that failed two NVIDIA bases honestly, and the rebuild that caught the bench's own spec contaminating its retrieval context.
RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation — Spark reproduction notes
Reproducing the RaguTeam SemEval-2026 T8 winning system on a DGX Spark — judge-orchestrated 7-LLM ensemble (Qwen3-4B-FP8 + Meno-Lite-0.1 7B local + remote members) with Qwen3-32B judge, then extracting the pattern into `fieldkit.ensemble` + `fieldkit.judge`.