Stage

Deployment

From experiment to something that runs reliably. Containers, services, updates, graceful degradation on one machine.

Article №51 agentic Foundation 28 May 2026 ~4 hours including the OpenRouter bakeoff + harness publish

Cost-Routing the Hermes Harness — When Local Stops Being Enough on a DGX Spark

The local 30B-MoE on a Spark is at $0 marginal cost — until it isn't. H6 measures the failure-mode curve: where does local stop being enough, and what does the dollar curve look like when you escalate to OpenRouter only when you have to?

uses fieldkit.harness, fieldkit.eval
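The escalate-only-when-needed idea in the teaser above can be sketched as a local-first router. This is a minimal illustration, not the article's harness: `route`, `RouteResult`, the acceptance check, and the flat per-call remote price are all hypothetical names and placeholder values, and the two model callables are stubs standing in for a local endpoint and a hosted one.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RouteResult:
    answer: str
    lane: str        # "local" or "remote"
    cost_usd: float  # marginal cost attributed to this call


def route(
    prompt: str,
    local: Callable[[str], str],
    remote: Callable[[str], str],
    is_acceptable: Callable[[str], bool],
    remote_cost_usd: float,
) -> RouteResult:
    """Try the local model first; escalate only if its draft fails the check."""
    draft = local(prompt)
    if is_acceptable(draft):
        return RouteResult(draft, "local", 0.0)
    return RouteResult(remote(prompt), "remote", remote_cost_usd)


# Stub models (assumptions for illustration only).
def stub_local(prompt: str) -> str:
    # Pretend the local model returns nothing usable on "hard" prompts.
    return "" if "hard" in prompt else f"local: {prompt}"


def stub_remote(prompt: str) -> str:
    return f"remote: {prompt}"


def non_empty(draft: str) -> bool:
    return bool(draft.strip())


PLACEHOLDER_REMOTE_COST = 0.002  # made-up flat price per escalated call

easy = route("easy question", stub_local, stub_remote, non_empty, PLACEHOLDER_REMOTE_COST)
hard = route("hard question", stub_local, stub_remote, non_empty, PLACEHOLDER_REMOTE_COST)
total_cost = easy.cost_usd + hard.cost_usd
```

Summing `cost_usd` over a batch of prompts gives one point on a dollar curve; a stricter or looser `is_acceptable` moves that point.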

Article №48 agentic Foundation 26 May 2026 ~3 hours, including the live tool-call gate against a local NIM

Harnesses

Hermes Drives the Spark via fieldkit-as-MCP — The Agent That Operates Its Own Machine

The keystone of the Harnesses series: expose a curated slice of fieldkit as MCP tools and the local Hermes agent can measure, quantize, publish, and retrieve on the box itself. The gate is a real llama-bench run the agent drove end-to-end — 0% tool-call format error, no API key.

uses fieldkit.harness, fieldkit.capabilities, fieldkit.quant, fieldkit.publish, fieldkit.rag

Article №46 deployment NIM 26 May 2026 ~3 hours, most of it model pulls and four cold-starts

Harnesses

The Hermes Serving Lane on a DGX Spark — MoE vs Dense, and the Number That Actually Picks the Lane

Five Hermes serving lanes on one DGX Spark: Qwen3-30B-A3B MoE vs Qwen3-32B dense across vLLM, llama.cpp, and NIM. The MoE runs ~8.5× faster for the same memory — but the lane is picked by tool-call reliability, which took two config fights to get to 0% everywhere.

uses fieldkit.capabilities, fieldkit.harness, fieldkit.nim

Article №45 agentic NIM 26 May 2026 ~1 hour, most of it the NIM's first cold-start

Harnesses

The Hermes Harness on a DGX Spark — A Local Cockpit That Holds Tools, With No API Key

Installing the Hermes agent harness on a DGX Spark and running the first local agent turn against the cached Nemotron-Nano-9B-v2 NIM — reliable tool calls, no API key, no cloud hop. The defensible angle is NIM-first; everyone else's Spark Hermes write-up leads with Ollama.

uses fieldkit.nim, fieldkit.capabilities, fieldkit.harness

Article №40 deployment llama.cpp 16 May 2026 ~5 hours end-to-end on a DGX Spark

Machine that Builds Machines

Orionfold/II-Medical-8B-GGUF on Spark — five medical-reasoning variants, MedMCQA mini-eval, ChatML reasoning format

Five GGUF variants of Intelligent-Internet/II-Medical-8B (Qwen3-8B + DAPO reasoning recipe) measured on a DGX Spark. Q5_K_M lands at 36.4 tok/s, 5.45 GB, and 52% on a MedMCQA n=50 mini-eval — above F16. First reasoning recipe in the series.

uses fieldkit.quant, fieldkit.publish, fieldkit.eval, fieldkit.lineage

Article №39 deployment llama.cpp 15 May 2026 ~5 hours end-to-end on a DGX Spark

Machine that Builds Machines

Orionfold/SecurityLLM-GGUF on Spark — five cyber variants, CyberMetric mini-eval, MCQ letter scoring

Five GGUF variants of ZySec-AI/SecurityLLM measured on a DGX Spark — Q4_K_M scores 40% on CyberMetric MCQ at 47.7 tok/s and 4.1 GB; the smaller variants matched or beat F16's 34%. Third vertical card; zero fieldkit source changes.

uses fieldkit.quant, fieldkit.publish, fieldkit.eval, fieldkit.lineage

Article №38 deployment llama.cpp 14 May 2026 ~5 hours end-to-end on a DGX Spark

Machine that Builds Machines

Orionfold/Saul-7B-Instruct-v1-GGUF on Spark — five legal variants, LegalBench mini-eval, four-axis measurement card

Five GGUF variants of Equall/Saul-7B-Instruct-v1 measured on a DGX Spark — Q5_K_M scores 72% on LegalBench (n=50, contains) at 20 tok/s and 4.8 GB. Each card carries perplexity, sustained tok/s, thermal envelope, and a 5-task LegalBench subset score.

uses fieldkit.quant, fieldkit.publish, fieldkit.eval, fieldkit.lineage

Article №37 deployment llama.cpp 14 May 2026 ~6 hours end-to-end on a DGX Spark

Machine that Builds Machines

Orionfold/finance-chat-GGUF on Spark — five variants, FinanceBench mini-eval, four-axis measurement card

Five GGUF variants of AdaptLLM/finance-chat measured on a DGX Spark — Q8_0 perplexity-matches F16 losslessly, Q4_K_M ships at 31 tok/s. Each card carries perplexity, sustained tok/s, thermal envelope, and FinanceBench accuracy.

uses fieldkit.quant, fieldkit.publish, fieldkit.eval, fieldkit.lineage

Article №27 foundations TensorRT-LLM 30 Apr 2026 ~22 minute read

Looking Beyond Spark

Looking Beyond Spark — KV-Cache Arithmetic at Inference

The serving memory bill is not weights. It's KV cache, and KV scales with concurrent users × context length, not parameters. Same four bills as training; different weights. A 70B at 32 users × 16k context wants 168 GB just for KV — and the Spark teaches you the per-token math.

uses fieldkit.capabilities
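The 168 GB figure in the teaser above can be checked with the usual per-token KV formula. The model shape here is an assumption, not something the teaser states: a Llama-style 70B with grouped-query attention (80 layers, 8 KV heads, head dimension 128), FP16 cache entries, "16k" read as 16,000 tokens, and decimal gigabytes. Under those assumptions the arithmetic lands on roughly 168 GB.

```python
def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    head_dim: int,
    bytes_per_elem: int,
    users: int,
    context_tokens: int,
) -> int:
    """KV cache size: one K and one V vector per layer, per KV head, per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * users * context_tokens


# Assumed shape (not from the article): Llama-style 70B with GQA, FP16 cache.
LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES = 80, 8, 128, 2

per_token_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES, 1, 1)
total_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES, 32, 16_000)

print(f"per token: {per_token_bytes / 1024:.0f} KiB")
print(f"32 users x 16k context: {total_bytes / 1e9:.0f} GB")
```

Parameter count does not appear in the formula at all; only the layer count, KV head layout, precision, and the users × context product do.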

Article №13 deployment TensorRT-LLM + Triton Inference Server 23 Apr 2026 ~4 hours including two container pulls and three engine builds

Second Brain

TensorRT-LLM on the Spark — FP8 Isn't the Reason to Drop NIM. NVFP4 Is.

Dropping below NIM to raw TensorRT-LLM on a GB10 Spark. FP8 beats NIM's vLLM by 10-15% — barely worth the rebuild. NVFP4 beats it by 76% on decode, 43% on TTFT, and ships a 34%-smaller engine. The reason to drop NIM is the Blackwell-native 4-bit kernel, not FP8.