Stage
Agentic
Multi-agent orchestration without a cluster. Sandboxing, tool use, long horizons — what is and is not possible on a single machine.
The Meta-Program on a DGX Spark — When the Tool You Build With Is an Instance of the Thing You Build
The opener for the Machine-that-Builds-Machines arc. The book describes a meta-program on a SaaS platform; this is the same pattern on one personal box — a pane → hands → engine loop where the spec is the application and the skills are configuration over code.
The Machine Manages Its Own Memory — and the Bug the Mocks Slept Through
Driving the Arena recall layer end-to-end on its own corpus: reindex → score → gate, dispatched through the control plane, recall@5 measured against 44 held-out questions. The first real drain caught a bug eight mock-injected unit tests had slept through — the case for operating the thing you built.
uses fieldkit.memoryfieldkit.arenafieldkit.harnessfieldkit.eval
The Machine Improves Itself — Closed-Loop RLVR on a DGX Spark, Where the Eval Harness Is the Reward
Closed-loop RLVR on one box: an eval→reward→fine-tune loop where the Spark's own verifiers ARE the reward — no learned reward model. The hero finding is defensive: pick the checkpoint on a frozen held-out split, never the training pool, or the loop reports success while it regresses.
uses fieldkit.rlfieldkit.rewardfieldkit.evalfieldkit.lineage
Cost-Routing the Hermes Harness — When Local Stops Being Enough on a DGX Spark
The local 30B-MoE on a Spark is at $0 marginal cost — until it isn't. H6 measures the failure-mode curve: where does local stop being enough, and what does the dollar curve look like when you escalate to OpenRouter only when you have to?
uses fieldkit.harnessfieldkit.eval
The Hermes Vertical Router on a DGX Spark — One Brain Always Warm, Five Specialists Summoned on Demand
Five published Orionfold verticals plus the pinned MoE brain become a router on one Spark — not by parallel inference (the unified-memory envelope forbids that), but by a deterministic keyword classifier that dispatches the prompt and serves the right specialist one-at-a-time.
uses fieldkit.harness
Picking the Hermes Brain on a DGX Spark — When Throughput Stops Being the Answer
The Hermes serving-lane bakeoff couldn't pick a winner: all five lanes cleared the tool-call format bar. A graded brain-quality rubric breaks the tie — and shows the fastest serving lane is also the better agent, by a margin throughput could never have measured.
uses fieldkit.evalfieldkit.harness
Hermes Drives the Spark via fieldkit-as-MCP — The Agent That Operates Its Own Machine
The keystone of the Harnesses series: expose a curated slice of fieldkit as MCP tools and the local Hermes agent can measure, quantize, publish, and retrieve on the box itself. The gate is a real llama-bench run the agent drove end-to-end — 0% tool-call format error, no API key.
uses fieldkit.harnessfieldkit.capabilitiesfieldkit.quantfieldkit.publishfieldkit.rag
Hardening the Hermes Harness on a DGX Spark — The Box Contains It, You Don't Trust the Model
Before you leave a tool-wielding agent running on your desk, harden it. One pure function turns Hermes' permissive defaults into a desk-grade posture, then a scripted hostile-tool-call test proves it: egress denied at the sandbox, secrets in .env only, the config surviving a restart.
uses fieldkit.harness
The Hermes Harness on a DGX Spark — A Local Cockpit That Holds Tools, With No API Key
Installing the Hermes agent harness on a DGX Spark and running the first local agent turn against the cached Nemotron-Nano-9B-v2 NIM — reliable tool calls, no API key, no cloud hop. The defensible angle is NIM-first; everyone else's Spark Hermes write-up leads with Ollama.
uses fieldkit.nimfieldkit.capabilitiesfieldkit.harness
Adaptive Turn Clipping on a Single Spark — A²TGPO, Studied from Source
A²TGPO redesigns how Information Gain feeds GRPO: turn-group normalization, variance-rescaled accumulation, and adaptive turn-level clipping. The paper's release is the code; the Spark's contribution is the lineage primitive that records what each trial learned.
uses fieldkit.capabilitiesfieldkit.trainingfieldkit.lineage
Reading the Lineage Primitive — cxcscmu Auto-Research, Studied from release_artifacts
cxcscmu's own lineage_on vs lineage_off ablation closes the case: same agent, same trial budget, same prompt template — only the rendered lineage block differs, and the run with lineage produces 5.3× more keeps and 3.2× less wall-time waste. This piece extracts that primitive into fieldkit.lineage.
uses fieldkit.capabilitiesfieldkit.trainingfieldkit.lineage
T²PO on Spark — When the Training Pool Says 28/32 and Held-out Says 9/158
T²PO's two deltas on the Phase 6 ClawGym harness: mean turns 5.00 → 4.61, task_complete 154/158, but the per-assertion ceiling stays flat at 47.7%. The strongest training-side step (45) is the worst held-out checkpoint — pool saturation lies on a single Spark.
uses fieldkit.capabilitiesfieldkit.evalfieldkit.training
ClawGym GRPO on Spark — Closing the Loop the SFT Adapter Couldn't
Phase 5 SFT taught the agent to keep working but never to stop. 34 GRPO steps with a shaped reward unlearn the failure mode — same model, same base, same LoRA-init, but task_complete climbs 0/158 → 154/158, mean turns drop 12 → 5, and per-assertion still inches up +3.1 pp.
ClawGym on Spark — A 7B Base, A LoRA Adapter, and the +15 pp the Adapter Earned
ClawGym shipped only a .github profile, so we built the substrate ourselves — persona task synth, sandbox harness, 200-task corpus, LoRA SFT, matched-base eval. The adapter earns +3.8 pp task pass and +15.0 pp per-assertion against its own base. The diagnostic is the lift.
uses fieldkit.nim
AutoResearchBench on Spark — Two NIMs, One Bench, Two Failure Modes
Two Spark-tuned NIMs run AutoResearchBench's three Deep-Research example questions. Llama-3.1-8B crashes by turn 5-6 on its 8K context; Nemotron-Nano-9B-v2 finishes cleanly at 128K. Both score 0% Accuracy@1 — for completely different reasons.
uses fieldkit.nimfieldkit.evalfieldkit.capabilities
Was the Agent Researching, or Flailing? An Observability Pass on the Trajectory
A8 said the LoRA mode-collapsed because the trajectory was thin. This puts numbers on it: 6 of 13 knobs ever touched, 72% of proposals repeated a prior pair, and the proposer's k=5 history window is the structural cause.
Distilling the Architect — A 3B LoRA Trained on the Agent's Own Trajectory
A4's 50-iter trajectory becomes training data for a Qwen2.5-3B LoRA proposer. Holding out 8 iters, the 3B mode-collapses onto d_model=768 (the trajectory's most-frequent keep) and matches 0 / 8 exact; the 8B at T=0.5 matches 4 / 8 of its own past picks.
Derisking the Cloud Pretrain — How a $5K Spark Saves $50K on H100 Rentals
The Spark is too small for a serious pretrain — but it's the right size for the recipe-search that precedes one. Cull 100 candidate architectures down to 3 on one Spark for ~$1 of electricity, then book the cloud node knowing what to train. The expected savings per campaign run into the thousands.
What the Agent Actually Built — Five Articles in Plain English, and Why You Probably Don't Want to Train From Scratch
Five technical articles in one day built an unattended AI research loop on a desk for $0.02 of electricity. The plain-English readout: what the agent built (not a usable model), what it changes for one person, and a four-tier roadmap from LoRA in minutes to from-scratch in weeks.
The Autoresearch Loop — 50 Iterations of an LLM Editing Its Own Trainer Overnight
NIM Llama 3.1 8B drives a structured-perturbation agent loop against a 354M GPT pretrain. 50 iterations, 73.4 min wall, 0.07 kWh of electricity. 8 keeps, 42 reverts, 0 rail blocks, 0 crashes. Best result: val_bpb 10.8534, +0.93% over baseline at d_model=768.
Guardrails Before the Agent Edits — Code-Edit Policy as a Programmatic Funnel
Five programmatic rails between the Autoresearch agent's proposal and any mutation of train.py — schema, menu, range, cross-constraint, diff lint. 27 adversarial test cases: block recall 1.0, clean pass 1.0, every rail attribution correct. Zero LLM-as-judge calls.
Second Brain as a Tool — Wrapping the RAG Stack in MCP for Claude Code
Closing the Second Brain arc. Four MCP tools wrap the RAG chain — embed, retrieve, optionally rerank, generate — and any Claude Code session anywhere on the box becomes a grounded research client. 200 lines of Python, one launcher, one .mcp.json entry.
The Sandbox Tax That Wasn't — NemoClaw vs OpenClaw on One DGX Spark
I ran NemoClaw's sandboxed agent stack and the host Ollama-OpenClaw CLI side by side on one DGX Spark with the same 123B Nemotron model. The sandbox overhead I went looking for is real but modest (~2× raw inference); the real tax is onboarding, and NemoClaw paid it at install time.
Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction — Spark reproduction notes
Reproducing DCI-Agent-Lite on a DGX Spark — NIM-served 8B agent + ripgrep + filesystem corpus, no embedder or vector DB; extracts the operator vocabulary as `fieldkit.rag.operators` and quantifies how much of the existing pgvector + reranker stack DCI lets you delete.
Field-Fixing the Hermes Harness on a DGX Spark — When the NIM Won't Stream Tool Calls, and Other Rough Edges
Fifth in the Harnesses series: the field fixes that take a fresh Hermes agent on a local NIM from 'mostly works' to 'just works.' Leads with the one that bit hardest — the Spark NIM ships a non-streaming tool parser, fixed by bind-mounting NVIDIA's own streaming parser.
uses fieldkit.harness
Governed Routing With Receipts — When the Local Lane Consults the Frontier, and What It Costs
The Advisor's router is deterministic and observables-only: it escalates on detectable failure signals — a citation outside the retrieved set, a rank-sanity anomaly — never on vibes. Route bakeoffs at $0 and $0.0033, a no-egress gate for private state, and a receipt a script re-verifies.
Heterogeneous Scientific Foundation Model Collaboration — Spark reproduction notes
Wrap a domain foundation model (Pangu-Weather) as a Triton tool, drive it from a NIM-served Llama 3.1 8B planner via NemoClaw, and show when specialist routing beats language-only reasoning — all inside the Spark 128 GB envelope.
SkillOS: Learning Skill Curation for Self-Evolving Agents — Spark reproduction notes
Reproducing the SkillOS curator/executor split on a DGX Spark — both Qwen3-8B (frozen executor + LoRA-trained curator) over a markdown SkillRepo with BM25 retrieval, then extracting the pattern into `fieldkit.skills`.