Agentic

Article №54 observability Foundation 03 Jun 2026 ~4 hours end-to-end — bring up the cockpit, drive a reindex + two RAG-evals through the control plane, score 44 questions, and ship the artifact

The Meta-Program on a DGX Spark — When the Tool You Build With Is an Instance of the Thing You Build

The opener for the Machine-that-Builds-Machines arc. The book describes a meta-program on a SaaS platform; this is the same pattern on one personal box — a pane → hands → engine loop where the spec is the application and the skills are configuration over code.

Second Brain

The Machine Manages Its Own Memory — and the Bug the Mocks Slept Through

Driving the Arena recall layer end-to-end on its own corpus: reindex → score → gate, dispatched through the control plane, recall@5 measured against 44 held-out questions. The first real drain caught a bug eight mock-injected unit tests had slept through — the case for operating the thing you built.

uses fieldkit.memory, fieldkit.arena, fieldkit.harness, fieldkit.eval
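For readers who want the score → gate step in concrete terms, here is a minimal sketch of recall@5 averaged over a question set. The function names, the data shape, and the gate threshold are assumptions for illustration; this is not the `fieldkit.eval` API.

```python
from typing import Iterable, List, Sequence, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Iterable[str], k: int = 5) -> float:
    """Fraction of the relevant ids that appear among the top-k retrieved ids."""
    relevant_set = set(relevant)
    if not relevant_set:
        raise ValueError("each question needs at least one relevant id")
    return len(relevant_set & set(retrieved[:k])) / len(relevant_set)


def mean_recall_at_k(runs: List[Tuple[Sequence[str], Iterable[str]]], k: int = 5) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, one pair per question."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs]
    return sum(scores) / len(scores)


def gate(score: float, threshold: float) -> bool:
    """Pass the gate only when the measured score meets the (hypothetical) threshold."""
    return score >= threshold
```

In a run like the one described, `runs` would hold one pair per held-out question, and `gate` would decide whether the reindex is accepted.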

Article №53 fine-tuning Foundation 03 Jun 2026 ~16 min read — a synthesis of a proven run plus the engine it became

Article №51 agentic Foundation 28 May 2026 ~4 hours including the OpenRouter bakeoff + harness publish

The Machine Improves Itself — Closed-Loop RLVR on a DGX Spark, Where the Eval Harness Is the Reward

Closed-loop RLVR on one box: an eval→reward→fine-tune loop where the Spark's own verifiers ARE the reward — no learned reward model. The hero finding is defensive: pick the checkpoint on a frozen held-out split, never the training pool, or the loop reports success while it regresses.

uses fieldkit.rl, fieldkit.reward, fieldkit.eval, fieldkit.lineage
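The defensive rule above can be sketched in a few lines: rank checkpoints by the frozen held-out score and ignore the training-pool score when choosing. The `Checkpoint` fields and the numbers in the usage note are hypothetical, not values from the article.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Checkpoint:
    step: int
    train_pool_score: float  # reward on the pool the loop trains against
    held_out_score: float    # score on the frozen split, never trained on


def pick_checkpoint(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Select by held-out score only; the training-pool score is deliberately ignored."""
    if not checkpoints:
        raise ValueError("no checkpoints to choose from")
    return max(checkpoints, key=lambda c: c.held_out_score)
```

With illustrative checkpoints such as `Checkpoint(10, 0.60, 0.41)` and `Checkpoint(40, 0.90, 0.22)`, the function returns the step-10 checkpoint even though step 40 looks stronger on the training pool.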

Article №50 agentic Foundation 28 May 2026 ~3 hours including bakeoff + harness publish

Cost-Routing the Hermes Harness — When Local Stops Being Enough on a DGX Spark

The local 30B-MoE on a Spark is at $0 marginal cost — until it isn't. H6 measures the failure-mode curve: where does local stop being enough, and what does the dollar curve look like when you escalate to OpenRouter only when you have to?

uses fieldkit.harness, fieldkit.eval

Article №49 agentic NIM 28 May 2026 ~6 hours across three serving lanes, N=5 attempts per prompt

The Hermes Vertical Router on a DGX Spark — One Brain Always Warm, Five Specialists Summoned on Demand

Five published Orionfold verticals plus the pinned MoE brain become a router on one Spark — not by parallel inference (the unified-memory envelope forbids that), but by a deterministic keyword classifier that dispatches the prompt and serves the right specialist one-at-a-time.

uses fieldkit.harness
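A deterministic keyword classifier of the kind described can be sketched as a fixed, ordered table lookup with a fallback to the always-warm brain. The specialist names and keywords below are invented for illustration; the article's actual verticals and rules are not shown on this page.

```python
import re
from typing import Dict, Tuple

# Hypothetical routing table: the first matching entry wins, in declaration order.
ROUTES: Dict[str, Tuple[str, ...]] = {
    "legal": ("contract", "clause", "liability"),
    "medical": ("dosage", "symptom", "diagnosis"),
}
DEFAULT_LANE = "brain"  # the pinned generalist, used when no keyword matches


def route(prompt: str) -> str:
    """Return the one specialist to serve for this prompt, or the default lane."""
    tokens = set(re.findall(r"[a-z0-9]+", prompt.lower()))
    for specialist, keywords in ROUTES.items():
        if tokens.intersection(keywords):
            return specialist
    return DEFAULT_LANE
```

Because the table is static and checked in order, the same prompt always lands on the same lane, which keeps the dispatch auditable and serves only one specialist at a time.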

Article №48 agentic Foundation 26 May 2026 ~3 hours, including the live tool-call gate against a local NIM

Picking the Hermes Brain on a DGX Spark — When Throughput Stops Being the Answer

The Hermes serving-lane bakeoff couldn't pick a winner: all five lanes cleared the tool-call format bar. A graded brain-quality rubric breaks the tie — and shows the fastest serving lane is also the better agent, by a margin throughput could never have measured.

uses fieldkit.eval, fieldkit.harness

Article №47 agentic Foundation 26 May 2026 ~2 hours, most of it the hostile-tool-call containment battery

Hermes Drives the Spark via fieldkit-as-MCP — The Agent That Operates Its Own Machine

The keystone of the Harnesses series: expose a curated slice of fieldkit as MCP tools and the local Hermes agent can measure, quantize, publish, and retrieve on the box itself. The gate is a real llama-bench run the agent drove end-to-end — 0% tool-call format error, no API key.

uses fieldkit.harness, fieldkit.capabilities, fieldkit.quant, fieldkit.publish, fieldkit.rag

Article №45 agentic NIM 26 May 2026 ~1 hour, most of it the NIM's first cold-start

Hardening the Hermes Harness on a DGX Spark — The Box Contains It, You Don't Trust the Model

Before you leave a tool-wielding agent running on your desk, harden it. One pure function turns Hermes' permissive defaults into a desk-grade posture, then a scripted hostile-tool-call test proves it: egress denied at the sandbox, secrets in .env only, the config surviving a restart.

uses fieldkit.harness

Article №36 fine-tuning NeMo 11 May 2026 ~30 min read

The Hermes Harness on a DGX Spark — A Local Cockpit That Holds Tools, With No API Key

Installing the Hermes agent harness on a DGX Spark and running the first local agent turn against the cached Nemotron-Nano-9B-v2 NIM — reliable tool calls, no API key, no cloud hop. The defensible angle is NIM-first; everyone else's Spark Hermes write-up leads with Ollama.

uses fieldkit.nim, fieldkit.capabilities, fieldkit.harness

Article №35 agentic NeMo 10 May 2026 ~28 min read

Adaptive Turn Clipping on a Single Spark — A²TGPO, Studied from Source

A²TGPO redesigns how Information Gain feeds GRPO: turn-group normalization, variance-rescaled accumulation, and adaptive turn-level clipping. The paper's release is the code; the Spark's contribution is the lineage primitive that records what each trial learned.

uses fieldkit.capabilities, fieldkit.training, fieldkit.lineage

Article №34 fine-tuning NeMo 09 May 2026 ~18.5 hours wall (50 T²PO steps + three evals)

Reading the Lineage Primitive — cxcscmu Auto-Research, Studied from release_artifacts

cxcscmu's own lineage_on vs lineage_off ablation closes the case: same agent, same trial budget, same prompt template — only the rendered lineage block differs, and the run with lineage produces 5.3× more keeps and 3.2× less wall-time waste. This piece extracts that primitive into fieldkit.lineage.

uses fieldkit.capabilities, fieldkit.training, fieldkit.lineage

Article №33 fine-tuning NeMo 05 May 2026 ~9 hours wall (34 GRPO steps + two evals)

T²PO on Spark — When the Training Pool Says 28/32 and Held-out Says 9/158

T²PO's two deltas on the Phase 6 ClawGym harness: mean turns 5.00 → 4.61, task_complete 154/158, but the per-assertion ceiling stays flat at 47.7%. The strongest training-side step (45) is the worst held-out checkpoint — pool saturation lies on a single Spark.

uses fieldkit.capabilities, fieldkit.eval, fieldkit.training

Article №32 fine-tuning NeMo 05 May 2026 ~3 days end-to-end (mostly waiting on rollouts)

ClawGym GRPO on Spark — Closing the Loop the SFT Adapter Couldn't

Phase 5 SFT taught the agent to keep working but never to stop. 34 GRPO steps with a shaped reward unlearn the failure mode — same model, same base, same LoRA-init, but task_complete climbs 0/158 → 154/158, mean turns drop 12 → 5, and per-assertion still inches up +3.1 pp.

Article №28 observability NIM 02 May 2026 ~3 hours — 30 min plumbing, ~20 min for the runs themselves, the rest is reading what they show

ClawGym on Spark — A 7B Base, A LoRA Adapter, and the +15 pp the Adapter Earned

ClawGym shipped only a .github profile, so we built the substrate ourselves — persona task synth, sandbox harness, 200-task corpus, LoRA SFT, matched-base eval. The adapter earns +3.8 pp task pass and +15.0 pp per-assertion against its own base. The diagnostic is the lift.

uses fieldkit.nim

Article №26 observability NIM Llama 3.1 8B 01 May 2026 ~2 hours wall — analysis runs in seconds, the rest is reading + writing

AutoResearchBench on Spark — Two NIMs, One Bench, Two Failure Modes

Two Spark-tuned NIMs run AutoResearchBench's three Deep-Research example questions. Llama-3.1-8B crashes by turn 5-6 on its 8K context; Nemotron-Nano-9B-v2 finishes cleanly at 128K. Both score 0% Accuracy@1 — for completely different reasons.

uses fieldkit.nim, fieldkit.eval, fieldkit.capabilities

Article №25 fine-tuning NeMo Customizer 01 May 2026 ~2 hours wall — 4 min LoRA training, 4 min race, the rest writing

Was the Agent Researching, or Flailing? An Observability Pass on the Trajectory

A8 said the LoRA mode-collapsed because the trajectory was thin. This puts numbers on it: 6 of 13 knobs ever touched, 72% of proposals repeated a prior pair, and the proposer's k=5 history window is the structural cause.

Article №24 training Foundation 30 Apr 2026 ~30 minute read · math + economics, no GPU required

Distilling the Architect — A 3B LoRA Trained on the Agent's Own Trajectory

A4's 50-iter trajectory becomes training data for a Qwen2.5-3B LoRA proposer. Holding out 8 iters, the 3B mode-collapses onto d_model=768 (the trajectory's most-frequent keep) and matches 0 / 8 exact; the 8B at T=0.5 matches 4 / 8 of its own past picks.

Looking Beyond Spark

Derisking the Cloud Pretrain — How a $5K Spark Saves $50K on H100 Rentals

The Spark is too small for a serious pretrain — but it's the right size for the recipe-search that precedes one. Cull 100 candidate architectures down to 3 on one Spark for ~$1 of electricity, then book the cloud node knowing what to train. The expected savings per campaign run into the thousands.

Article №23 foundations Foundation 25 Apr 2026 ~15 minute read · no GPU required

Looking Beyond Spark

What the Agent Actually Built — Five Articles in Plain English, and Why You Probably Don't Want to Train From Scratch

Five technical articles in one day built an unattended AI research loop on a desk for $0.02 of electricity. The plain-English readout: what the agent built (not a usable model), what it changes for one person, and a four-tier roadmap from LoRA in minutes to from-scratch in weeks.

Article №22 agentic NeMo 25 Apr 2026 ~3 hours — 90 min to scaffold the loop, 73 min for the unattended run, the rest is reading the trajectory

Article №21 agentic NeMo Guardrails 25 Apr 2026 ~2 hours — 30 min for the perturbation menu + structured proposal schema, 60 min for the 5 rails + 27-case adversarial bench, 30 min to write up

The Autoresearch Loop — 50 Iterations of an LLM Editing Its Own Trainer Overnight

NIM Llama 3.1 8B drives a structured-perturbation agent loop against a 354M GPT pretrain. 50 iterations, 73.4 min wall, 0.07 kWh of electricity. 8 keeps, 42 reverts, 0 rail blocks, 0 crashes. Best result: val_bpb 10.8534, +0.93% over baseline at d_model=768.

Article №17 agentic NIM 24 Apr 2026 ~90 minutes — 30 min to design the tool surface, 30 min to wire FastMCP + pgvector, 15 min to register with Claude Code, 15 min for the demo and trace

Guardrails Before the Agent Edits — Code-Edit Policy as a Programmatic Funnel

Five programmatic rails between the Autoresearch agent's proposal and any mutation of train.py — schema, menu, range, cross-constraint, diff lint. 27 adversarial test cases: block recall 1.0, clean pass 1.0, every rail attribution correct. Zero LLM-as-judge calls.
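One way to picture the funnel is as five ordered checks, where the first failing rail blocks the proposal and is reported as the attribution. This is a sketch under stated assumptions: the proposal shape (`knob`, `value`, `diff`), the menu, the ranges, and the divisibility constraint are hypothetical stand-ins, not the article's actual rules.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical menu of editable knobs with inclusive (low, high) ranges.
MENU: Dict[str, Tuple[int, int]] = {"d_model": (256, 1024), "n_head": (4, 16)}

Rail = Callable[[dict, dict], Optional[str]]  # returns a reason string when it blocks


def schema_rail(p, cfg) -> Optional[str]:
    ok = (
        isinstance(p, dict)
        and set(p) == {"knob", "value", "diff"}
        and isinstance(p["knob"], str)
        and type(p["value"]) is int
        and isinstance(p["diff"], list)
    )
    return None if ok else "proposal does not match the expected shape"


def menu_rail(p, cfg) -> Optional[str]:
    return None if p["knob"] in MENU else "knob is not on the menu"


def range_rail(p, cfg) -> Optional[str]:
    low, high = MENU[p["knob"]]
    return None if low <= p["value"] <= high else "value is out of range"


def cross_constraint_rail(p, cfg) -> Optional[str]:
    merged = {**cfg, p["knob"]: p["value"]}
    ok = merged["d_model"] % merged["n_head"] == 0
    return None if ok else "d_model must be divisible by n_head"


def diff_lint_rail(p, cfg) -> Optional[str]:
    # The diff may contain exactly one line: the assignment of the proposed knob.
    expected = [f"{p['knob']} = {p['value']}"]
    return None if p["diff"] == expected else "diff touches more than the knob"


RAILS: List[Tuple[str, Rail]] = [
    ("schema", schema_rail),
    ("menu", menu_rail),
    ("range", range_rail),
    ("cross_constraint", cross_constraint_rail),
    ("diff_lint", diff_lint_rail),
]


def check(proposal, cfg: dict) -> Tuple[bool, Optional[str]]:
    """Run rails in order; return (allowed, name of the rail that blocked, if any)."""
    for name, rail in RAILS:
        if rail(proposal, cfg) is not None:
            return False, name
    return True, None
```

Ordering matters here: each later rail assumes the earlier ones passed, which is what lets an adversarial bench assert not only that a case is blocked but which rail blocked it.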

Second Brain

Second Brain as a Tool — Wrapping the RAG Stack in MCP for Claude Code

Closing the Second Brain arc. Four MCP tools wrap the RAG chain — embed, retrieve, optionally rerank, generate — and any Claude Code session anywhere on the box becomes a grounded research client. 200 lines of Python, one launcher, one .mcp.json entry.

Article №04 agentic NemoClaw 21 Apr 2026 ~2 hours after prerequisites

The Sandbox Tax That Wasn't — NemoClaw vs OpenClaw on One DGX Spark

I ran NemoClaw's sandboxed agent stack and the host Ollama-OpenClaw CLI side by side on one DGX Spark with the same 123B Nemotron model. The sandbox overhead I went looking for is real but modest (~2× raw inference); the real tax is onboarding, and NemoClaw paid it at install time.

Upcoming agentic NIM ~30 min read

Second Brain

Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction — Spark reproduction notes

Reproducing DCI-Agent-Lite on a DGX Spark — NIM-served 8B agent + ripgrep + filesystem corpus, no embedder or vector DB; extracts the operator vocabulary as `fieldkit.rag.operators` and quantifies how much of the existing pgvector + reranker stack DCI lets you delete.

Upcoming agentic Foundation planned ~2 hours

Upcoming agentic Foundation planned ~14 min read

Field-Fixing the Hermes Harness on a DGX Spark — When the NIM Won't Stream Tool Calls, and Other Rough Edges

Fifth in the Harnesses series: the field fixes that take a fresh Hermes agent on a local NIM from 'mostly works' to 'just works.' Leads with the one that bit hardest — the Spark NIM ships a non-streaming tool parser, fixed by bind-mounting NVIDIA's own streaming parser.

uses fieldkit.harness

Upcoming agentic NemoClaw ~30 min read

Governed Routing With Receipts — When the Local Lane Consults the Frontier, and What It Costs

The Advisor's router is deterministic and observables-only: it escalates on detectable failure signals — a citation outside the retrieved set, a rank-sanity anomaly — never on vibes. Route bakeoffs at $0 and $0.0033, a no-egress gate for private state, and a receipt a script re-verifies.
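An observables-only escalation rule can be sketched as a pure function over what the local lane produced. The two signals mirror the ones named above; the concrete definition of a rank-sanity anomaly used here (retrieval scores not in descending order) and the lane names are assumptions for illustration.

```python
from typing import List, Sequence


def escalation_reasons(
    cited_ids: Sequence[str],
    retrieved_ids: Sequence[str],
    scores: Sequence[float],
) -> List[str]:
    """Collect detectable failure signals; an empty list means stay local."""
    reasons: List[str] = []
    stray = sorted(set(cited_ids) - set(retrieved_ids))
    if stray:
        reasons.append(f"citation outside retrieved set: {stray}")
    # Assumed anomaly check: ranked results should carry non-increasing scores.
    if any(earlier < later for earlier, later in zip(scores, scores[1:])):
        reasons.append("rank-sanity anomaly: scores not in descending order")
    return reasons


def choose_lane(cited_ids, retrieved_ids, scores) -> str:
    """Escalate to the frontier lane only when at least one signal fires."""
    return "frontier" if escalation_reasons(cited_ids, retrieved_ids, scores) else "local"
```

Returning the reasons rather than a bare boolean is what makes a receipt possible: the same inputs can be replayed by a script to confirm why a route was taken.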

Upcoming agentic NemoClaw ~30 min read

Heterogeneous Scientific Foundation Model Collaboration — Spark reproduction notes

Wrap a domain foundation model (Pangu-Weather) as a Triton tool, drive it from a NIM-served Llama 3.1 8B planner via NemoClaw, and show when specialist routing beats language-only reasoning — all inside the Spark 128 GB envelope.