Tag
#agentic
Articles tagged "agentic" — 24 entries.
The Meta-Program on a DGX Spark — When the Tool You Build With Is an Instance of the Thing You Build
The opener for the Machine-that-Builds-Machines arc. The book describes a meta-program on a SaaS platform; this is the same pattern on one personal box — a pane → hands → engine loop where the spec is the application and the skills are configuration over code.
Cost-Routing the Hermes Harness — When Local Stops Being Enough on a DGX Spark
The local 30B-MoE on a Spark is at $0 marginal cost — until it isn't. H6 measures the failure-mode curve: where does local stop being enough, and what does the dollar curve look like when you escalate to OpenRouter only when you have to?
uses fieldkit.harness, fieldkit.eval
The Hermes Vertical Router on a DGX Spark — One Brain Always Warm, Five Specialists Summoned on Demand
Five published Orionfold verticals plus the pinned MoE brain become a router on one Spark — not by parallel inference (the unified-memory envelope forbids that), but by a deterministic keyword classifier that dispatches the prompt and serves the right specialist one-at-a-time.
uses fieldkit.harness
Picking the Hermes Brain on a DGX Spark — When Throughput Stops Being the Answer
The Hermes serving-lane bakeoff couldn't pick a winner: all five lanes cleared the tool-call format bar. A graded brain-quality rubric breaks the tie — and shows the fastest serving lane is also the better agent, by a margin throughput could never have measured.
uses fieldkit.eval, fieldkit.harness
Hermes Drives the Spark via fieldkit-as-MCP — The Agent That Operates Its Own Machine
The keystone of the Harnesses series: expose a curated slice of fieldkit as MCP tools and the local Hermes agent can measure, quantize, publish, and retrieve on the box itself. The gate is a real llama-bench run the agent drove end-to-end — 0% tool-call format error, no API key.
uses fieldkit.harness, fieldkit.capabilities, fieldkit.quant, fieldkit.publish, fieldkit.rag
Hardening the Hermes Harness on a DGX Spark — The Box Contains It, You Don't Trust the Model
Before you leave a tool-wielding agent running on your desk, harden it. One pure function turns Hermes' permissive defaults into a desk-grade posture, then a scripted hostile-tool-call test proves it: egress denied at the sandbox, secrets in .env only, the config surviving a restart.
uses fieldkit.harness
The Hermes Harness on a DGX Spark — A Local Cockpit That Holds Tools, With No API Key
Installing the Hermes agent harness on a DGX Spark and running the first local agent turn against the cached Nemotron-Nano-9B-v2 NIM — reliable tool calls, no API key, no cloud hop. The defensible angle is NIM-first; everyone else's Spark Hermes write-up leads with Ollama.
uses fieldkit.nim, fieldkit.capabilities, fieldkit.harness
Adaptive Turn Clipping on a Single Spark — A²TGPO, Studied from Source
A²TGPO redesigns how Information Gain feeds GRPO: turn-group normalization, variance-rescaled accumulation, and adaptive turn-level clipping. The paper's release is the code; the Spark's contribution is the lineage primitive that records what each trial learned.
uses fieldkit.capabilities, fieldkit.training, fieldkit.lineage
Reading the Lineage Primitive — cxcscmu Auto-Research, Studied from release_artifacts
cxcscmu's own lineage_on vs lineage_off ablation closes the case: same agent, same trial budget, same prompt template — only the rendered lineage block differs, and the run with lineage produces 5.3× more keeps and 3.2× less wall-time waste. This piece extracts that primitive into fieldkit.lineage.
uses fieldkit.capabilities, fieldkit.training, fieldkit.lineage
T²PO on Spark — When the Training Pool Says 28/32 and Held-out Says 9/158
T²PO's two deltas on the Phase 6 ClawGym harness: mean turns 5.00 → 4.61, task_complete 154/158, but the per-assertion ceiling stays flat at 47.7%. The strongest training-side step (45) is the worst held-out checkpoint — pool saturation lies on a single Spark.
uses fieldkit.capabilities, fieldkit.eval, fieldkit.training
ClawGym GRPO on Spark — Closing the Loop the SFT Adapter Couldn't
Phase 5 SFT taught the agent to keep working but never to stop. 34 GRPO steps with a shaped reward unlearn the failure mode — same model, same base, same LoRA-init, but task_complete climbs 0/158 → 154/158, mean turns drop 12 → 5, and per-assertion still inches up +3.1 pp.
ClawGym on Spark — A 7B Base, A LoRA Adapter, and the +15 pp the Adapter Earned
ClawGym shipped only a .github profile, so we built the substrate ourselves — persona task synth, sandbox harness, 200-task corpus, LoRA SFT, matched-base eval. The adapter earns +3.8 pp task pass and +15.0 pp per-assertion against its own base. The diagnostic is the lift.
uses fieldkit.nim
AutoResearchBench on Spark — Two NIMs, One Bench, Two Failure Modes
Two Spark-tuned NIMs run AutoResearchBench's three Deep-Research example questions. Llama-3.1-8B crashes by turn 5-6 on its 8K context; Nemotron-Nano-9B-v2 finishes cleanly at 128K. Both score 0% Accuracy@1 — for completely different reasons.
uses fieldkit.nim, fieldkit.eval, fieldkit.capabilities
Was the Agent Researching, or Flailing? An Observability Pass on the Trajectory
A8 said the LoRA mode-collapsed because the trajectory was thin. This puts numbers on it: 6 of 13 knobs ever touched, 72% of proposals repeated a prior pair, and the proposer's k=5 history window is the structural cause.
The Autoresearch Loop — 50 Iterations of an LLM Editing Its Own Trainer Overnight
NIM Llama 3.1 8B drives a structured-perturbation agent loop against a 354M GPT pretrain. 50 iterations, 73.4 min wall, 0.07 kWh of electricity. 8 keeps, 42 reverts, 0 rail blocks, 0 crashes. Best result: val_bpb 10.8534, +0.93% over baseline at d_model=768.
Guardrails Before the Agent Edits — Code-Edit Policy as a Programmatic Funnel
Five programmatic rails between the Autoresearch agent's proposal and any mutation of train.py — schema, menu, range, cross-constraint, diff lint. 27 adversarial test cases: block recall 1.0, clean pass 1.0, every rail attribution correct. Zero LLM-as-judge calls.
Second Brain as a Tool — Wrapping the RAG Stack in MCP for Claude Code
Closing the Second Brain arc. Four MCP tools wrap the RAG chain — embed, retrieve, optionally rerank, generate — and any Claude Code session anywhere on the box becomes a grounded research client. 200 lines of Python, one launcher, one .mcp.json entry.
The Sandbox Tax That Wasn't — NemoClaw vs OpenClaw on One DGX Spark
I ran NemoClaw's sandboxed agent stack and the host Ollama-OpenClaw CLI side by side on one DGX Spark with the same 123B Nemotron model. The sandbox overhead I went looking for is real but modest (~2× raw inference); the real tax is onboarding, and NemoClaw paid it at install time.
Access First, Models Second — How I Set Up My DGX Spark for Solo AI Work
Most DGX Spark walkthroughs open with CUDA and tokens/sec. This one opens with streaming, AI-pair-programming, sandboxed agents, and browser automation — the access layer. For a solo edge builder, that interaction stack is more load-bearing than the model stack.
Claw-Eval-Live on Spark — Spark reproduction notes
Stand up the Claw-Eval-Live sandboxed-workflow protocol on Spark via NemoClaw + OpenShell, mock the business-service backends, run Llama 8B vs Nemotron 49B with deterministic-trace + LLM-judge grading, and chart where local agents land vs the paper's 66.7 percent ceiling.
Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction — Spark reproduction notes
Reproducing DCI-Agent-Lite on a DGX Spark — NIM-served 8B agent + ripgrep + filesystem corpus, no embedder or vector DB; extracts the operator vocabulary as `fieldkit.rag.operators` and quantifies how much of the existing pgvector + reranker stack DCI lets you delete.
Field-Fixing the Hermes Harness on a DGX Spark — When the NIM Won't Stream Tool Calls, and Other Rough Edges
Fifth in the Harnesses series: the field fixes that take a fresh Hermes agent on a local NIM from 'mostly works' to 'just works.' Leads with the one that bit hardest — the Spark NIM ships a non-streaming tool parser, fixed by bind-mounting NVIDIA's own streaming parser.
uses fieldkit.harness
Heterogeneous Scientific Foundation Model Collaboration — Spark reproduction notes
Wrap a domain foundation model (Pangu-Weather) as a Triton tool, drive it from a NIM-served Llama 3.1 8B planner via NemoClaw, and show when specialist routing beats language-only reasoning — all inside the Spark 128 GB envelope.
SkillOS: Learning Skill Curation for Self-Evolving Agents — Spark reproduction notes
Reproducing the SkillOS curator/executor split on a DGX Spark — both Qwen3-8B (frozen executor + LoRA-trained curator) over a markdown SkillRepo with BM25 retrieval, then extracting the pattern into `fieldkit.skills`.